behavior_version¶
Instead of maintaining different versions explicitly,
RETURNN has a behavior_version parameter to specify
the default behavior of the configuration,
as well as restricting the usage of legacy parameters/options.
This “version” number is not related anyhow to new features or bugfixes, which are always added for all behavior versions. Still, it may change the default parameters of certain features such as of available layers.
Each version change will be explicitly document here,
and the documentation is indented to always reflect
the recommended usage with the latest behavior_version,
and not listing legacy/deprecated parameters.
Version History¶
Behavior version 26 (2026-06-11)¶
DistributeFilesDataset with sharding
(any _num_shards > 1: via distrib_shard_files=True
or via the _num_shards/_shard_index properties):
the sub-epoch dataset no longer inherits the sharding info
(_num_shards/_shard_index) from the parent dataset.
Before, the sub-epoch dataset sharded its seq order a second time,
on top of the file-level sharding,
so every rank consumed only 1/num_shards of its own file shard
(1/num_shards^2 of the data, union 1/num_shards across ranks) –
silent data loss.
The fixed behavior is also available independent of the behavior version
via the new sharding_fix=True dataset option
(a new option, so an older RETURNN rejects it loudly,
rather than silently doing the wrong thing as a special flag value would);
sharding_fix=False keeps the legacy behavior.
See issue #1738.
Behavior version 25 (2026-03-01)¶
RF scatter with custom fill_value,
and related also scatter_mean and scatter_logsumexp:
Indices of padded area could incorrectly influence the result.
This is fixed.
There is also the global config option rf_scatter_use_fixed_masking: bool
to explicitly enable the new behavior.
See issue #1815.
Behavior version 24 (2025-03-02)¶
RF conv and pool padding="same" with striding:
Now, will add padding left/right independent of dimension length,
i.e. also independent of batching.
There is also the global config option rf_use_consistent_same_padding: bool to overwrite this.
See issue #1693.
Behavior version 23 (2025-02-25)¶
RF Conv1d.__init__ and co, conv,
TransposedConv1d.__init__ an co, transposed_conv,
pool, max_pool, pool1d and co,
stft, window default changed:
use_mask: False → True
There is also the global config option rf_use_mask: bool to overwrite the global default
for these and also additionally for BatchNorm.__init__.
See issue #1691.
Behavior version 22 (2025-01-22)¶
Using Tensor.__bool__ is disallowed.
This also means implicit casts to bool, like in if tensor: ....
See issue #1680.
Behavior version 21 (2024-04-25)¶
RF pad and TF PadLayer defaults changed:
handle_dynamic_dims: False → True
Behavior version 20 (2024-01-05)¶
RF TransformerDecoder defaults changed:
share_embedding: False → Trueinput_embedding_scale: 1.0 → sqrt(model_dim)input_dropout: 0.0 → dropout
Behavior version 19 (2023-11-13)¶
RF SelfAttention (and derived classes)
and RF dot_attention:
The attention dropout (via att_dropout option)
broadcasted over all but the reduced-time dimension before,
which is very likely not what you want.
Now with behavior version 19, the attention dropout does not broadcast.
You can also explicitly get the new behavior via the global config option
rf_att_dropout_broadcast = False.
Behavior version 18 (2023-09-02)¶
TF WindowLayer returns an optimized dimension order by default.
This is the dimension order which is used anyway internally.
The old behavior was to reshuffle the dim order to the original input order.
There should not be any reason to use the old behavior
(please report it if you think otherwise),
so the flag to control this is considered internal (_use_opt_dim_order).
Behavior version 17 (2023-04-19)¶
TF ZoneoutLSTMCell used the wrong output,
which was different from h
(it was actually the original output without zoneout),
so it was not as specified in the Zoneout paper,
and likely suboptimal.
A new flag use_zoneout_output was introduced
to switch between both behaviors.
Setting it to True enables the correct behavior
and makes it consistent with the paper.
Setting it to False enables the old incorrect behavior.
With behaviour version 17,
the default changed to use_zoneout_output=True.
If you want to get the old behavior with a new behavior version,
just set use_zoneout_output=False.
See issue #1313.
Behavior version 16 (2022-11-11)¶
Dim.is_equal is more restrictive and does not give equality
for different user-generated tags,
or also when comparing user-generated to auto-generated tags.
This should rarely have an effect for you.
For TF layers:
It might break when you mix n_out and then later also have a different
own dim tag for the same dim.
In that case, they will not match because the tag is different.
In such cases, just make use of dim tags consistently, i.e. use out_dim,
or specify the dim tag directly in a shape.
Behavior version 15 (2022-11-08)¶
The TFNetwork.global_train_step should deterministically
always return the initial value of the current step.
Before, it could non-deterministically return the current step or the next step.
This has an influence on any code which makes use of it.
The biggest effect is on gradient accumulation
where the old non-deterministic behavior likely was wrong.
See issue #1205.
Behavior version 14 (2022-10-19)¶
The dim matching in TF DotLayer is now more strict
for the case that var1 and var2 are not provided,
to figure out the common dims.
If this causes problems to you,
e.g. because you are not using dim tags consistently,
then just specify var1 and var2 explicitly.
See issue #1154.
Behavior version 13 (2022-10-13)¶
This enables some extra checks in the TF RecLayer which break some old configs,
where the old configs where actually broken,
but those broken parts did not play a role for the training
and thus it did not matter.
However, we don’t want to allow such broken configs anymore.
More specifically, an optimized-out output sub-layer of a RecLayer
must have the same time dim as the RecLayer itself.
For some specific transducer configs, we have this problem
(example).
This behavior version might also require
that the dim tags of extern_data are properly defined.
See issue #1140.
Behavior version 12 (2022-01-06)¶
The TF batch norm default settings have been changed. The old settings did not make much sense and almost always lead to unwanted behavior.
Specifically, the changes are:
momentum: 0.99 → 0.1update_sample_only_in_training: False → Truedelay_sample_update: False → Trueparam_version: 0 → 2 (see #898)masked_time: True → must be specified explicitly
See issue #522.
Behavior version 11 (2021-12-16)¶
Broadcasting dims no longer match in TF CombineLayer and others.
This was never needed, instead broadcasting happens in RETURNN automatically to non-existing dims.
To fix this, do not add any broadcasting dims.
See issue #666.
Behavior version 10 (2021-12-07)¶
TF ConvLayer use with_bias=True by default.
See issue #787.
Behavior version 9 (2021-12-03)¶
TF ConvLayer, PoolLayer use auto_use_channel_first=True by default.
In principle, nothing should ever change due to this when a config is correct in that nothing depends on the order of axes. However, this is now introduced as a new behavior version because older configs might depend on the order of axes. With the other behavior changes, this is mostly disallowed though, so when you make use of a higher behavior version anyway, this should be safe.
Behavior version 8 (2021-11-30)¶
TF ConvLayer, PoolLayer and TransposedConvLayer
require in_spatial_dims to be specified
when the input has more than one spatial dimension
(which implies that you perform 2D or 3D convolution or pooling).
This is required to make the order of the spatial axes well defined because the input axes could have been reordered in any way before. See issue #594.
Usually, you would use Dim to specify in_spatial_dims.
However, to make the transition easier for this specific new behavior,
you can also use a string description for a dimension.
So example usages look like:
enc_dim = Dim(...)
dec_dim = Dim(...)
in_spatial_dims = (enc_dim, dec_tim)
in_spatial_dims = ("T", "dim:16")
in_spatial_dims = ("stag:encoder", "stag:decoder")
Behavior version 7 (2021-11-29)¶
For TF layers:
Do not allow to specify axes or axis arguments in a way that depends on the order of the axes.
E.g. things like axis="spatial:1" would not be allowed.
To fix this, use dimension tags, i.e. DimensionTag instances.
To fix older configs without too much effort,
you might also want to use "stag:<name>" or "stag-single:<idx>:<name>"
or "dim:<static-dim>".
Behavior version 6 (2021-11-27)¶
TF MergeDimsLayer uses keep_order=True and does not allow keep_order=False.
There never should be a reason to use keep_order=False anyway.
If you have that, just remove it.
If that causes any problems, there is probably some other issue in your config.
See issue #654.
Behavior version 5 (2021-11-26)¶
For TF layers:
Any axis or axes argument in layers does not allow int values anymore.
Instead, use either a str like "F" or "stag:..."
or use a DimensionTag instance.
See issue #773.
Behavior version 4 (2021-11-23)¶
For TF layers: Broadcasting in all inputs simultaneously in layers and other ops is not allowed anymore by default. In all inputs simultaneously means that there is no input which has all common dimensions.
Layers can explicitly allow this by specifying out_shape.
In case you stumble upon this, specify out_shape in the layer.
See validate_broadcast_all_sources()
and issue #691.
Behavior version 3 (2021-11-08)¶
TF DotLayer: disallow int axes descriptions, remove and change defaults.
Change -1 to e.g. "static:-1" or "F".
Change -2 to e.g. "dynamic:0" or "T" or "stag:..." or dim_tag.
See issue #627.
Behavior version 2 (2021-08-27)¶
Disallow boolean optimizer specifications such as adam = True
in favor of using optimizer = {"class": "adam", ...}
See issue #512.
Behavior version 1 (2021-05-28)¶
For TF layers:
Disallow not specifying "from" in layer definition dictionaries,
thus making use of the hidden default "data" as layer input.
"from" needs to be set explicitly now.
Set it to "data" or "data:data" or some other layer or () (empty).
See issue #519.