`returnn.tf.layers.basic`

Many canonical basic layers.

class returnn.tf.layers.basic.SourceLayer(network, data_key=None, sources=(), **kwargs)

This gives access to some entry from network.extern_data (ExternData).

Parameters:

network (returnn.tf.network.TFNetwork)
data_key (str|None)
sources (tuple)

layer_class: Optional[str] = 'source'

classmethod transform_config_dict(d, network, get_layer)

Parameters:

d (dict[str]) – will modify inplace
network (returnn.tf.network.TFNetwork)
get_layer (((str) -> LayerBase)) – function to get or construct another layer

classmethod get_out_data_from_opts(network, data_key=None, **kwargs)

Parameters:

network (returnn.tf.network.TFNetwork)
data_key (str|None)

Return type:

Data

returnn.tf.layers.basic.concat_sources(src_layers, out_dim=None, allow_broadcast_all_sources=<class 'returnn.util.basic.NotSpecified'>)

Parameters:

src_layers (list[LayerBase])
out_dim (Dim|None)
allow_broadcast_all_sources (bool|NotSpecified)

Returns:

data with placeholders set

Return type:

Data

returnn.tf.layers.basic.get_concat_sources_data_template(src_layers, out_dim=None, allow_broadcast_all_sources=<class 'returnn.util.basic.NotSpecified'>, name=None)

This just creates a template Data instance, without creating any real TF tensors. concat_sources() (and related) are the equivalent functions which would create a Data together with the tensor.

Parameters:

src_layers (Sequence[LayerBase])
out_dim (Dim|None)
allow_broadcast_all_sources (bool|NotSpecified)
name (str|None) – name of the Data

Returns:

Data with no placeholders set. It is always a copy or a new instance, so it is safe to manipulate.

Return type:

Data

returnn.tf.layers.basic.concat_sources_with_opt_dropout(src_layers, out_dim=None, dropout=0, dropout_axis=None, dropout_noise_shape=None, dropout_on_forward=False, allow_broadcast_all_sources=<class 'returnn.util.basic.NotSpecified'>)

Concatenates in the feature dim (see concat_sources()), and then optionally applies dropout.

Parameters:

src_layers (list[LayerBase])
out_dim (Dim|None)
dropout (float) – dropout rate that will be applied if train_flag is set or dropout_on_forward is enabled
dropout_axis (Dim|str|list[Dim|str]|None)
dropout_noise_shape (tuple|list|dict[Dim|str|list[Dim|str]|tuple[Dim|str],int|str|None]|None) – provide 1 for broadcasting or None otherwise for each axis. The default “None” will broadcast across all dynamic axes including the batch axis. Use {“*”: None} to disable broadcasting for all axes.
dropout_on_forward (bool) – apply dropout also during inference
allow_broadcast_all_sources (bool|NotSpecified)

Returns:

data with placeholders set

Return type:

Data

class returnn.tf.layers.basic.CopyLayer(in_dim=None, out_dim=None, extra_deps=(), **kwargs)

This layer does nothing except pass its input through. It is not even a tf.identity: the output refers to the same TF tensor as the input. If multiple sources are provided, they are concatenated in the feature-dim.

Parameters:

in_dim (Dim|None) – just for checking. If this is provided, it will also set the feature_dim to this.
out_dim (Dim|None) – alternative to in_dim. see in_dim doc.
extra_deps (list[LayerBase]) – Just add as an additional dependency, without really using it. This can have an effect though on the search beam, via SelectSearchSourcesLayer. We only have this here for the CopyLayer because the get_out_data_from_opts() must know about it and define the right beam. Also see the option collocate_with, which is different in that it does not add a dependency. Note that this will not be real TF control dependencies, but it simply sets the dependency on the layer. If you want to have a real TF control dependency, use IdentityLayer.

layer_class: Optional[str] = 'copy'
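As a rough illustration of how this layer might appear in a network dict, here is a minimal sketch. It assumes the usual RETURNN convention that "class" selects the layer by its layer_class and "from" names the sources. The layer names "fwd", "bwd", "encoder" and "output" are hypothetical, and the fragment does not define the source layers themselves.

```python
# Hypothetical network dict fragment; all layer names are made up for illustration.
network = {
    # Two sources: their outputs are concatenated in the feature dim.
    "encoder": {"class": "copy", "from": ["fwd", "bwd"]},
    # Single source: the output refers to the same tensor as "encoder".
    "output": {"class": "copy", "from": "encoder"},
}
```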

get_dep_layers()

Return type:

list[LayerBase]

classmethod get_out_data_from_opts(name, sources=(), extra_deps=(), out_type=None, in_dim=None, out_dim=None, n_out=<class 'returnn.util.basic.NotSpecified'>, out_shape=None, **kwargs)

Parameters:

name (str)
sources (list[LayerBase])
extra_deps (list[LayerBase])
out_type (dict[str]|None)
in_dim (Dim|None)
out_dim (Dim|None)
n_out (int|None|NotSpecified)
out_shape (set[Dim|returnn.tf.util.data._MarkedDim]|tuple|list|None)

Return type:

Data

classmethod transform_config_dict(d, network, get_layer)

Parameters:

d (dict[str]) – will modify inplace
network (returnn.tf.network.TFNetwork)
get_layer (((str) -> LayerBase)) – function to get or construct another layer

class returnn.tf.layers.basic.IdentityLayer(sources: List[LayerBase], control_dependencies: Sequence[LayerBase] | None = None, **kwargs)

Wraps tf.identity with potential control dependencies.

The difference to CopyLayer is that this creates a new TF op (tf.identity), which allows for potential control dependencies. This is the whole purpose of this layer.

Usually the arguments, when specified in the network dict, are going through transform_config_dict(), before they are passed to here. See TFNetwork.construct_from_dict().

Parameters:

name (str)
network (returnn.tf.network.TFNetwork)
output (Data) – Set a specific output instead of using get_out_data_from_opts()
n_out (NotSpecified|None|int) – output dim
out_dim (returnn.tensor.Dim|None) – output feature dim tag
out_type (dict[str]) – kwargs for Data class. more explicit than n_out.
out_shape (set[returnn.tensor.Dim|returnn.tf.util.data._MarkedDim]|tuple|list|None) – verifies the output shape (dim tags). See Data.verify_out_shape().
sources (list[LayerBase]) – via self.transform_config_dict()
in_dim (returnn.tensor.Dim|None) – input feature dim tag
target (str|list[str]|None) – if some loss is set, this is the target data-key, i.e. network.extern_data.get_data(target). alternatively, this also can be a layer name.
_target_layers (dict[str,LayerBase]|None) – if target.startswith(“layer:”), then this is target -> layer
size_target (str|None) – like target but this is only used to set our output size in case of training
loss (Loss|None) – via transform_config_dict(). Every layer can have one loss (of type Loss), or no loss. In the net dict, it is specified as a string. In TFNetwork, all losses from all layers will be collected. That is what TFUpdater.Updater will use for training.
reuse_params (ReuseParams|None) – if given, will optionally reuse the params. See self.var_creation_scope(). See also the name_scope option as an alternative.
name_scope (str|None) – If set, uses this custom (relative) name scope. If it starts with a “/”, it will be the absolute name scope. It should not end with a “/”. It can be empty, in which case it will not consume a new name scope. This can also be used for parameter sharing. The default is the layer name in most cases, but this logic is in get_absolute_name_scope_prefix() and TFNetwork.layer_creation_scope().
param_device (str|None) – e.g. “CPU”, etc. any valid name for tf.device. see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/util/device_name_utils.h
L2 (float|None) – for constraints
darc1 (float|None) – for constraints. see Generalization in Deep Learning, https://arxiv.org/abs/1710.05468
spatial_smoothing (float|None) – see returnn.tf.util.basic.spatial_smoothing_energy()
param_variational_noise (float|None) – adds variational noise to the params during training
param_dropout (float|None) – dropout on params (weight dropout) during training
param_dropout_min_ndim (int|None) – if param dropout is enabled, only use it for params whose ndim >= this. E.g. it might make sense to disable it for bias params or scalars, so set param_dropout_min_ndim=2.
updater_opts (dict[str]|None) – accepts similar opts as TFUpdater, e.g. “optimizer”, “learning_rate”, …
is_output_layer (bool|None) – triggers the construction of this layer in the root net. Inside a RecLayer, it triggers the explicit accumulation of all frames. Also see the need_last option.
only_on_eval (bool) – if True, this layer will only be calculated in eval
only_on_search (bool) – if True, this layer will only be calculated when search is done
copy_output_loss_from_source_idx (int|None) – if set, will copy output_loss from this source
batch_norm (bool|dict) – see self.batch_norm()
initial_output (str|float) – used for recurrent layer, see self.get_rec_initial_output()
state – explicitly defines the rec state. initial_state would define the initial state (in the first frame)
need_last (bool) – Inside RecLayer, make sure that we can access the last frame. Similar to is_output_layer, but this is specifically about the last frame, i.e. it does not trigger accumulation.
rec_previous_layer (LayerBase|None) – via the recurrent layer, layer (template) which represents the past of us. You would not explicitly set this in a config. This is set automatically and internally by RecLayer.
encapsulate (bool) – mostly relevant for SubnetworkLayer and similar: If True, all sub layers will be created, and covered in functions like get_rec_initial_extra_outputs(), and the logic in cls_get_sub_network() will not be used. If False, the logic in cls_get_sub_network() will be used.
collocate_with (list[str]|None) – in the rec layer, collocate with the specified other layers
trainable (bool) – whether the parameters of this layer will be trained. Default is True. However, if this is inside a subnetwork, all the parent layers must be set to trainable, otherwise the parameters will not be trainable.
custom_param_importer (str|callable|None) – used by set_param_values_by_dict()
register_as_extern_data (str|None) – registers output in network.extern_data
control_dependencies_on_output (None|((LayerBase)->list[tf.Operation])) – This is mostly to perform some checks after the layer output has been computed, before the layer output is used anywhere else. There is also the IdentityLayer with the option control_dependencies.
debug_print_layer_output (None|bool|dict[str]) – same as global config option but per layer
_name (str) – just for internal construction, should be the same as name
_network (returnn.tf.network.TFNetwork) – just for internal construction, should be the same as network
_src_common_search_choices (None|SearchChoices) – set via SearchChoices.translate_to_common_search_beam()

layer_class: Optional[str] = 'identity'

get_dep_layers() → List[LayerBase]

classmethod get_out_data_from_opts(name: str, sources: List[LayerBase], **kwargs)

classmethod transform_config_dict(d, network, get_layer)

class returnn.tf.layers.basic.ConcatLayer(sources, allow_broadcast=False, out_dim=None, **kwargs)

Concatenates the inputs in specified axes. This generalizes CopyLayer which concatenates in the feature dim.

Parameters:

sources (list[(LayerBase,str|Dim)])
allow_broadcast (bool)
out_dim (Dim|None)

layer_class: Optional[str] = 'concat'

classmethod get_out_data_from_opts(name, sources, out_dim=None, **kwargs)

Parameters:

name (str)
sources (list[(LayerBase,str|Dim)])
out_dim (Dim|None)

Return type:

Data

classmethod transform_config_dict(d, network, get_layer)

Parameters:

d (dict[str]) – will modify inplace
network (returnn.tf.network.TFNetwork)
get_layer (((str) -> LayerBase)) – function to get or construct another layer

class returnn.tf.layers.basic.DropoutLayer(in_dim=None, out_dim=None, extra_deps=(), **kwargs)

Just the same as CopyLayer, because that one already supports dropout.

Parameters:

in_dim (Dim|None) – just for checking. If this is provided, it will also set the feature_dim to this.
out_dim (Dim|None) – alternative to in_dim. see in_dim doc.
extra_deps (list[LayerBase]) – Just add as an additional dependency, without really using it. This can have an effect though on the search beam, via SelectSearchSourcesLayer. We only have this here for the CopyLayer because the get_out_data_from_opts() must know about it and define the right beam. Also see the option collocate_with, which is different in that it does not add a dependency. Note that this will not be real TF control dependencies, but it simply sets the dependency on the layer. If you want to have a real TF control dependency, use IdentityLayer.

layer_class: Optional[str] = 'dropout'

class returnn.tf.layers.basic.ScaledGradientLayer(scale, shift=None, scale_shift_by_sum_over_axis=None, clip_max_axis=None, **kwargs)

Just tf.identity() in the forward pass. Scales the gradient by some factor in backprop. Can be used as gradient reversal layer (with negative factor). Uses returnn.tf.util.basic.scaled_gradient(), or tf.stop_gradient()

Parameters:

scale (float|LayerBase) – if 0. and no shift, will use tf.stop_gradient
shift (float|LayerBase|None)
scale_shift_by_sum_over_axis (Dim|str|None) – if given, calculates the sum over this axis (absolute values) and multiplies the shift value by this sum.
clip_max_axis (Dim|str|None) – if given, clips the gradient to the max value in this axis before the transformation, for all values in the axis

layer_class: Optional[str] = 'scaled_grad'
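The forward/backward behavior described above can be sketched in plain Python. This is only a conceptual illustration, not the RETURNN or TensorFlow implementation: the two function names are hypothetical, and the shift, scale_shift_by_sum_over_axis and clip_max_axis options are left out.

```python
def scaled_gradient_forward(x):
    """Forward pass: the values pass through unchanged (identity)."""
    return list(x)


def scaled_gradient_backward(grad_output, scale):
    """Backward pass: the incoming gradient is multiplied by scale."""
    return [scale * g for g in grad_output]


x = [0.5, -1.0, 2.0]
y = scaled_gradient_forward(x)
# A negative scale flips the gradient sign (gradient reversal).
grad_reversed = scaled_gradient_backward([1.0, 1.0, 1.0], scale=-0.1)
# A scale of 0 lets no gradient through, comparable to stopping the gradient.
grad_stopped = scaled_gradient_backward([1.0, 1.0, 1.0], scale=0.0)
```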

get_dep_layers()

Return type:

list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)

Parameters:

d (dict[str]) – will modify inplace
network (returnn.tf.network.TFNetwork)
get_layer (((str) -> LayerBase)) – function to get or construct another layer

class returnn.tf.layers.basic.SelectSearchSourcesLayer(search_choices_layer, sources, **kwargs)

Selects the corresponding search beams from the source, given current search choices (determined by a layer). Like InternalLayer, only for internal purpose at the moment.

Parameters:

search_choices_layer (LayerBase)
sources (list[LayerBase])

classmethod select_if_needed(layer, search_choices)

Parameters:

layer (LayerBase)
search_choices (SearchChoices|None)

Return type:

LayerBase

get_dep_layers()

Return type:

list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)

Parameters:

d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer

classmethod get_out_data_from_opts(name, sources, search_choices, **kwargs)

Parameters:

name (str)
sources (list[LayerBase])
search_choices (LayerBase)

Return type:

Data

class returnn.tf.layers.basic.ActivationLayer(activation, opts=None, **kwargs)

This layer just applies an activation function. See returnn.tf.util.basic.get_activation_function() about supported functions. Also see EvalLayer and CombineLayer for similar layers.

Parameters:

activation (str) – e.g. “relu”, “tanh”, etc
opts (dict[str]|None) – for activation function, e.g. eps for safe_log

layer_class: Optional[str] = 'activation'
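A network dict fragment using this layer might look like the following sketch. It assumes the usual "class" and "from" keys of a RETURNN net dict. The layer names and the eps value are hypothetical, and the source layers are not defined here.

```python
# Hypothetical network dict fragment; layer names and the eps value are illustrative only.
network = {
    # Plain activation function applied to the output of "linear_out".
    "relu_out": {"class": "activation", "activation": "relu", "from": "linear_out"},
    # Activation with extra options, here eps for safe_log.
    "log_probs": {
        "class": "activation",
        "activation": "safe_log",
        "opts": {"eps": 1e-7},
        "from": "probs",
    },
}
```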

classmethod get_out_data_from_opts(activation, **kwargs)

Parameters:

activation (str)

Return type:

Data

class returnn.tf.layers.basic.BatchNormLayer(in_dim=None, use_shift=<class 'returnn.util.basic.NotSpecified'>, use_std=<class 'returnn.util.basic.NotSpecified'>, use_sample=<class 'returnn.util.basic.NotSpecified'>, force_sample=<class 'returnn.util.basic.NotSpecified'>, momentum=<class 'returnn.util.basic.NotSpecified'>, epsilon=<class 'returnn.util.basic.NotSpecified'>, update_sample_only_in_training=<class 'returnn.util.basic.NotSpecified'>, delay_sample_update=<class 'returnn.util.basic.NotSpecified'>, param_version=<class 'returnn.util.basic.NotSpecified'>, gamma_init=<class 'returnn.util.basic.NotSpecified'>, beta_init=<class 'returnn.util.basic.NotSpecified'>, masked_time=<class 'returnn.util.basic.NotSpecified'>, **kwargs)

Implements batch-normalization (https://arxiv.org/abs/1502.03167) as a separate layer.

Also see NormLayer.

Parameters:

in_dim (returnn.tensor.Dim|None)
use_shift (bool)
use_std (bool)
use_sample (float) – defaults to 0.0 which is used in training
force_sample (bool) – even in eval, use the use_sample factor
momentum (float) – for the running average of sample_mean and sample_std
update_sample_only_in_training (bool)
delay_sample_update (bool)
param_version (int) – 0 or 1 or 2
epsilon (float)
gamma_init (str|float) – see returnn.tf.util.basic.get_initializer(), for the scale
beta_init (str|float) – see returnn.tf.util.basic.get_initializer(), for the mean
masked_time (bool) – flatten and mask input tensor

The default settings for these variables are set in the function batch_norm() of LayerBase. If you do not want to change them you can leave them undefined here. With our default settings:

In training: use_sample=0, i.e. not using running average, using current batch mean/var.
Not in training (e.g. eval): use_sample=1, i.e. using running average, not using current batch mean/var.
The running average includes the statistics of the current batch.
The running average is also updated when not training.

layer_class: Optional[str] = 'batch_norm'
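The role of use_sample in the two modes above can be illustrated with a small pure-Python sketch. It assumes that use_sample linearly interpolates between the current batch statistics (use_sample=0) and the running average (use_sample=1). That interpolation, the function name batch_norm_1d and the epsilon default used here are assumptions made for illustration. The trainable scale (gamma) and shift (beta) and the running-average update are omitted.

```python
import math


def batch_norm_1d(xs, running_mean, running_var, use_sample, epsilon=1e-3):
    """Normalizes one feature over the batch (hypothetical helper).

    use_sample=0 uses only the current batch statistics,
    use_sample=1 uses only the running statistics (assumed interpolation).
    """
    n = len(xs)
    batch_mean = sum(xs) / n
    batch_var = sum((x - batch_mean) ** 2 for x in xs) / n
    mean = (1.0 - use_sample) * batch_mean + use_sample * running_mean
    var = (1.0 - use_sample) * batch_var + use_sample * running_var
    return [(x - mean) / math.sqrt(var + epsilon) for x in xs]


xs = [1.0, 2.0, 3.0]
# Training-like setting: current batch mean/var.
train_out = batch_norm_1d(xs, running_mean=0.0, running_var=1.0, use_sample=0.0)
# Eval-like setting: running average only.
eval_out = batch_norm_1d(xs, running_mean=0.0, running_var=1.0, use_sample=1.0)
```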

class returnn.tf.layers.basic.LayerNormLayer(in_dim=None, out_dim=None, epsilon=1e-06, **kwargs)

Applies layer-normalization.

Note that we just normalize over the feature-dim axis here. This is consistent with the default behavior of tf.keras.layers.LayerNormalization and also with how it is commonly used in many models, including Transformer.

However, there are cases where it would be common to normalize over all axes except batch-dim, or all axes except batch and time. For a more generic variant, see NormLayer.

Parameters:

in_dim (Dim|None) – axis to normalize over. feature-dim by default
out_dim (Dim|None) – just the same as in_dim
epsilon (float)

layer_class: Optional[str] = 'layer_norm'
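Normalizing over the feature axis only can be sketched in plain Python as follows. This is a conceptual illustration on nested lists of shape (B,T,F), not the layer's TF implementation. The helper name layer_norm is hypothetical, and any learned scale or bias is left out.

```python
import math


def layer_norm(features, epsilon=1e-6):
    """Normalizes a single feature vector to zero mean and unit variance."""
    mean = sum(features) / len(features)
    var = sum((v - mean) ** 2 for v in features) / len(features)
    return [(v - mean) / math.sqrt(var + epsilon) for v in features]


# Input of shape (B=1, T=2, F=3). Each frame is normalized independently,
# so the statistics are never shared across batch or time.
x = [[[1.0, 2.0, 3.0], [4.0, 4.0, 4.0]]]
y = [[layer_norm(frame) for frame in seq] for seq in x]
```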

classmethod get_out_data_from_opts(sources, name, **kwargs)

Parameters:

sources (list[LayerBase])
name (str)

Return type:

Data

class returnn.tf.layers.basic.NormLayer(axis=<class 'returnn.util.basic.NotSpecified'>, axes=<class 'returnn.util.basic.NotSpecified'>, param_shape=<class 'returnn.util.basic.NotSpecified'>, scale=True, bias=True, epsilon=1e-06, **kwargs)

Normalize over specified axes, e.g. time and/or feature axis.

Note: For calculating a norm, see MathNormLayer instead.

In case of just feature (axes="F"), this corresponds to layer normalization (see LayerNormLayer). In case of time and feature (axes="TF") for a 3D input, or more general all except batch (axes="except_batch"), this corresponds to group normalization with G=1, or non-standard layer normalization. (The definition of layer-normalization is not clear on what axes should be normalized over. In many other frameworks, the default axis is just the last axis, which is usually the feature axis. However, in certain implementations and models, it is also common to normalize over all axes except batch.)

The statistics are calculated just on the input. There are no running statistics (in contrast to batch normalization, see BatchNormLayer).

For some discussion on the definition of layer-norm vs group-norm, also see here and here.

Parameters:

axis (Dim|str|list[Dim|str]) – axis or axes over which the mean and variance are computed, e.g. “F” or “TF”
axes (Dim|str|list[Dim|str]) – axis or axes over which the mean and variance are computed, e.g. “F” or “TF”
param_shape (Dim|str|list[Dim|str]|tuple[Dim|str]) – shape of the scale and bias parameters. You can also refer to (static) axes of the input, such as the feature-dim. This is also the default, i.e. a param-shape of [F], independent of the axes to normalize over.
scale (bool) – add trainable scale parameters
bias (bool) – add trainable bias parameters
epsilon (float) – epsilon for numerical stability

layer_class: Optional[str] = 'norm'

classmethod get_out_data_from_opts(sources, name, **kwargs)

Parameters:

sources (list[LayerBase])
name (str)

Return type:

Data

class returnn.tf.layers.basic.MathNormLayer(p, axis=<class 'returnn.util.basic.NotSpecified'>, axes=<class 'returnn.util.basic.NotSpecified'>, keep_dims=False, **kwargs)

Calculates sum(abs(x) ** p) ** (1./p).

Parameters:

p (int|float)
axis (Dim|str|list[Dim|str])
axes (Dim|str|list[Dim|str])
keep_dims (bool)

layer_class: Optional[str] = 'math_norm'
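The formula sum(abs(x) ** p) ** (1./p) can be checked with a short pure-Python sketch over a single axis. The helper name math_norm is hypothetical and only mirrors the formula, not the layer's handling of axes or keep_dims.

```python
def math_norm(values, p):
    """Computes sum(abs(x) ** p) ** (1 / p) over a flat list of values."""
    return sum(abs(v) ** p for v in values) ** (1.0 / p)


l1 = math_norm([3.0, -4.0], p=1)  # 3 + 4 = 7.0
l2 = math_norm([3.0, -4.0], p=2)  # sqrt(9 + 16) = 5.0
```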

classmethod get_out_data_from_opts(name, sources, axis=<class 'returnn.util.basic.NotSpecified'>, axes=<class 'returnn.util.basic.NotSpecified'>, keep_dims=False, **kwargs)

Parameters:

name (str)
sources (list[LayerBase])
axis (Dim|str|list[Dim|str])
axes (Dim|str|list[Dim|str])
keep_dims (bool)

Return type:

Data

class returnn.tf.layers.basic.SliceLayer(axis, slice_start=None, slice_end=None, slice_step=None, out_dim=None, **kwargs)

Slicing on the input, i.e. x[start:end:step] in some axis. See also SliceNdLayer, for variable start. See also GatherLayer, for one single position.

Note that __getitem__ on a TF tensor (or also Numpy ND array) is more generic, and supports slices in multiple axes, as well as adding new dimensions, etc. It even accepts boolean values, in which case it applies a boolean mask. See TF _slice_helper (== tf.Tensor.__getitem__) for a generic implementation, which calls tf.strided_slice. If we ever need such more generic support, we might consider adding a new layer, like GenericSliceLayer, which gets a slice_spec, just like _slice_helper (argument to __getitem__). But any such slice can already be constructed with multiple individual layers, which perform individual slices (per axis).

We just support slicing in a single axis here, with optional striding (slice_step).

Parameters:

axis (Dim|str)
axis_kind (str|None) – “T” for time, “B” for batch, “F” for feature
slice_start (int|None)
slice_end (int|None)
slice_step (int|None)
out_dim (Dim|None)

layer_class: Optional[str] = 'slice'
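Slicing in a single axis with optional striding corresponds to ordinary Python slicing applied along that axis. The following sketch shows this for the time axis of a nested list of shape (B,T,F). The helper name slice_time_axis is hypothetical and fixes the axis to T for brevity.

```python
def slice_time_axis(x, slice_start=None, slice_end=None, slice_step=None):
    """Applies seq[start:end:step] to the T axis of a (B, T, F) nested list."""
    return [seq[slice_start:slice_end:slice_step] for seq in x]


x = [[[0], [1], [2], [3], [4], [5]]]  # (B=1, T=6, F=1)
# Keeps frames 1 and 3: start at 1, stop before 5, take every second frame.
y = slice_time_axis(x, slice_start=1, slice_end=5, slice_step=2)
```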

classmethod get_out_data_from_opts(name, axis, sources=(), slice_start=None, slice_end=None, slice_step=None, out_dim=None, **kwargs)

Parameters:

name (str)
axis (Dim|str)
sources (list[LayerBase])
slice_start (int|None)
slice_end (int|None)
slice_step (int|None)
out_dim (Dim|None)

Return type:

Data

class returnn.tf.layers.basic.SliceNdLayer(size, start=None, min_size=None, axis='T', out_spatial_dim=None, **kwargs)

This takes out a slice-range from the time axis, e.g. x[start:start + size]. If the input is of shape (B,T,F) and start is of shape (B,), then the output will be of shape (B,size,F). If the input is of shape (B,T,F) and start is of shape (B,T), then the output will be of shape (B,T,size,F). This layer allows a different start slice point for each batch, in contrast to SliceLayer, and the start is variable. See also GatherNdLayer. PrefixInTimeLayer can recover the original shape (by zero-padding).

Parameters:

start (int|LayerBase|None) – (B,…)
size (int|LayerBase|Dim|None) – We assume that this is >=0. If this might not be the case, use min_size=0. If None, it uses the max possible size, and it becomes a dynamic axis.
min_size (int|None) – if size is None, but we want to have a min-size
axis (Dim|str)
out_spatial_dim (Dim|None)

layer_class: Optional[str] = 'slice_nd'

recurrent = True
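The case of an input of shape (B,T,F) with start of shape (B,) can be sketched in plain Python as below. The helper name slice_nd is hypothetical, and the sketch assumes that start + size stays within the time axis for every batch entry, so out-of-range handling is not covered.

```python
def slice_nd(x, start, size):
    """Takes x[b][start[b]:start[b] + size] for each batch entry b."""
    return [seq[s:s + size] for seq, s in zip(x, start)]


x = [
    [[10], [11], [12], [13]],
    [[20], [21], [22], [23]],
]  # (B=2, T=4, F=1)
# A different start position per batch entry, output shape (B=2, size=2, F=1).
y = slice_nd(x, start=[0, 2], size=2)
```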

get_dep_layers()

Return type:

list[LayerBase]

classmethod get_out_data_from_opts(name, sources=(), start=None, size=None, axis='T', out_spatial_dim=None, **kwargs)

Parameters:

name (str)
sources (list[LayerBase])
start (int|LayerBase|None)
size (int|LayerBase|Dim|None)
axis (Dim|str)
out_spatial_dim (Dim|None)

Return type:

Data

classmethod transform_config_dict(d, network, get_layer)

Parameters:

d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer

class returnn.tf.layers.basic.GatherLayer(position: LayerBase | int, axis: Dim | str, clip_to_valid: bool = False, **kwargs)

Gathers slices on a specified axis from the input layer using indices from a position layer. If the input is a layer of the shape [B,D,F1], and position of shape [B,F2], this will yield output of the shape [B,F2,F1] where

output[b,f2,f1] = input[b,position[b,f2],f1]

(if D is the axis to gather from). In general, all shared axes of the input and the positions will be considered as batch-axes.

The position argument can also be an int. In this case, this simply gives input[position] on the specified axis.

It’s basically a wrapper around tf.gather. It provides the same functionality as the deprecated GatherNdLayer, but is more generic. See also GatherNdLayer.

Parameters:

position – indices used to select the slices of the input. If another layer, must be of type int32 or int64. Can also specify a constant int.
axis – The axis to gather from, i.e. the axis that the indices refer to
clip_to_valid – if True, the indices will be clipped to the valid range of the input, also taking seq lengths into account.

layer_class: Optional[str] = 'gather'
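The indexing rule output[b,f2,f1] = input[b,position[b,f2],f1] can be written out in plain Python for an input of shape [B,D,F1] and positions of shape [B,F2]. The helper name gather is hypothetical and covers only this one shape combination, without clip_to_valid.

```python
def gather(inp, position):
    """Returns out with out[b][f2][f1] == inp[b][position[b][f2]][f1]."""
    return [
        [list(inp[b][d]) for d in position[b]]
        for b in range(len(inp))
    ]


inp = [
    [[1, 2], [3, 4], [5, 6]],
    [[7, 8], [9, 10], [11, 12]],
]  # (B=2, D=3, F1=2)
position = [[2, 0], [1, 1]]  # (B=2, F2=2), indices into the D axis
out = gather(inp, position)  # (B=2, F2=2, F1=2)
```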

get_dep_layers()

Return type:

list[LayerBase]

classmethod get_out_data_from_opts(name, sources, position, axis, **kwargs)

Parameters:

name (str)
sources (list[LayerBase])
position (LayerBase|int)
axis (Dim|str)

Return type:

Data

classmethod transform_config_dict(d, network, get_layer)

Parameters:

d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer

class returnn.tf.layers.basic.GatherNdLayer(position, **kwargs)

Warning: This layer is deprecated, use the more general GatherLayer instead. GatherLayer should be equivalent, but is more general (supports multiple batch dimensions, can specify gather axis) and its name is less misleading.

This takes out a position from some axis, e.g. x[pos]. This layer allows a different position for each batch. It’s basically a wrapper around tf.gather (the name of this layer is misleading). See also GatherLayer instead, which will replace this layer in the future. See also SliceNdLayer. See also ScatterNdLayer, which is the inverse operation.

Parameters:

position (LayerBase) – indices into first axis (excluding batch) of the input

layer_class: Optional[str] = 'gather_nd'

get_dep_layers()

Return type:

list[LayerBase]

classmethod get_out_data_from_opts(name, sources, position, **kwargs)

Parameters:

name (str)
sources (list[LayerBase])
position (LayerBase)

Return type:

Data

classmethod transform_config_dict(d, network, get_layer)

Parameters:

d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer

class returnn.tf.layers.basic.ScatterNdLayer(position, position_axis, output_dim_via_time_from=None, out_spatial_dim=None, filter_invalid_indices=False, **kwargs)

The inverse of GatherNdLayer. Mostly a wrapper for tf.scatter_nd.

Note that “nd” is maybe a bit misleading. While we operate on N-D tensors, the indices (position) are into a single new dimension.

The input to the layer are the updates, the indices are via the position argument. The indices are into the newly constructed output dimension. The output shape is constructed via the common shape of the input, the position, and the unique common axis (if not unique, we would need to introduce an option to specify it) is replaced by the given output dimension (currently via output_dim_via_time_from).

Examples:

position (indices): (B,eTs)
input (updates): (eTs,D) or (B,eTs,D) -> expanded to (B,eTs,D)
output shape: (B,eT,D)

position (indices): (B,dT,eTs)
input (updates): (eTs,D) -> expanded to (B,dT,eTs,D)
output shape: (B,dT,eT,D)

position (indices): (dT,eTs)
input (updates): (eTs,D) -> expanded to (dT,eTs,D)
output shape: (dT,eT,D)

position (indices): (dT,eTs)
input (updates): (B,eTs,D) -> expanded to (dT,eTs,B,D)
output shape: (dT,eT,B,D)

In all these examples, output_dim_via_time_from is (B,eT,F), and eTs gets replaced by eT.

Parameters:

position (LayerBase) – indices into first axis (excluding batch) of the output
position_axis (Dim|str) – axis in position to replace by the output-dim
output_dim_via_time_from (LayerBase|None) – use the time-dim from this layer as the output-dim
out_spatial_dim (Dim|None)
filter_invalid_indices (bool) – allow for indices <0 or >= output_dim, which will be discarded in the output

layer_class: Optional[str] = 'scatter_nd'
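The first example above, with position of shape (B,eTs) and updates of shape (B,eTs,D) giving an output of shape (B,eT,D), can be sketched in plain Python. The helper name scatter_nd and its output_dim argument are hypothetical stand-ins (the layer takes the output dim from another layer). Updates that land on the same index are summed, as tf.scatter_nd does.

```python
def scatter_nd(position, updates, output_dim, filter_invalid_indices=False):
    """Scatters updates[b][i] into out[b][position[b][i]] (summing duplicates)."""
    batch = len(position)
    feat = len(updates[0][0])
    out = [[[0.0] * feat for _ in range(output_dim)] for _ in range(batch)]
    for b in range(batch):
        for i, t in enumerate(position[b]):
            if not 0 <= t < output_dim:
                if filter_invalid_indices:
                    continue  # invalid indices are discarded
                raise IndexError("position %r out of range" % (t,))
            for d in range(feat):
                out[b][t][d] += updates[b][i][d]
    return out


updates = [[[1.0, 1.0], [2.0, 2.0]]]  # (B=1, eTs=2, D=2)
out = scatter_nd([[0, 2]], updates, output_dim=4)  # (B=1, eT=4, D=2)
# Index 5 is outside the output dim and gets dropped when filtering is enabled.
out_filtered = scatter_nd([[0, 5]], updates, output_dim=4, filter_invalid_indices=True)
```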

get_dep_layers()

Return type:

list[LayerBase]

classmethod get_out_data_from_opts(name, sources, position, position_axis, output_dim_via_time_from=None, out_spatial_dim=None, **kwargs)

Parameters:

name (str)
sources (list[LayerBase])
position (LayerBase)
position_axis (Dim|str) – axis in position to replace by the output-dim
output_dim_via_time_from (LayerBase|None) – use the time-dim from this layer as the output-dim
out_spatial_dim (Dim|None)

Return type:

Data

classmethod transform_config_dict(d, network, get_layer)

Parameters:

d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer ((str)->LayerBase)

class returnn.tf.layers.basic.LinearLayer(activation=None, with_bias=True, grad_filter=None, forward_weights_init='glorot_uniform', bias_init=0.0, use_transposed_weights=False, **kwargs)

Linear/forward/fully-connected/1x1-conv layer. Does a linear transformation on the feature-dimension of the input with an optional bias term and an optional activation function. See also DotLayer, ElemwiseProdLayer, WeightedSumLayer.

Parameters:

activation (str|None) – e.g. “relu”, or None
with_bias (bool)
grad_filter (float|None) – if grad norm is higher than this threshold (before activation), the grad is removed
forward_weights_init (str) – see returnn.tf.util.basic.get_initializer()
recurrent_weights_init (str) – see returnn.tf.util.basic.get_initializer()
bias_init (str|float) – see returnn.tf.util.basic.get_initializer()
use_transposed_weights (bool) – If True, define the weight matrix with transposed dimensions (n_out, n_in).

layer_class: Optional[str] = 'linear'

class returnn.tf.layers.basic.SoftmaxLayer(**kwargs)

Just a LinearLayer with activation=”softmax” by default.

Parameters:

activation (str|None) – e.g. “relu”, or None
with_bias (bool)
grad_filter (float|None) – if grad norm is higher than this threshold (before activation), the grad is removed
forward_weights_init (str) – see returnn.tf.util.basic.get_initializer()
recurrent_weights_init (str) – see returnn.tf.util.basic.get_initializer()
bias_init (str|float) – see returnn.tf.util.basic.get_initializer()
use_transposed_weights (bool) – If True, define the weight matrix with transposed dimensions (n_out, n_in).

layer_class: Optional[str] = 'softmax'
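A network dict fragment combining a linear layer and a softmax layer might look like the following sketch. It assumes the usual "class" and "from" keys of a RETURNN net dict and uses the n_out option documented for layers in general. The layer names, the source name "data" and the dimensions are hypothetical.

```python
# Hypothetical network dict fragment; names and dimensions are illustrative only.
network = {
    # Linear transformation on the feature dim, followed by relu.
    "hidden": {
        "class": "linear",
        "activation": "relu",
        "with_bias": True,
        "n_out": 512,
        "from": "data",
    },
    # Same kind of layer, but with softmax as the default activation.
    "output": {"class": "softmax", "n_out": 100, "from": "hidden"},
}
```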

class returnn.tf.layers.basic.LengthLayer(axis='T', add_time_axis=False, dtype='int32', sparse=False, **kwargs)

Returns the length of sources as (B,), via input size_placeholder.

Parameters:

axis (str|Dim)
add_time_axis (bool) – should not be used
dtype (str)
sparse (bool)

layer_class: Optional[str] = 'length'

classmethod fixup_dim(dim, sources)

Parameters:

dim (Dim)
sources (list[LayerBase])

Return type:

Dim

classmethod get_out_data_from_opts(name, sources, axis='T', add_time_axis=False, dtype='int32', sparse=False, **kwargs)

Parameters:

name (str)
sources (list[LayerBase])
axis (str|Dim)
add_time_axis (bool)
dtype (str)
sparse (bool)

Return type:

Data

class returnn.tf.layers.basic.SoftmaxOverSpatialLayer(axis=None, energy_factor=None, start=None, window_start=None, window_size=None, use_time_mask=None, log_space=False, **kwargs)

This applies a softmax over spatial axis/axes (currently only time axis supported). E.g. when the input is of shape (B,T,dim), the output will be (B,T,dim). It automatically masks the frames outside the seq defined by the seq-len. In contrast to SoftmaxLayer, this will not do a linear transformation. See SeqLenMaskLayer if you just want to apply a masking.

Parameters:

axis (Dim|str|None) – which axis to do the softmax over. “T” by default
energy_factor (float|None) – the energy will be scaled by this factor. This is like a temperature for the softmax. In Attention-is-all-you-need, this is set to 1/sqrt(base_ctx.dim).
start (LayerBase|None) – Tensor of shape (B,) indicating the start frame
window_start (LayerBase|int|None) – Layer with output of shape (B,) or (constant) int value indicating the window start.
window_size (LayerBase|int|None) – Layer with output of shape (B,) or (constant) int value indicating the window size.
use_time_mask (bool) – if True, assumes dyn seq len, and use it for masking. By default, if dyn seq len exists, it uses it.
log_space (bool) – if True, returns in log space (i.e. uses log_softmax)

layer_class: Optional[str] = 'softmax_over_spatial'[source]¶

recurrent = True[source]¶

get_dep_layers()[source]¶

Return type:: list[LayerBase]

classmethod get_out_data_from_opts(name, sources, axis=None, start=None, window_start=None, window_size=None, **kwargs)[source]¶

Parameters:

name (str)
sources (list[LayerBase])
axis (Dim|str|None)
start (LayerBase|None)
window_start (LayerBase|None)
window_size (LayerBase|int|None)

Return type:

Data

classmethod transform_config_dict(d, network, get_layer)[source]¶

Parameters:

d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer

class returnn.tf.layers.basic.SeqLenMaskLayer(mask_value, axis='T', seq_len_source=None, start=None, window_start=None, window_size=None, **kwargs)[source]¶

Masks some values away given the seq_len_source with mask_value. Also see SoftmaxOverSpatialLayer. Also see SwitchLayer, which can be used to apply a generic mask.

Parameters:

seq_len_source (LayerBase|None) – if not given, uses source
axis (Dim|str)
mask_value (float)
start (LayerBase|None) – Tensor of shape (B,) indicating the start frame
window_start (LayerBase|None) – Tensor of shape (B,) indicating the window start
window_size (LayerBase|int|None)

layer_class: Optional[str] = 'seq_len_mask'[source]¶

classmethod build_mask(x, axis='T', axis_allow_int=<class 'returnn.util.basic.NotSpecified'>, seq_len_source=None, start=None, window_start=None, window_size=None)[source]¶

Parameters:

x (Data)
axis (Dim|str|int)
axis_allow_int (bool|NotSpecified) – Some callers of this function would pass in an int for axis directly. In that case, explicitly set this to True.
seq_len_source (Data|None)
start (Data|None)
window_start (Data|None)
window_size (Data|int|None)

Returns:

mask which is broadcastable to energy_data, thus you can e.g. use returnn.tf.util.basic.where_bc()

Return type:

tf.Tensor

get_dep_layers()[source]¶

Return type:: list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]¶

Parameters:

d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer

classmethod get_out_data_from_opts(name, sources, start=None, window_start=None, window_size=None, **kwargs)[source]¶

Parameters:

name (str)
sources (list[LayerBase])
start (LayerBase|None)
window_start (LayerBase|None)
window_size (LayerBase|int|None)

Return type:

Data

class returnn.tf.layers.basic.BooleanMaskLayer(*, mask: LayerBase, dims: Sequence[Dim], out_dim: Dim | None = None, **kwargs)[source]¶

Wrapper around tf.boolean_mask.

Parameters:

mask
dims
out_dim

layer_class: Optional[str] = 'boolean_mask'[source]¶

get_dep_layers() → List[LayerBase][source]¶

Returns:

dep layers

classmethod transform_config_dict(d, network, get_layer)[source]¶

Parameters:

d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer

classmethod get_out_data_from_opts(*, name: str, sources: Sequence[LayerBase], mask: LayerBase, out_dim: Dim | None = None, **kwargs) → Tensor[source]¶

Parameters:

name
sources
mask
out_dim

class returnn.tf.layers.basic.RandomStateInitLayer(algorithm=None, seed=None, out_dim=None, **kwargs)[source]¶

This calculates the initial state value for the state var of RandomLayer. This depends on the algorithm and seed.

Parameters:

algorithm (str|tf.random.Algorithm|None) – “philox”, “three-fry”, “auto-select”. by default “philox”. See tf.random.stateless_uniform() for some documentation. “auto-select” will automatically select the optimal algorithm based on the device, so it might select a different algorithm depending on the device. Note that the state shape is dependent on the device, so if you want that checkpoints are compatible across devices, do not use “auto-select”. We take the default from tf.random.Generator.
seed (int|Sequence[int]|numpy.ndarray|None) – if given, the state will deterministically depend on this (and the algorithm) and nothing else. If you have multiple random generators (state vars), make sure that you have different seeds for each! If None (default), the seed will be deterministically taken from the network random generator at construction time, which is usually a good idea. You still can change the global network seed.
out_dim (Dim|None) – new dim tag for random state dim

layer_class: Optional[str] = 'random_state_init'[source]¶

classmethod select_algorithm(algorithm)[source]¶

Parameters:: algorithm (str|int|tf.random.Algorithm|None)
Return type:: int

classmethod get_out_data_from_opts(name, algorithm=None, out_dim=None, **kwargs)[source]¶

Parameters:

name (str)
algorithm (str|None)
out_dim (Dim|None)

Return type:

Data

classmethod transform_config_dict(d, network, get_layer)[source]¶

Parameters:

d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer

class returnn.tf.layers.basic.RandomLayer(shape, distribution, mean=None, stddev=None, bound=None, minval=None, maxval=None, dtype='float32', sparse_dim=None, feature_dim=None, seed=None, algorithm=None, explicit_state=None, auto_update_state=None, static=None, shape_deps=(), stop_grad: bool = False, **kwargs)[source]¶

Generates random numbers from uniform or normal or truncated normal distribution.

This uses the TensorFlow stateless random ops internally, i.e. all the state handling is explicit. The state var can be explicitly provided and initialized via RandomStateInitLayer, or when not provided it will be automatically created.

There are two possible distinct use cases:

For any randomness in the model, e.g. dropout. So each session.run step will produce a new random number and advance the random state.
To initialize parameters via the config, using VariableLayer with the init_by_layer option. This will only be called once when initializing the parameters. For this use case, we do not want to keep a random state var. You can just pass static=True. Alternatively you could also pass the output of a RandomStateInitLayer as explicit_state.
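The two use cases might look as follows in a network dict. This is a hedged sketch: the layer names, shapes and seed are made up, and it assumes that VariableLayer is registered as "variable" and accepts a layer name for init_by_layer. The second case uses the static option, which per its description means that no state is used at all.

```python
# Hypothetical network fragment; layer names and sizes are made up for illustration.
network = {
    # Use case 1: new random numbers in every session.run step.
    # No explicit state is given, so the state var is created automatically.
    "noise": {
        "class": "random",
        "shape": [128],
        "distribution": "normal",
        "mean": 0.0,
        "stddev": 1.0,
    },
    # Use case 2: one-off values for parameter initialization, without a random state var.
    "w_init": {
        "class": "random",
        "shape": [128, 64],
        "distribution": "uniform",
        "bound": 0.1,  # uniform in [-0.1, 0.1)
        "static": True,
        "seed": 42,
    },
    # Assumption: VariableLayer ("variable") takes the same shape and an init_by_layer option.
    "w": {"class": "variable", "shape": [128, 64], "init_by_layer": "w_init"},
}
```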

Parameters:

shape (Sequence[Dim|int])
distribution (str) – “uniform”, “normal” or “truncated_normal”
mean (int|float|LayerBase|None)
stddev (int|float|LayerBase|None)
bound (int|float|LayerBase|None) – for uniform, defining the range [-bound, bound)
minval (int|float|LayerBase|None) – for uniform
maxval (int|float|LayerBase|None) – for uniform
dtype (str)
sparse_dim (Dim|None)
feature_dim (Dim|None)
seed (int|list[int]|numpy.ndarray|None) – If not given, uses self.network.random.randint, i.e. then it is controlled by the global seed setting, and every layer would get its own seed. If you specify it explicitly, make sure every RandomLayer uses a different seed, otherwise you would get the same random numbers everywhere.
algorithm (str|tf.random.Algorithm|None) – see RandomStateInitLayer
explicit_state (LayerBase|None) – You can pass the state explicitly here. If not given, will be created automatically, and updated automatically. You could pass a VariableLayer with initial value via RandomStateInitLayer, or directly a RandomStateInitLayer. If auto_update_state is True, it must be a variable, and every time a new random number is created, this variable is updated. Otherwise (default) it will not be updated automatically.
auto_update_state (bool|None) – only used when you pass an explicit state
static (bool|None) – if no state at all should be used. it just relies on the seed then.
shape_deps (list[LayerBase]) – for dyn dim tags in shape
stop_grad (bool) – if True, will stop the gradient to mean,stddev,bound,minval,maxval

layer_class: Optional[str] = 'random'[source]¶

get_dep_layers()[source]¶

Return type:: list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]¶

Parameters:

d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer

classmethod get_out_data_from_opts(name, shape, dtype='float32', sparse_dim=None, feature_dim=None, shape_deps=(), **kwargs)[source]¶

Parameters:

name (str)
shape (Sequence[Dim|int])
dtype (str)
sparse_dim (Dim|None)
feature_dim (Dim|None)
shape_deps (list[LayerBase]) – for dyn dim tags in shape

Return type:

Data

class returnn.tf.layers.basic.RandIntLayer(shape, maxval, minval=0, dtype='int32', sparse_dim=None, seed=None, **kwargs)[source]¶

Generates random integer numbers using tf.random.uniform. It is recommended to use RandomLayer instead.

Parameters:

shape (tuple[Dim|int]|list[Dim|int]) – desired shape of output tensor
maxval (int|LayerBase) – upper bound (exclusive) on range of random values
minval (int|LayerBase) – lower bound (inclusive) on range of random values
dtype (str) – type of the output. For random ints, int32 and int64 make sense, but could also be floats
sparse_dim (Dim|None)
seed (int|None) – random seed

layer_class: Optional[str] = 'rand_int'[source]¶

get_dep_layers()[source]¶

Return type:: list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]¶

Parameters:

d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer ((str)->LayerBase)

classmethod get_out_data_from_opts(name, network, shape, maxval, minval=0, dtype='int32', sparse_dim=None, **kwargs)[source]¶

Parameters:

name (str)
network (returnn.tf.network.TFNetwork)
shape (tuple[Dim|int]|list[Dim|int]) – desired shape of output tensor
maxval (int|LayerBase) – upper bound (exclusive) on range of random values
minval (int|LayerBase) – lower bound (inclusive) on range of random values
dtype (str) – type of the output. For random ints, int32 and int64 make sense, but could also be floats
sparse_dim (Dim|None)

Return type:

Data

class returnn.tf.layers.basic.RangeLayer(limit, start=0, delta=1, dtype=None, sparse=False, out_spatial_dim=None, **kwargs)[source]¶

Generic wrapper around tf.range. See also RangeInAxisLayer.

Parameters:

limit (int|float)
start (int|float)
delta (int|float)
dtype (str|None)
sparse (bool)
out_spatial_dim (Dim|None)

layer_class: Optional[str] = 'range'[source]¶

classmethod transform_config_dict(d, network, get_layer)[source]¶

Parameters:

d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer ((str)->LayerBase)

classmethod get_out_data_from_opts(name, limit, start=0, delta=1, dtype=None, sparse=False, out_spatial_dim=None, **kwargs)[source]¶

Parameters:

name (str)
limit (int|float)
start (int|float)
delta (int|float)
dtype (str|None)
sparse (bool)
out_spatial_dim (Dim|None)

Return type:

Data

class returnn.tf.layers.basic.RangeInAxisLayer(axis, dtype='int32', unbroadcast=False, keepdims=False, sparse=False, **kwargs)[source]¶

Assuming the input is e.g. (B,T,D) and you specify axis="T", you will get (T,), where the specified axis is filled with tf.range. See also RangeLayer.

Parameters:

axis (str|Dim)
dtype (str)
unbroadcast (bool) – DEPRECATED, unsupported, and not needed
keepdims (bool) – DEPRECATED, unsupported, and not needed
sparse (bool)

layer_class: Optional[str] = 'range_in_axis'[source]¶

recurrent = True[source]¶

classmethod get_out_data_from_opts(name, sources, axis, dtype='int32', sparse=False, **kwargs)[source]¶

Parameters:

name (str)
sources (list[LayerBase])
axis (str|Dim)
dtype (str)
sparse (bool)

class returnn.tf.layers.basic.RangeFromLengthLayer(dtype='int32', sparse=False, out_spatial_dim=None, **kwargs)[source]¶

Given some dynamic sequence lengths as input, this creates a tf.range over the implied dimension. As a side effect, this can create a new dyn dim tag for the given sequence lengths. This side effect can be the main functionality in certain use cases. See also RangeInAxisLayer.

Consider the example:

"y": {"class": "range_in_axis", "from": "x", "axis": "T"}

This is basically equivalent to:

"x_len": {"class": "length", "from": "x"}
"y": {"class": "range_from_length", "from": "x_len"}

Parameters:

axis (str)
dtype (str)
sparse (bool)
out_spatial_dim (Dim|None)

layer_class: Optional[str] = 'range_from_length'[source]¶

recurrent = True[source]¶

classmethod get_out_data_from_opts(name, sources, dtype='int32', sparse=False, out_spatial_dim=None, **kwargs)[source]¶

Parameters:

name (str)
sources (list[LayerBase])
dtype (str)
sparse (bool)
out_spatial_dim (Dim|None)

class returnn.tf.layers.basic.BatchSoftmaxLayer(**kwargs)[source]¶

Softmax over spatial and feature axis

Parameters:

in_dim (Dim|None)
out_shape (set[Dim|returnn.tf.util.data._MarkedDim]|tuple|list|None)
dropout (float) – 0.0 means to apply no dropout. dropout will only be applied during training
dropout_axis (Dim|str|list[Dim|str]|None)
dropout_noise_shape (dict[Dim|str|list[Dim|str]|tuple[Dim|str],int|str|None]|None) – see Data.get_bc_shape()
dropout_on_forward (bool) – apply dropout during inference
mask (str|None) – “dropout” or “unity” or None. this is obsolete and only here for historical reasons

layer_class: Optional[str] = 'batch_softmax'[source]¶

classmethod get_out_data_from_opts(name, sources, **kwargs)[source]¶

Parameters:

name (str)
sources (list[LayerBase])

Return type:

Data

class returnn.tf.layers.basic.ConstantLayer(sources, value=0.0, shape=None, dtype=None, with_batch_dim=False, sparse_dim=None, feature_dim=None, shape_deps=(), **kwargs)[source]¶

Output is a constant value.

Parameters:

sources (list[LayerBase])
value (int|float|bool|numpy.ndarray)
shape (tuple[Dim|int]|list[Dim|int]) – for verification, and defining dim tags
dtype (str|None)
with_batch_dim (bool)
sparse_dim (Dim|None)
feature_dim (Dim|None)
shape_deps (list[LayerBase]) – for dyn dim tags in shape

layer_class: Optional[str] = 'constant'[source]¶

get_dep_layers()[source]¶

Return type:: list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]¶

Parameters:

d (dict[str]) – will modify inplace
network (returnn.tf.network.TFNetwork)
get_layer (((str) -> LayerBase)) – function to get or construct another layer

classmethod get_out_data_from_opts(name, value=0.0, shape=None, dtype=None, with_batch_dim=False, sparse_dim=None, feature_dim=<class 'returnn.util.basic.NotSpecified'>, shape_deps=(), **kwargs)[source]¶

Parameters:

name (str)
value (int|float|bool)
shape (tuple[Dim|int]|list[Dim|int]) – for verification, and defining dim tags
dtype (str|None)
with_batch_dim (bool)
sparse_dim (Dim|None)
feature_dim (Dim|None|NotSpecified)
shape_deps (list[LayerBase]) – for dyn dim tags in shape

Return type:

Data

class returnn.tf.layers.basic.GatingLayer(activation, gate_activation='sigmoid', out_dim=None, **kwargs)[source]¶

Splits the input into two equal parts, applies the gate_activation (sigmoid by default) on the one part, some other activation (e.g. tanh) on the other part and then element-wise multiplies them. Thus, the output dimension is input-dimension / 2.
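A minimal sketch of the computation on a single feature vector, in plain Python. The helper name `gating` is hypothetical, and which half receives the gate activation (here the second half) is an assumption, not something stated above.

```python
import math


def gating(x, activation=math.tanh, gate_activation=lambda v: 1.0 / (1.0 + math.exp(-v))):
    """Gate a feature vector: activation(first half) * gate_activation(second half).

    Illustrative helper only; the assignment of the halves is an assumption.
    """
    assert len(x) % 2 == 0, "feature dimension must be even"
    half = len(x) // 2
    a, b = x[:half], x[half:]
    return [activation(u) * gate_activation(v) for u, v in zip(a, b)]


y = gating([0.5, -1.0, 0.0, 2.0])  # 4 input features -> 2 output features
```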

Parameters:

activation (str)
gate_activation (str)
out_dim (Dim|None)

layer_class: Optional[str] = 'gating'[source]¶

classmethod get_out_data_from_opts(name, sources, n_out=<class 'returnn.util.basic.NotSpecified'>, out_dim=None, **kwargs)[source]¶

Parameters:

name (str)
sources (list[LayerBase])
n_out (int|None|NotSpecified)
out_dim (Dim|None)

Return type:

Data

class returnn.tf.layers.basic.WindowLayer(window_size=None, window_dim=None, window_left=None, window_right=None, axis='T', out_spatial_dim=None, padding='same', stride=1, _use_opt_dim_order=None, **kwargs)[source]¶

Adds a window dimension. By default, uses the time axis and goes over it with a sliding window. The new axis for the window is created right after the time axis. In PyTorch, this is called unfold. We sometimes call this “chunking”. There is also the similar TimeChunkingLayer.

E.g. if the input is (batch, time, dim), the output is (batch, time, window_size, dim). If you want to merge the (window_size, dim) together to (window_size * dim,), you can use the MergeDimsLayer, e.g. {"class": "merge_dims", "axes": "except_time"}.

Use stride==window_size and window_right=window_size - 1 in combination with a MergeDimsLayer to achieve feature stacking with right-hand zero padding.

This is not to take out a single window from the time-dimension. See SliceLayer or SliceNdLayer.

The inverse layer is FoldLayer.
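The sliding window and the feature-stacking setup described above can be sketched on a 1-D sequence in plain Python. The helper `window` is hypothetical; in particular, centring the window when neither window_left nor window_right is given is an assumption of this sketch.

```python
def window(x, window_size, window_left=None, window_right=None, stride=1, pad_value=0):
    """Sliding window over a 1-D sequence with "same"-style padding.

    Returns a list of windows, i.e. shape (ceil(T / stride), window_size).
    Assumption: when neither side is given, window_left = window_size // 2.
    """
    if window_left is None and window_right is None:
        window_left = window_size // 2
    if window_left is None:
        window_left = window_size - window_right - 1
    window_right = window_size - window_left - 1
    padded = [pad_value] * window_left + list(x) + [pad_value] * window_right
    # One window per (strided) input position.
    return [padded[t : t + window_size] for t in range(0, len(x), stride)]


centred = window([1, 2, 3, 4], window_size=3)
# Feature stacking: stride == window_size and window_right == window_size - 1.
stacked = window([1, 2, 3, 4, 5], window_size=2, window_right=1, stride=2)
```

Here `centred` is `[[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 0]]`, and `stacked` is `[[1, 2], [3, 4], [5, 0]]`, i.e. consecutive frames are stacked and the last window is zero-padded on the right.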

Parameters:

window_size (int|None)
window_dim (Dim|None)
window_left (int|None)
window_right (int|None)
axis (Dim|str) – see Data.get_axis_from_description()
out_spatial_dim (Dim|None)
padding (str) – “same” or “valid”
stride (int) – return only each Nth window
_use_opt_dim_order (bool|None)

layer_class: Optional[str] = 'window'[source]¶

recurrent = True[source]¶

classmethod get_out_data_from_opts(name, network, sources, window_size=None, window_dim=None, axis='T', out_spatial_dim=None, padding='same', stride=1, _use_opt_dim_order=None, **kwargs)[source]¶

Parameters:

name (str)
network (returnn.tf.network.TFNetwork)
sources (list[LayerBase])
window_size (int|None)
window_dim (Dim|None)
axis (Dim|str)
out_spatial_dim (Dim|None)
padding (str)
stride (int)
_use_opt_dim_order (bool|None)

Return type:

Data

classmethod get_rec_initial_extra_outputs(network, batch_dim, rec_layer, window_size=None, window_dim=None, axis='T', sources=(), **kwargs)[source]¶

Parameters:

network (returnn.tf.network.TFNetwork)
batch_dim (tf.Tensor)
rec_layer (returnn.tf.layers.rec.RecLayer|LayerBase)
window_size (int|None)
window_dim (Dim|None)
axis (Dim|str)
sources (list[LayerBase])

Return type:

dict[str,tf.Tensor]

class returnn.tf.layers.basic.FoldLayer(mode: str, in_spatial_dim: Dim | str, window_dim: Dim | str, out_spatial_dim: Dim | None = None, padding: str = 'same', window_left: int | None = None, window_right: int | None = None, stride: int = 1, **kwargs)[source]¶

The inverse of WindowLayer. We sometimes call this “unchunking”. The TimeUnChunkingLayer is similar.

Input (in_spatial_dim, window_dim, other_dims…) -> output (out_spatial_dim, other_dims…).

The window_dim is folded into the out_spatial_dim. This is similar to the PyTorch fold operation (with mode="sum").
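A rough sketch of folding in plain Python, for a 1-D sequence. The helper `fold` is hypothetical, and normalizing "mean" by the number of windows that cover a frame is an assumption about what averaging overlapping frames means here.

```python
def fold(windows, out_len, window_left, stride=1, mode="sum"):
    """Fold (num_windows, window_size) back into a sequence of length out_len.

    Window w, slot k refers to output position w * stride - window_left + k.
    Slots that fall outside [0, out_len) correspond to padding and are dropped.
    """
    sums = [0.0] * out_len
    counts = [0] * out_len
    for w, win in enumerate(windows):
        for k, value in enumerate(win):
            t = w * stride - window_left + k
            if 0 <= t < out_len:
                sums[t] += value
                counts[t] += 1
    if mode == "sum":
        return sums
    if mode == "mean":
        # Assumption: average over the windows that actually cover the frame.
        return [s / c if c else 0.0 for s, c in zip(sums, counts)]
    raise ValueError("unknown mode %r" % mode)


# Windows of [1, 2, 3, 4] with window_size=3, centred and zero-padded.
windows = [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 0]]
summed = fold(windows, out_len=4, window_left=1, mode="sum")
averaged = fold(windows, out_len=4, window_left=1, mode="mean")
```

With these inputs, "mean" recovers the original sequence, while "sum" counts every frame once per overlapping window.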

Parameters:

mode – “sum” or “mean” (average), for overlapping frames
in_spatial_dim
window_dim
out_spatial_dim
padding
window_left
window_right
stride

layer_class: Optional[str] = 'fold'[source]¶

recurrent = True[source]¶

classmethod get_out_data_from_opts(name: str, sources: List[LayerBase], in_spatial_dim: Dim | str, window_dim: Dim | str, out_spatial_dim: Dim | None = None, padding: str = 'same', window_left: int | None = None, window_right: int | None = None, stride: int = 1, **kwargs) → Tensor[source]¶

Returns:

out data

class returnn.tf.layers.basic.CumsumLayer(axis='T', additional_left_summand_per_element=None, reverse=False, **kwargs)[source]¶

Basically wraps tf.cumsum. Also supports that in the RecLayer.

Parameters:

axis (str) – see Data.get_axis_from_description()
additional_left_summand_per_element (str|int|float|None) – the order matters for tf.string
reverse (bool)

layer_class: Optional[str] = 'cumsum'[source]¶

recurrent = True[source]¶

classmethod get_out_data_from_opts(name, sources, axis='T', **kwargs)[source]¶

Parameters:

name (str)
sources (list[LayerBase])
axis (str)

Return type:

Data

classmethod get_rec_initial_extra_outputs(network, batch_dim, rec_layer, axis='T', sources=(), **kwargs)[source]¶

Parameters:

network (returnn.tf.network.TFNetwork)
batch_dim (tf.Tensor)
rec_layer (returnn.tf.layers.rec.RecLayer|LayerBase)
axis (str)
sources (list[LayerBase])

Return type:

dict[str,tf.Tensor]

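class returnn.tf.layers.basic.PadLayer(axes, padding, out_dims=None, handle_dynamic_dims=None, value=0, mode='constant', **kwargs)[source]¶

Adds (e.g. zero) padding in some axis or axes. Also see PrefixInTimeLayer for dynamic dims.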

Parameters:

axes – e.g. “F” etc. see Data.get_axes_from_description().
padding – how much to pad left/right in each axis
out_dims
handle_dynamic_dims – True: when doing right padding on a dynamic dim, value will be added after the seq end, not at the end of the dimension. False: value will be added at the end of the dimension. By default, in behavior version >=21, this is True, in older versions, this is False.
value – what constant value to pad, with mode=="constant"
mode – “constant”, “reflect”, “symmetric” and “replication”
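The effect of handle_dynamic_dims on right padding can be illustrated on a padded (B,T) batch in plain Python. The helper `pad_right` is hypothetical. Filling the positions beyond the new sequence length with 0 is an assumption of this sketch; those positions are padding, so their content should not matter.

```python
def pad_right(batch, seq_lens, pad, value, handle_dynamic_dims):
    """Right-pad a (B, T) batch by `pad` frames. Illustrative only.

    Assumption: positions beyond the new sequence length hold 0.
    """
    out = []
    for row, seq_len in zip(batch, seq_lens):
        if handle_dynamic_dims:
            # The value goes directly after the end of each sequence.
            new_row = row[:seq_len] + [value] * pad
            new_row += [0] * (len(row) + pad - len(new_row))
        else:
            # The value goes at the end of the (padded) dimension.
            new_row = row + [value] * pad
        out.append(new_row)
    return out


batch = [[1, 2, 3], [4, 5, 0]]  # second sequence has length 2, its last frame is padding
seq_lens = [3, 2]
after_seq_end = pad_right(batch, seq_lens, pad=1, value=9, handle_dynamic_dims=True)
at_dim_end = pad_right(batch, seq_lens, pad=1, value=9, handle_dynamic_dims=False)
```

For the shorter sequence, `after_seq_end` yields `[4, 5, 9, 0]`, whereas `at_dim_end` yields `[4, 5, 0, 9]`, where the pad value ends up behind the old padding frame.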

layer_class: Optional[str] = 'pad'[source]¶

classmethod get_out_data_from_opts(name, sources, axes, padding, out_dims=None, **kwargs)[source]¶

Parameters:

name (str)
sources (list[LayerBase])
axes (Dim|str|Sequence[Dim|str])
padding (Sequence[(int|Dim,int|Dim)]|(int|Dim,int|Dim)|int|Dim)
out_dims (Dim|Sequence[Dim]|None)

Return type:

Data

classmethod transform_config_dict(d, network, get_layer)[source]¶

Parameters:

d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer

class returnn.tf.layers.basic.MergeDimsLayer(axes, keep_order=<class 'returnn.util.basic.NotSpecified'>, n_out=None, out_dim=None, **kwargs)[source]¶

Merges a list of axes into a single one. (Flatten the dims.) E.g. input is (batch, width, height, dim) and axes=(1,2), then we get (batch, width*height, dim). Or input is (batch, time, height, dim) and axes="except_time", then we get (batch, time, height*dim). See also CombineDimsLayer. When batch and time got merged, SplitBatchTimeLayer can undo this. When you want to merge batch and time, but remove the padding efficiently, i.e. flatten it, see FlattenBatchLayer.

Parameters:

axes (Sequence[Dim|str]) – see Data.get_axis_from_description()
keep_order (bool|NotSpecified) – The old default was: the axes are sorted, and then merged. Thus, the order of incoming axes will influence the result. E.g. inputs [B,S,F] and [B,F,S], with axes=["S","F"], will get different results, although the output shape is [B,S*F] in both cases. This is bad: In general, other layers in RETURNN might reorder the axes for various reasons, and all layers should behave in the same way, no matter the order. It is recommended to set keep_order=True, such that the order defined in axes defines the behavior, and not the incoming axis order. Since behavior version 6, this is already the case.
n_out (int|None)
out_dim (Dim|None)
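Why the incoming axis order matters without keep_order can be seen in a small plain-Python sketch for one batch entry. The helper names are hypothetical; merging "in storage order" stands in for the old default described above.

```python
def merge_in_storage_order(x):
    """Flatten a 2-D nested list row by row, i.e. in the order the axes are stored."""
    return [v for row in x for v in row]


def transpose(x):
    """Swap the two axes of a 2-D nested list."""
    return [list(col) for col in zip(*x)]


# One batch entry with S=2, F=3, stored as [S, F].
x_sf = [[1, 2, 3], [4, 5, 6]]
# The same data stored as [F, S], e.g. after another layer reordered the axes.
x_fs = transpose(x_sf)

# Old default: the result depends on how the input happens to be stored.
old_sf = merge_in_storage_order(x_sf)
old_fs = merge_in_storage_order(x_fs)

# keep_order=True with axes=["S", "F"]: S is always the outer axis,
# so the [F, S] input is first brought into [S, F] order.
new_sf = merge_in_storage_order(x_sf)
new_fs = merge_in_storage_order(transpose(x_fs))
```

Both old results have length S*F = 6, but `old_sf` is `[1, 2, 3, 4, 5, 6]` while `old_fs` is `[1, 4, 2, 5, 3, 6]`. With the order fixed by axes, both inputs give the same result.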

layer_class: Optional[str] = 'merge_dims'[source]¶

classmethod get_out_data_from_opts(name, axes, keep_order=<class 'returnn.util.basic.NotSpecified'>, sources=(), n_out=<class 'returnn.util.basic.NotSpecified'>, out_type=None, out_dim=None, **kwargs)[source]¶

Parameters:

name (str)
axes (Sequence[Dim|str])
keep_order (bool|NotSpecified)
sources (list[LayerBase])
n_out (int|None|NotSpecified)
out_type (None|dict[str])
out_dim (Dim|None)

Return type:

Data

class returnn.tf.layers.basic.SplitLayer(axis=None, num_splits=None, size_splits=None, out_dims=None, **kwargs)[source]¶

Splits one axis into multiple parts, via tf.split. self.output is simply the input copied. Each part can be accessed via the sublayers “/%i”.
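Accessing the parts through sub-layers might look like this in a network dict. The layer names are made up, the input "x" is assumed to have a feature dimension of 5, and the "copy" layers are only there to show how a sub-layer is referenced.

```python
# Hypothetical network fragment; "x" is assumed to have a feature dimension of 5.
network = {
    "parts": {"class": "split", "from": "x", "size_splits": [2, 3]},
    # Sub-layers "parts/0" and "parts/1" hold the two slices (feature axis by default).
    "first": {"class": "copy", "from": "parts/0"},
    "second": {"class": "copy", "from": "parts/1"},
}
```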

Parameters:

axis (str|None) – feature axis by default
num_splits (int|None)
size_splits (list[int]|None)
out_dims (list[Dim]|None)

layer_class: Optional[str] = 'split'[source]¶

get_sub_layer(layer_name)[source]¶

Parameters:: layer_name (str)
Return type:: LayerBase|None

classmethod get_available_sub_layer_names(parent_layer_kwargs)[source]¶

Parameters:: parent_layer_kwargs (dict[str])
Return type:: list[str]

classmethod get_out_data_from_opts(sources, **kwargs)[source]¶

Parameters:: sources (list[LayerBase])
Return type:: Data

classmethod get_sub_layer_out_data_from_opts(layer_name, parent_layer_kwargs)[source]¶

Parameters:

layer_name (str) – name of the sub_layer (right part of ‘/’ separated path)
parent_layer_kwargs (dict[str]) – kwargs for the parent layer (as kwargs in cls.get_out_data_from_opts())

Returns:

Data template, class type of sub-layer, layer opts (transformed)

Return type:

(Data, type, dict[str])|None

class returnn.tf.layers.basic.SplitDimsLayer(axis, dims, pad_to_multiples=None, pad_value=0, **kwargs)[source]¶

Splits one axis into multiple axes. E.g. if you know that your feature-dim is composed of a window, i.e. the input is (batch, time, window * feature), you can set axis="F", dims=(window, -1), and you will get the output (batch, time, window, feature).

If the split axis has a dynamic length, exactly one of the axes that we split into needs to also have a dynamic length. You can e.g. use this to split the input dimension into smaller "chunks" of a fixed window size. E.g. you could have input (batch, time, feature) and set axis="T", dims=(-1, window), to get output (batch, split_time, window, feature). In this case, the exact sequence lengths are lost and everything is padded to multiples of the window size using the given padding value. Use ReinterpretDataLayer to receive back the original sequence lengths after merging.
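The chunking case with padding to a multiple of the window size can be sketched for a single sequence in plain Python. The helper `split_time_into_chunks` is hypothetical and mirrors axis="T", dims=(-1, window) for one batch entry without a feature axis.

```python
def split_time_into_chunks(seq, window, pad_value=0):
    """Split a length-T sequence into (ceil(T / window), window), padding the tail."""
    seq = list(seq)
    remainder = len(seq) % window
    if remainder:
        # Pad up to the next multiple of the window size.
        seq = seq + [pad_value] * (window - remainder)
    return [seq[i : i + window] for i in range(0, len(seq), window)]


chunks = split_time_into_chunks([1, 2, 3, 4, 5], window=2)
```

The result is `[[1, 2], [3, 4], [5, 0]]`: the original length 5 is no longer visible in the output, only the padded length 6.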

Also see SplitBatchTimeLayer. Also see MergeDimsLayer which can undo this operation.

Parameters:

axis (Dim|str) – e.g. “F”
dims (tuple[Dim|int]|list[Dim|int]) – what the axis should be split into. e.g. (window, -1)
pad_to_multiples (bool|None) – If true, input will be padded to the next multiple of the product of the static dims, such that splitting is actually possible. By default this is done iff the axis has a dynamic size
pad_value (int|float) – What pad value to use for pad_to_multiples

layer_class: Optional[str] = 'split_dims'[source]¶

classmethod get_out_data_from_opts(name, axis, dims, pad_to_multiples=None, sources=(), **kwargs)[source]¶

Parameters:

name (str)
axis (Dim|str)
dims (list[Dim|int]|tuple[Dim|int])
pad_to_multiples (bool|None)
sources (list[LayerBase])

Return type:

Data

class returnn.tf.layers.basic.SplitBatchTimeLayer(base, **kwargs)[source]¶

A very specific layer which expects to get input of shape (batch * time, …) and converts it into (batch, time, …), where it recovers the seq-lens from some other layer. See SplitDimsLayer for a more generic layer.

Parameters:: base (LayerBase) – used to recover the seq-lens

layer_class: Optional[str] = 'split_batch_time'[source]¶

get_dep_layers()[source]¶

Return type:: list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]¶

Parameters:

d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer

classmethod get_out_data_from_opts(name, base, sources=(), **kwargs)[source]¶

Parameters:

name (str)
base (LayerBase)
sources (list[LayerBase])

Return type:

Data

class returnn.tf.layers.basic.ReshapeLayer(in_dims, out_dims, extra_deps=(), **kwargs)[source]¶

Allows reshaping (…, in_dims, …) to (…, out_dims, …) as long as prod(in_dims) == prod(out_dims).

in_dims don’t need to be directly behind each other or in that order – internally it will permute it such that it is in the right order. out_dims should be defined.

This can be used for clever indexing, slicing, padding tricks. It can also be used as an alternative to SplitDimsLayer or MergeDimsLayer.

Parameters:

in_dims (Sequence[Dim|str])
out_dims (Sequence[Dim|str])
extra_deps (Sequence[LayerBase]) – Just add as an additional dependency, without really using it. This is to potentially define otherwise unknown out_dims.

layer_class: Optional[str] = 'reshape'[source]¶

get_dep_layers()[source]¶

Return type:: list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]¶

Parameters:

d (dict[str]) – will modify inplace
network (returnn.tf.network.TFNetwork)
get_layer (((str) -> LayerBase)) – function to get or construct another layer

classmethod get_out_data_from_opts(name, sources, in_dims, out_dims, **kwargs)[source]¶

Parameters:

name (str)
sources (list[LayerBase])
in_dims (Sequence[Dim|str])
out_dims (Sequence[Dim|str])

class returnn.tf.layers.basic.FlattenBatchLayer(axis='T', batch_major=True, **kwargs)[source]¶

Merges one axis into the batch axis. If the axis has dynamic lengths, this would use flattening, i.e. recalculate the padding, i.e. the size changes. This basically wraps flatten_with_seq_len_mask() or flatten_with_seq_len_mask_time_major(). See also MergeDimsLayer, which does not do flattening, i.e. the size stays the same.
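The difference between batch-major and time-major flattening can be sketched in plain Python on a padded (B,T) batch. The helper `flatten_with_seq_lens` is hypothetical and only mimics the idea; it is not the RETURNN utility function.

```python
def flatten_with_seq_lens(batch, seq_lens, batch_major=True):
    """Merge the time axis of a (B, T) batch into the batch axis, dropping padded frames."""
    if batch_major:
        # All frames of sequence 0, then all frames of sequence 1, ...
        return [row[t] for row, n in zip(batch, seq_lens) for t in range(n)]
    # Time-major: frame 0 of every sequence, then frame 1 of every sequence, ...
    max_len = max(seq_lens)
    return [row[t] for t in range(max_len) for row, n in zip(batch, seq_lens) if t < n]


batch = [[1, 2, 3], [4, 5, 0]]  # second sequence has length 2
seq_lens = [3, 2]
flat_bm = flatten_with_seq_lens(batch, seq_lens, batch_major=True)
flat_tm = flatten_with_seq_lens(batch, seq_lens, batch_major=False)
```

Both results have sum(seq_lens) = 5 entries instead of B*T = 6, which is the size change mentioned above.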

Parameters:

axis (str)
batch_major (bool) – if False, will flatten in time-major manner

layer_class: Optional[str] = 'flatten_batch'[source]¶

classmethod get_out_data_from_opts(sources, name, axis='T', batch_major=True, **kwargs)[source]¶

Parameters:

sources (list[LayerBase])
name (str)
axis (str)
batch_major (bool) – if False, will flatten in time-major manner

Return type:

Data

class returnn.tf.layers.basic.UnflattenBatchLayer(**kwargs)[source]¶

Inverse of FlattenBatchLayer, so recovers an axis previously merged into the batch axis

This basically wraps unflatten_with_seq_len_mask().

Parameters:

in_dim (Dim|None)
out_shape (set[Dim|returnn.tf.util.data._MarkedDim]|tuple|list|None)
dropout (float) – 0.0 means to apply no dropout. dropout will only be applied during training
dropout_axis (Dim|str|list[Dim|str]|None)
dropout_noise_shape (dict[Dim|str|list[Dim|str]|tuple[Dim|str],int|str|None]|None) – see Data.get_bc_shape()
dropout_on_forward (bool) – apply dropout during inference
mask (str|None) – “dropout” or “unity” or None. this is obsolete and only here for historical reasons

layer_class: Optional[str] = 'unflatten_batch'[source]¶

classmethod get_out_data_from_opts(sources, name, **kwargs)[source]¶

Parameters:

sources (list[LayerBase])
name (str)

Return type:

Data

class returnn.tf.layers.basic.UnflattenNdLayer(sizes, num_axes, in_dim='T', out_dims=None, declare_same_sizes_as=None, **kwargs)[source]¶

This keeps the batch axis as-is, i.e. the flattening/unflattening did not happen on the batch axis.

Example:

Assumes that the input is of shape (B,T,<Ds>) which represents flattened images, where each image is of size width * height. We additionally provide these image sizes (shape (B,2)), i.e. (width,height) tuples. We return the unflattened images of shape (B,W,H,<Ds>), where W/H are the max width/height.

This basically wraps returnn.tf.util.basic.unflatten_nd().

Parameters:

sizes (LayerBase)
num_axes (int)
in_dim (Dim|str|None)
out_dims (list[Dim]|None)
declare_same_sizes_as (dict[int,LayerBase]|None)

layer_class: Optional[str] = 'unflatten_nd'[source]¶

recurrent = True[source]¶

get_dep_layers()[source]¶

Return type:: list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]¶

Parameters:

d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer

classmethod get_out_data_from_opts(name, sources, num_axes, in_dim='T', out_dims=None, declare_same_sizes_as=None, **kwargs)[source]¶

Parameters:

name (str)
sources (list[LayerBase])
num_axes (int)
in_dim (Dim|str|None)
out_dims (list[Dim]|None)
declare_same_sizes_as (dict[int,LayerBase]|None)

Return type:

Data

class returnn.tf.layers.basic.ExpandDimsLayer(axis, dim=1, **kwargs)[source]¶

Adds some axis.

Parameters:

axis (str|int) – axis to add, e.g. "F"|"feature" or "spatial"|"time"|"T". if this is an integer, the input data is first converted into batch-major mode, and then this is counted with batch-dim.
dim (int|Dim) – dimension of new axis (1 by default)

layer_class: Optional[str] = 'expand_dims'[source]¶

classmethod get_out_data_from_opts(name, axis, dim=1, sources=(), **kwargs)[source]¶

Parameters:

name (str)
axis (str|int)
dim (int|Dim)
sources (list[LayerBase])

Return type:

Data

class returnn.tf.layers.basic.RepeatLayer(repetitions, axis='T', out_dim=None, **kwargs)[source]¶

A wrapper around tf.repeat, but supports an additional batch axis for the durations. The sum of the repetitions has to be non-zero for each sequence in the batch.

This layer can only be used with TensorFlow 1.15.0 or newer.
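Per-sequence repetitions of shape [B,T] can be sketched in plain Python. The helper `repeat_in_time` is hypothetical; padding the result with 0 up to the longest output sequence is an assumption of this sketch.

```python
def repeat_in_time(batch, repetitions):
    """Repeat each frame of a (B, T) batch by its own count; pad to the max new length."""
    repeated = [
        [v for v, r in zip(row, reps) for _ in range(r)]
        for row, reps in zip(batch, repetitions)
    ]
    new_seq_lens = [len(row) for row in repeated]
    max_len = max(new_seq_lens)
    padded = [row + [0] * (max_len - len(row)) for row in repeated]
    return padded, new_seq_lens


out, out_lens = repeat_in_time([[7, 8, 9], [1, 2, 3]], [[2, 0, 1], [1, 1, 2]])
```

A single frame may be repeated zero times (the 8 in the first sequence is dropped), as long as the repetitions of each sequence do not sum to zero.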

Parameters:

repetitions (LayerBase|int) – number of repetitions for each sequence and position in target axis. Can be [B,T] or [T,B] or some subset of that shape
axis (Dim|str) – (dynamic) axis for repetition (currently only time axis is supported)
out_dim (Dim|None)

layer_class: Optional[str] = 'repeat'[source]¶

get_dep_layers()[source]¶

Return type:: list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]¶

Parameters:

d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer

classmethod get_out_data_from_opts(name, sources, axis, repetitions, out_dim=None, **kwargs)[source]¶

Parameters:

name (str)
sources (list[LayerBase])
axis (Dim|str)
repetitions (LayerBase|int)
out_dim (Dim|None)

Return type:

Data

class returnn.tf.layers.basic.TileLayer(multiples, out_dims=None, **kwargs)[source]¶

A wrapper around tf.tile.

Parameters:

multiples (dict[Dim|str, int]) – number of multiples per axis (axis provided as dim tag or str desc)
out_dims (dict[Dim|str, Dim]|None)
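
A minimal sketch of how this layer might appear in a network config dict. The layer names "data" and "tiled" and the "from" key follow common RETURNN config conventions and are assumptions here; only the class name and the multiples option come from this documentation.

```python
# Hypothetical network fragment; layer names are illustrative only.
network = {
    "tiled": {
        "class": "tile",  # layer_class of TileLayer
        "from": "data",
        "multiples": {"F": 2},  # tile the feature axis twice
    },
}
```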

layer_class: Optional[str] = 'tile'[source]¶

classmethod get_out_data_from_opts(name, sources, multiples, out_dims=None, **kwargs)[source]¶

Parameters:

name (str)
sources (list[LayerBase])
multiples (dict[Dim|str, int])
out_dims (dict[Dim|str, Dim]|None)

Return type:

Data

class returnn.tf.layers.basic.CastLayer(dtype, output, **kwargs)[source]¶

Cast to some other dtype.

Parameters:

dtype (str)
output (Data)

layer_class: Optional[str] = 'cast'[source]¶

classmethod get_out_data_from_opts(dtype, **kwargs)[source]¶

Parameters:: dtype (str)
Return type:: Data

class returnn.tf.layers.basic.SwapAxesLayer(axis1, axis2, **kwargs)[source]¶

Swaps two axes. Basically a wrapper around returnn.tf.util.basic.swapaxes(). Note that this is usually not needed, and using it is discouraged because it is unnecessarily inefficient. Normally, all RETURNN layers will automatically transpose the input data into whatever format they need.

All axes always have a special meaning (e.g. feature dim or time dim) or dimension tag (e.g. for time axes, including dyn seq lengths). If you need to change the meaning (and not actually transpose / swap axes), you need to use ReinterpretDataLayer.

See also TransposeLayer for a more generic variant.

See also ReinterpretDataLayer, which does not swap/transpose axes, but allows reinterpreting their meaning / dim tags.

Parameters:

axis1 (int|str)
axis2 (int|str)

layer_class: Optional[str] = 'swap_axes'[source]¶

classmethod get_out_data_from_opts(name, sources, axis1, axis2, **kwargs)[source]¶

Parameters:

name (str)
sources (list[LayerBase])
axis1 (int|str)
axis2 (int|str)

Return type:

Data

class returnn.tf.layers.basic.TransposeLayer(perm: Dict[Dim | str | int, Dim | str] | Sequence[Dim], **kwargs)[source]¶

Basically a wrapper around tf.transpose().

Note that this is usually not needed, and using it is discouraged because it is unnecessarily inefficient. Normally, all RETURNN layers will automatically transpose the input data into whatever format they need.

See also ReinterpretDataLayer, which does not transpose axes, but allows reinterpreting their meaning / dim tags.

One valid use case is to use this for the final output layer, to make sure the output is in the correct format.

Parameters:: perm – target axis -> source axis
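
The "target axis -> source axis" convention can be illustrated with a small plain-Python sketch. The helper perm_to_int is hypothetical and handles only string axis names; it is not the library's get_perm_int(), just an illustration of how such a mapping could be turned into an integer permutation.

```python
from typing import Dict, List


def perm_to_int(axes: List[str], perm: Dict[str, str]) -> List[int]:
    """For each output position, return the index of the input axis placed there."""
    result = []
    for target in axes:
        source = perm.get(target, target)  # axes not mentioned stay in place
        result.append(axes.index(source))
    return result


axes = ["B", "T", "F"]
# Put the time axis where batch was, and the batch axis where time was.
perm_int = perm_to_int(axes, {"B": "T", "T": "B"})
out_axes = [axes[i] for i in perm_int]
```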

layer_class: Optional[str] = 'transpose'[source]¶

Parameters:

input_data
perm
name

Returns:

transposed data

classmethod get_perm_int(input_data: Tensor, perm: Dict[Dim | str | int, Dim | str] | Sequence[Dim]) → List[int][source]¶

Parameters:

input_data
perm

classmethod get_out_data_from_opts(name, sources, perm, **kwargs)[source]¶

Parameters:

name (str)
sources (list[LayerBase])
perm (dict[str,str]) – target axis -> source axis

Return type:

Data

class returnn.tf.layers.basic.ReinterpretDataLayer(switch_axes=None, size_base=None, batch_dim_base=None, set_axes=None, set_dim_tags=None, enforce_batch_major=False, enforce_time_major=False, set_sparse=None, set_sparse_dim=<class 'returnn.util.basic.NotSpecified'>, increase_sparse_dim=None, **kwargs)[source]¶

Acts like the CopyLayer but reinterprets the role of some axes or data.

Parameters:

switch_axes (str|list[str]) – e.g. “bt” to switch batch and time axes
size_base (LayerBase|None) – copy the size_placeholder from the given layer
batch_dim_base (LayerBase|None) – copy the batch dim from this layer
set_axes (dict[str,Dim|str|None]) – This can be used to overwrite the special axes like time_dim_axis or feature_dim_axis. For that, use keys “B”,”T” or “F”, and a value via Data.get_axis_from_description().
set_dim_tags (dict[str|Dim,Dim]|Sequence[Tuple[Dim,Dim]]|None) – axis -> new dim tag. assigns new dim tags. If the passed dim tag is yet undefined, this will not use same_dim_tags_as (declare_same_as) but create a new dim tag. This option is useful for generalized self attention (https://github.com/rwth-i6/returnn/issues/391).
enforce_batch_major (bool)
enforce_time_major (bool)
set_sparse (bool|None) – if bool, set sparse value to this
set_sparse_dim (Dim|int|None|NotSpecified) – set sparse dim to this. assumes that it is sparse
increase_sparse_dim (int|None) – add this to the dim. assumes that it is sparse
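
A minimal sketch of a config entry using one of these options. The layer name "labels_ext", the source "data:classes" and the "from" key follow common RETURNN config conventions and are assumptions here; only the class name and increase_sparse_dim are taken from this documentation.

```python
# Hypothetical network fragment; layer and source names are illustrative only.
network = {
    "labels_ext": {
        "class": "reinterpret_data",  # layer_class of ReinterpretDataLayer
        "from": "data:classes",  # assumed to be sparse
        "increase_sparse_dim": 1,  # e.g. make room for one extra label
    },
}
```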

layer_class: Optional[str] = 'reinterpret_data'[source]¶

get_dep_layers()[source]¶

Return type:: list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]¶

Parameters:

d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer

classmethod get_out_data_from_opts(name, sources, switch_axes=None, size_base=None, batch_dim_base=None, set_axes=None, set_dim_tags=None, enforce_batch_major=False, enforce_time_major=False, set_sparse=None, set_sparse_dim=<class 'returnn.util.basic.NotSpecified'>, increase_sparse_dim=None, **kwargs)[source]¶

Parameters:

name (str)
sources (list[LayerBase])
switch_axes (str|list[str]) – e.g. “bt” to switch batch and time axes
size_base (LayerBase|None) – similar as size_target
batch_dim_base (LayerBase|None)
set_axes (dict[str,Dim|str|None])
set_dim_tags (dict[str|Dim,Dim]|Sequence[Tuple[Dim,Dim]]|None)
enforce_batch_major (bool)
enforce_time_major (bool)
set_sparse (bool|None) – if bool, set sparse value to this
set_sparse_dim (Dim|int|None|NotSpecified) – set sparse dim to this. assumes that it is sparse
increase_sparse_dim (int|None) – add this to the dim. assumes that it is sparse

class returnn.tf.layers.basic.ConvLayer(filter_size, padding, strides=1, dilation_rate=1, groups=1, input_expand_dims=0, input_add_feature_dim=False, input_split_feature_dim=None, in_dim=None, in_spatial_dims=None, n_out=None, out_dim=None, out_spatial_dims=None, auto_use_channel_first=<class 'returnn.util.basic.NotSpecified'>, with_bias=<class 'returnn.util.basic.NotSpecified'>, activation=None, forward_weights_init='glorot_uniform', bias_init=0.0, filter=None, filter_perm=None, bias=None, use_time_mask=False, pad_seq_len_to_power=None, **kwargs)[source]¶

A generic convolution layer which supports 1D, 2D and 3D convolution. Pooling can be done in the separate “pool” layer.

Parameters:

filter_size (Sequence[Dim]|Sequence[int]) – (width,), (height,width) or (depth,height,width) for 1D/2D/3D conv. The input data ndim must match, or you can add dimensions via input_expand_dims or input_add_feature_dim. It will automatically swap the batch-dim to the first axis of the input data.
padding (str|int|Sequence[int]) – “same”, “valid” or “same_static”. “same_static” is calculated differently depending on whether an axis is static or dynamic. For static axes, “same_static” padding is the same as “same” padding, i.e. filter_size - 1 - (T + strides - 1) % strides. For dynamic axes, “same_static” calculates the total padding size as filter_size - 1, i.e. it is independent of the length T of the axis and the striding. For dynamic axes, to avoid skipping any frames on the right, we set left_padding = (filter_size - strides) // 2.
strides (int|Sequence[int]) – strides for the spatial dims, i.e. length of this tuple should be the same as filter_size, or a single int.
dilation_rate (int|Sequence[int]) – dilation for the spatial dims
groups (int) – grouped convolution
in_dim (Dim|None)
in_spatial_dims (Sequence[Dim|str]|None)
n_out (int|None) – number of outgoing features
out_dim (Dim|None)
out_spatial_dims (Sequence[Dim]|None)
input_expand_dims (int) – number of spatial dims to add to the input
input_add_feature_dim (bool) – will add a dim at the end and use input-feature-dim == 1, and use the original input feature-dim as a spatial dim.
input_split_feature_dim (None|int) – if set, like input_add_feature_dim it will add a new feature dim which is of value input_split_feature_dim, and the original input feature dim will be divided by input_split_feature_dim, thus it must be a multiple of that value.
auto_use_channel_first (bool|NotSpecified) – convert the input to NCHW or not
with_bias (bool|NotSpecified) – if True, will add a bias to the output features. True by default since behavior version 10.
activation (None|str) – if set, will apply this function at the end
filter (LayerBase|None) – if given, will not create an own parameter, but use this as the filter
filter_perm (dict[str,str]|None) – transposes the filter (input filter as layer)
bias (LayerBase|None) – if given, will not create an own parameter, but use this as the bias
use_time_mask (bool)
pad_seq_len_to_power (Optional[float]) – pad sequence length to power of given number to reduce number of different sequence lengths. See https://github.com/rwth-i6/returnn/issues/1450 and https://github.com/tensorflow/tensorflow/issues/62441.
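
The “same_static” padding arithmetic described for the padding option can be sketched in plain Python. The helper name same_static_padding and the left/right split for static axes are assumptions made for illustration; only the total padding formulas and left_padding for dynamic axes are taken from the description above. The sketch assumes filter_size >= strides.

```python
from typing import Tuple


def same_static_padding(filter_size: int, stride: int, length: int, dynamic: bool) -> Tuple[int, int]:
    """Return (left, right) padding for a single axis."""
    if dynamic:
        # Independent of the axis length T and of how T relates to the stride.
        total = filter_size - 1
        left = (filter_size - stride) // 2
    else:
        # Same as "same" padding for a statically known length.
        total = filter_size - 1 - (length + stride - 1) % stride
        left = total // 2  # assumption: any odd remainder goes to the right
    return left, total - left


dyn = same_static_padding(filter_size=3, stride=2, length=5, dynamic=True)
static_odd = same_static_padding(filter_size=3, stride=2, length=5, dynamic=False)
static_even = same_static_padding(filter_size=3, stride=2, length=4, dynamic=False)
```

In this example the dynamic case always pads two frames in total, while the static case pads two frames for length 5 but only one for length 4.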

layer_class: Optional[str] = 'conv'[source]¶

recurrent = True[source]¶

classmethod set_output_dim_tags(output, num_batch_dims, in_spatial_dims, out_spatial_dims, filter_size, strides, dilation_rate, padding)[source]¶

Parameters:

output (Data)
num_batch_dims (int)
in_spatial_dims (Sequence[Dim])
out_spatial_dims (Sequence[Dim]|None)
filter_size (Sequence[int|Dim])
strides (Sequence[int])
dilation_rate (Sequence[int])
padding (str|int|Sequence[int])

classmethod transform_input(input_data, network, in_dim=None, in_spatial_dims=None, input_expand_dims=0, input_split_feature_dim=None, input_add_feature_dim=False, use_time_mask=False, mask_value: float = 0.0)[source]¶

Parameters:

input_data (Data)
network (returnn.tf.network.TFNetwork)
in_dim (Dim|None)
in_spatial_dims (list[Dim|str]|None)
input_expand_dims (int) – number of spatial dims to add to the input
input_split_feature_dim (None|int) – if set, like input_add_feature_dim it will add a new feature dim which is of value input_split_feature_dim, and the original input feature dim will be divided by input_split_feature_dim, thus it must be a multiple of that value.
input_add_feature_dim (bool) – will add a dim at the end and use input-feature-dim == 1, and use the original input feature-dim as a spatial dim.
use_time_mask (bool)
mask_value – when use_time_mask is used, what value to use for the mask

Returns:

(transformed input, num batch dims). all batch dims are at the front

Return type:

(Data, int)

classmethod get_input_placeholder_with_same_static_padding(input_data: Tensor, num_batch_dims: int, filter_size: Sequence[int], strides: Sequence[int], out_batch_feature_major: bool) → Tensor[source]¶

Returns the placeholder of input_data with same_static padding applied to it.

Parameters:

input_data – [Batch…, Spatial…, Feature] or [Batch…, Feature, Spatial…]
num_batch_dims
filter_size
strides
out_batch_feature_major

classmethod get_input_placeholder_with_int_padding(input_data: Tensor, *, num_batch_dims: int, out_batch_feature_major: bool, padding: int | Sequence[int], pad_value: float = 0.0) → Tensor[source]¶

Returns the placeholder of input_data with the given integer padding applied to it.

Parameters:

input_data – [Batch…, Spatial…, Feature] or [Batch…, Feature, Spatial…]
num_batch_dims
out_batch_feature_major
padding
pad_value

classmethod calc_out_dim(in_dim, filter_size, stride, padding, dilation_rate=1)[source]¶

Parameters:

in_dim (T|int|tf.Tensor|Dim) – dimension in some axis
filter_size (int|Dim) – e.g. 2, for the corresponding axis
stride (int) – e.g. 1, for the corresponding axis
dilation_rate (int) – e.g. 1
padding (str|int) – “valid” or “same”

Returns:

the output dimension

Return type:
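
As a rough illustration, the usual convolution length arithmetic for “valid” and “same” padding can be written as below. This is a sketch for plain integer lengths using the standard TensorFlow conventions, not the library's implementation; the function name conv_out_len is hypothetical.

```python
def conv_out_len(in_len: int, filter_size: int, stride: int, padding: str, dilation_rate: int = 1) -> int:
    """Output length along one axis for integer inputs."""
    padding = padding.lower()
    if padding == "same":
        return -(-in_len // stride)  # ceil(in_len / stride)
    if padding == "valid":
        # A dilated filter covers (filter_size - 1) * dilation_rate + 1 input frames.
        effective = (filter_size - 1) * dilation_rate + 1
        return max(-(-(in_len - effective + 1) // stride), 0)
    raise ValueError("unsupported padding %r" % padding)
```

For example, an axis of length 10 with filter_size=3 gives 8 frames for “valid” with stride 1, and 4 frames with stride 2.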

classmethod get_out_data_from_opts(name, sources, network, filter_size, padding, strides=1, dilation_rate=1, input_expand_dims=0, input_add_feature_dim=False, input_split_feature_dim=None, in_dim=None, in_spatial_dims=None, n_out=None, out_dim=None, out_spatial_dims=None, auto_use_channel_first=<class 'returnn.util.basic.NotSpecified'>, **kwargs)[source]¶

Parameters:

name (str)
sources (Sequence[LayerBase])
network (returnn.tf.network.TFNetwork)
filter_size (Sequence[int|Dim])
padding (str|int|Sequence[int])
strides (int|Sequence[int])
dilation_rate (int|Sequence[int])
input_expand_dims (int) – number of spatial dims to add to the input
input_add_feature_dim (bool)
input_split_feature_dim (None|int)
in_dim (Dim|None)
in_spatial_dims (Sequence[Dim|str]|None)
n_out (int|None) – number of outgoing features
out_dim (Dim|None)
out_spatial_dims (Sequence[Dim]|None)
auto_use_channel_first (bool|NotSpecified)

Return type:

Data

get_dep_layers()[source]¶

Return type:: list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]¶

Parameters:

d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer

class returnn.tf.layers.basic.PoolLayer(mode, pool_size, padding='VALID', dilation_rate=1, strides=None, in_dim=None, in_spatial_dims=None, out_dim=None, out_spatial_dims=None, use_channel_first=<class 'returnn.util.basic.NotSpecified'>, use_time_mask=False, **kwargs)[source]¶

A generic N-D pooling layer. This would usually be done after a convolution for down-sampling.

Parameters:

mode (str) – “max” or “avg”
pool_size (Sequence[int]) – shape of the window of each reduce
padding (str|int|Sequence[int]) – “same”, “valid” or “same_static”. “same_static” is calculated differently depending on whether an axis is static or dynamic. For static axes, “same_static” padding is the same as “same” padding, i.e. filter_size - 1 - (T + strides - 1) % strides. For dynamic axes, “same_static” calculates the total padding size as filter_size - 1, i.e. it is independent of the length T of the axis and the striding. For dynamic axes, to avoid skipping any frames on the right, we set left_padding = (filter_size - strides) // 2.
dilation_rate (Sequence[int]|int)
strides (Sequence[int]|int|None) – in contrast to tf.nn.pool, the default (if it is None) will be set to pool_size
in_dim (Dim|None)
in_spatial_dims (Sequence[Dim|str]|None)
out_dim (Dim|None)
out_spatial_dims (Sequence[Dim]|None)
use_channel_first (bool|NotSpecified) – if set, will transform input to NCHW format
use_time_mask (bool)
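
The effect of the strides default can be illustrated with a small 1D max-pooling sketch in plain Python, using “valid” padding. The helper max_pool_1d is hypothetical and only mirrors the documented rule that strides falls back to pool_size when it is None.

```python
from typing import List, Optional, Sequence


def max_pool_1d(values: Sequence[float], pool_size: int, strides: Optional[int] = None) -> List[float]:
    """1D max pooling with "valid" padding."""
    if strides is None:
        strides = pool_size  # documented default: non-overlapping windows
    return [
        max(values[start:start + pool_size])
        for start in range(0, len(values) - pool_size + 1, strides)
    ]


frames = [1, 5, 2, 4, 3, 0]
pooled_default = max_pool_1d(frames, pool_size=2)
pooled_overlap = max_pool_1d(frames, pool_size=2, strides=1)
```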

layer_class: Optional[str] = 'pool'[source]¶

recurrent = True[source]¶

classmethod get_out_data_from_opts(name, sources, network, pool_size, strides=None, dilation_rate=1, padding='VALID', in_dim=None, in_spatial_dims=None, out_dim=None, out_spatial_dims=None, use_channel_first=<class 'returnn.util.basic.NotSpecified'>, **kwargs)[source]¶

Parameters:

name (str)
sources (list[LayerBase])
network (returnn.tf.network.TFNetwork)
pool_size (Sequence[int])
strides (Sequence[int]|int)
dilation_rate (int|Sequence[int])
padding (str|int|Sequence[int])
in_dim (Dim|None)
in_spatial_dims (Sequence[Dim|str]|None)
out_dim (Dim|None)
out_spatial_dims (Sequence[Dim]|None)
use_channel_first (bool|NotSpecified)

Return type:

Data

class returnn.tf.layers.basic.DctLayer(type=2, n=None, norm=None, **kwargs)[source]¶

Layer to perform DCT. Wraps tf.signal.dct(). For further documentation on the input arguments, refer to https://www.tensorflow.org/api_docs/python/tf/signal/dct

Parameters:

type (int) – DCT type to perform. Must be 1, 2, 3, or 4
n (int|None) – length of the transform
norm (str|None) – normalization to apply. Must be None or “ortho”
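
For orientation, a type-2 DCT can be written directly in plain Python. The sketch below uses the scaling convention of scipy.fftpack.dct, which tf.signal.dct is understood to follow; treat that correspondence as an assumption and check the TensorFlow documentation linked above. The function name dct2 is hypothetical.

```python
import math
from typing import List, Optional, Sequence


def dct2(x: Sequence[float], norm: Optional[str] = None) -> List[float]:
    """Type-2 DCT of a 1D sequence, optionally with "ortho" normalization."""
    n = len(x)
    out = []
    for k in range(n):
        s = 2.0 * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        if norm == "ortho":
            # Scaling that makes the transform orthonormal.
            s *= math.sqrt(1.0 / (4 * n)) if k == 0 else math.sqrt(1.0 / (2 * n))
        out.append(s)
    return out


flat = dct2([1.0, 1.0, 1.0, 1.0])
flat_ortho = dct2([1.0, 1.0, 1.0, 1.0], norm="ortho")
```

A constant input puts all of its energy into the first coefficient, and with norm="ortho" the sum of squares of the input is preserved.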

layer_class: Optional[str] = 'dct'[source]¶

recurrent = True[source]¶

classmethod get_out_data_from_opts(name, sources, **kwargs)[source]¶

Parameters:

name (str)
sources (list[LayerBase])

Return type:

Data

class returnn.tf.layers.basic.TransposedConvLayer(filter_size, strides=None, padding='same', remove_padding=0, output_padding=None, in_dim=None, in_spatial_dims=None, out_dim=None, out_spatial_dims=None, with_bias=True, activation=None, forward_weights_init='glorot_uniform', bias_init=0.0, filter=None, filter_perm=None, bias=None, use_time_mask=False, **kwargs)[source]¶

Transposed convolution, sometimes also called deconvolution. See tf.nn.conv2d_transpose() (currently we support 1D/2D).

Parameters:

filter_size (list[int])
strides (list[int]|None) – specifies the upscaling. by default, same as filter_size
padding (str) – “same” or “valid”
remove_padding (list[int]|int)
output_padding (list[int|None]|int|None)
in_dim (Dim|None)
in_spatial_dims (list[Dim|str]|None)
out_dim (Dim|None)
out_spatial_dims (list[Dim]|None)
with_bias (bool) – whether to add a bias. enabled by default.
activation (str|None)
forward_weights_init
bias_init
filter (LayerBase|None) – if given, will not create an own parameter, but use this as the filter
filter_perm (dict[str,str]|None) – transposes the filter (input filter as layer)
bias (LayerBase|None) – if given, will not create an own parameter, but use this as the bias
use_time_mask (bool)

layer_class: Optional[str] = 'transposed_conv'[source]¶

recurrent = True[source]¶

static deconv_output_length(input_length, filter_size, padding, output_padding=None, stride=0, dilation=1, out_dim=None)[source]¶

Determines output length of a transposed convolution given input length.

Copied from TF/Keras conv_utils.deconv_output_length (https://github.com/tensorflow/tensorflow/blob/5912f51d580551e5cee2cfde4cb882594b4d3e60/tensorflow/python/keras/utils/conv_utils.py#L140), adapted with simplification.

Also see ConvLayer.calc_out_dim().

Parameters:

input_length (T|int|tf.Tensor|Dim)
filter_size (int)
padding (str) – one of “same”, “valid”, “full”.
output_padding (int|None) – amount of padding along the output dimension. Can be set to None in which case the output length is inferred.
stride (int)
dilation (int)
out_dim (Dim|None)

Returns:

The output length (integer)

Return type:
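
The length arithmetic can be sketched as follows for plain integers. This follows the Keras conv_utils helper referenced above as best understood here; RETURNN's version is described as adapted with simplification, so details may differ. The function name transposed_conv_out_len is hypothetical.

```python
from typing import Optional


def transposed_conv_out_len(
    input_length: int,
    filter_size: int,
    padding: str,
    stride: int,
    output_padding: Optional[int] = None,
    dilation: int = 1,
) -> int:
    """Output length of a transposed convolution along one axis."""
    if padding not in ("same", "valid", "full"):
        raise ValueError("unsupported padding %r" % padding)
    # Effective filter size including dilation.
    filter_size = filter_size + (filter_size - 1) * (dilation - 1)
    if output_padding is None:
        # Infer the output length from the padding mode alone.
        if padding == "valid":
            return input_length * stride + max(filter_size - stride, 0)
        if padding == "full":
            return input_length * stride - (stride + filter_size - 2)
        return input_length * stride  # "same"
    pad = {"same": filter_size // 2, "valid": 0, "full": filter_size - 1}[padding]
    return (input_length - 1) * stride + filter_size - 2 * pad + output_padding
```

With “same” padding and no output_padding, the output length is simply the input length times the stride, which is the upscaling behaviour described for the strides option.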

classmethod get_out_data_from_opts(name, sources, network, filter_size, strides=None, padding='same', remove_padding=0, output_padding=None, n_out=None, out_dim=None, out_spatial_dims=None, in_dim=None, in_spatial_dims=None, **kwargs)[source]¶

Parameters:

name (str)
sources (list[LayerBase])
network (returnn.tf.network.TFNetwork)
filter_size (list[int])
strides (list[int]|None)
padding (str)
remove_padding (list[int]|int)
output_padding (list[int|None]|int|None)
n_out (int|None) – number of outgoing features
out_dim (Dim|None)
out_spatial_dims (list[Dim]|None)
in_dim (Dim|None)
in_spatial_dims (list[Dim|str]|None)

Return type:

Data

get_dep_layers()[source]¶

Return type:: list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]¶

Parameters:

d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer

class returnn.tf.layers.basic.ReduceLayer(mode, axes=None, axis=None, keep_dims=False, enforce_batch_dim_axis=None, use_time_mask=None, **kwargs)[source]¶

This reduces some axis by using e.g. “sum” or “max”. It’s basically a wrapper around tf.reduce_sum or tf.reduce_max.

Parameters:

mode (str) – “sum” or “max”, “argmin”, “min”, “argmax”, “mean”, “logsumexp”
axes (Sequence[Dim|str]) – One axis or multiple axes to reduce. It accepts the special tokens “B”|”batch”, “spatial”, “spatial_except_time”, or “F”|”feature”, and it is strongly recommended to use some of these symbolic names. See Data.get_axes_from_description().
axis (Dim|str) – for compatibility, can be used instead of axes
keep_dims (bool) – if dimensions should be kept (will be 1)
enforce_batch_dim_axis (int|None) – will swap the batch-dim-axis of the input with the given axis. e.g. 0: will convert the input into batch-major format if not already like that. Note that this is still not enough in some cases, e.g. when the other axes are also not as expected. The strong recommendation is to use a symbolic axis description.
use_time_mask (bool) – if we reduce over the time-dim axis, use the seq len info. By default, in that case, it will be True.
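
Why the time mask matters can be seen in a small plain-Python sketch of a mean over a padded time axis. The helper names and values are hypothetical; the point is only that padded frames must not contribute to the reduction.

```python
from typing import List, Sequence


def mean_over_time(batch: Sequence[Sequence[float]]) -> List[float]:
    """Mean over the full padded time axis (no masking)."""
    return [sum(seq) / len(seq) for seq in batch]


def masked_mean_over_time(batch: Sequence[Sequence[float]], seq_lens: Sequence[int]) -> List[float]:
    """Mean over the valid frames only, using the sequence lengths."""
    return [sum(seq[:n]) / n for seq, n in zip(batch, seq_lens)]


# Two sequences of lengths 3 and 2, zero-padded to length 4.
padded = [[1.0, 2.0, 3.0, 0.0], [4.0, 6.0, 0.0, 0.0]]
lens = [3, 2]
unmasked = mean_over_time(padded)
masked = masked_mean_over_time(padded, lens)
```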

layer_class: Optional[str] = 'reduce'[source]¶

classmethod reduce(input_data, mode, axes=None, keep_dims=False, enforce_batch_dim_axis=None, use_time_mask=None)[source]¶

Parameters:

input_data (Data)
mode (str) – “sum” or “max”, “argmin”, “min”, “argmax”, “mean”, “logsumexp”
axes (int|list[int]|str) – One axis or multiple axes to reduce. It accepts the special tokens “B”|”batch”, “spatial”, “spatial_except_time”, or “F”|”feature”, and it is strongly recommended to use some of these symbolic names. See Data.get_axes_from_description().
keep_dims (bool) – if dimensions should be kept (will be 1)
enforce_batch_dim_axis (int) – will swap the batch-dim-axis of the input with the given axis. e.g. 0: will convert the input into batch-major format if not already like that. Note that this is still not enough in some cases, e.g. when the other axes are also not as expected. The strong recommendation is to use a symbolic axis description.
use_time_mask (bool) – if we reduce over the time-dim axis, use the seq len info. By default, in that case, it will be True.

Return type:

tf.Tensor

classmethod need_enforce_batch_dim_axis(axes)[source]¶

Parameters:: axes (int|list[int]|str|Dim)
Returns:: if any integer is in axes, thus we should have a fixed dimension layout
Return type:: bool

classmethod get_axes(axis, input_data)[source]¶

Parameters:

axis – see self.__init__()
input_data (Data)

Returns:

list of axes

Return type:

list[int]

classmethod get_out_data_from_opts(name, sources, mode='', axes=None, axis=None, keep_dims=False, enforce_batch_dim_axis=None, **kwargs)[source]¶

Parameters:

name (str)
sources (list[LayerBase])
mode (str) – (default here “” because other code uses this function)
axes (str|list[str]|None)
axis (str|None)
keep_dims (bool)
enforce_batch_dim_axis (int|None)

Return type:

Data

class returnn.tf.layers.basic.ReduceOutLayer(mode, num_pieces, out_dim=None, **kwargs)[source]¶

Combination of SplitDimsLayer applied to the feature dim and ReduceLayer applied to the resulting feature dim. This can e.g. be used to do maxout.

Parameters:

mode (str) – “sum” or “max” or “mean”
num_pieces (int) – how many elements to reduce. The output dimension will be input.dim // num_pieces.
out_dim (Dim|None)
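
A plain-Python sketch of the idea for a single feature vector. It assumes that consecutive features form one group of num_pieces elements; the actual grouping used by the layer is not stated here, and the helper name reduce_out is hypothetical.

```python
from typing import List, Sequence


def reduce_out(features: Sequence[float], num_pieces: int, mode: str = "max") -> List[float]:
    """Split the feature vector into groups of num_pieces and reduce each group."""
    if len(features) % num_pieces != 0:
        raise ValueError("feature dim must be a multiple of num_pieces")
    reducers = {"max": max, "sum": sum, "mean": lambda g: sum(g) / len(g)}
    groups = [features[i:i + num_pieces] for i in range(0, len(features), num_pieces)]
    return [reducers[mode](group) for group in groups]


feat = [1, 4, 2, 3, 9, 0]
maxout = reduce_out(feat, num_pieces=2, mode="max")
summed = reduce_out(feat, num_pieces=2, mode="sum")
```

The output has len(feat) // num_pieces entries, matching the output dimension stated for num_pieces.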

layer_class: Optional[str] = 'reduce_out'[source]¶

classmethod get_out_data_from_opts(num_pieces, sources, name, out_dim=None, **kwargs)[source]¶

Parameters:

num_pieces (int)
sources (list[LayerBase])
name (str)
out_dim (Dim|None)

Return type:

Data

class returnn.tf.layers.basic.SqueezeLayer(axis, enforce_batch_dim_axis=None, allow_no_op=False, **kwargs)[source]¶

Removes an axis with dimension 1. This is basically a wrapper around tf.squeeze.

Parameters:

axis (Dim|int|list[int]|str) – one axis or multiple axes to squeeze. This is counted with batch-dim, which by default is axis 0 (see enforce_batch_dim_axis). It also accepts the special tokens “B”|”batch”, “spatial”, “spatial_except_time”, or “F”|”feature”
enforce_batch_dim_axis (int|None)
allow_no_op (bool)

layer_class: Optional[str] = 'squeeze'[source]¶

classmethod get_out_data_from_opts(axis, enforce_batch_dim_axis=None, allow_no_op=False, sources=(), **kwargs)[source]¶

Parameters:

axis (Dim|int|list[int]|str)
enforce_batch_dim_axis (int|None)
allow_no_op (bool)
sources (list[LayerBase])

Return type:

Data

class returnn.tf.layers.basic.StackLayer(axis=None, out_spatial_dim=None, **kwargs)[source]¶

Stacks multiple inputs together using tf.stack(). This creates a new dimension for the stack.

For concatenation (in feature dimension), see CopyLayer.

Parameters:

axis (int|None) – new axis. If not given, will use Data.get_default_new_axis_for_dim_tag(<spatial>), i.e. some reasonable default for a new spatial axis.
out_spatial_dim (Dim|None)

layer_class: Optional[str] = 'stack'[source]¶

classmethod get_out_data_from_opts(name, sources, axis=None, out_spatial_dim=None, **kwargs)[source]¶

Parameters:

name (str)
sources (list[LayerBase])
axis (int|None)
out_spatial_dim (Dim|None)

Return type:

Data

class returnn.tf.layers.basic.WeightedSumLayer(axes, padding=None, size=None, keep_dims=None, **kwargs)[source]¶

Calculates a weighted sum, either over a complete axis of fixed dimension, or over some window. Can also do that for multiple axes. The weights are a trainable parameter matrix. A similar result can be obtained with ElemwiseProdLayer and ReduceLayer, or with a DotLayer and a VariableLayer. See also LinearLayer.

Parameters:

axes (str|list[str]) – the axes to do the weighted-sum over
padding (str) – “valid” or “same”, in case of keep_dims=True
size (None|tuple[int]) – the kernel-size. If omitted, the axes must be of fixed dimension, and we will use keep_dims=False, padding=”valid” by default. If given, you must also provide padding, and keep_dims=True is the default.
keep_dims (bool) – if False, the axes will be squeezed away. see also size.

layer_class: Optional[str] = 'weighted_sum'[source]¶

classmethod get_out_data_from_opts(name, sources, axes, padding=None, size=None, keep_dims=None, **kwargs)[source]¶

Parameters:

name (str)
sources (list[LayerBase])
axes (str|list[str])
padding (str|None)
size (None|tuple[int])
keep_dims (bool|None)

Return type:

Data

class returnn.tf.layers.basic.ElemwiseProdLayer(axes, size=None, **kwargs)[source]¶

Element-wise product in some axes. Microsoft calls this “static attention”, in Deep Conv. NN with Layer-wise Context Expansion and Attention (LACE). The matrix/tensor to be used for the product is given as a trainable parameter. See also LinearLayer.

Parameters:

axes (str|list[str]) – e.g. “spatial”, but all those axes must be of fixed dimension
size (tuple[int]) – for double-checking, you can explicitly provide the size

layer_class: Optional[str] = 'elemwise_prod'[source]¶

classmethod get_out_data_from_opts(name, sources, **kwargs)[source]¶

Parameters:

name (str)
sources (list[LayerBase])

Return type:

Data

class returnn.tf.layers.basic.PrefixInTimeLayer(axis='T', out_dim=None, prefix=0.0, repeat=1, size_base=None, **kwargs)[source]¶

Adds some prefix in time dimension. This is roughly the reverse of what SliceNdLayer does. Also see PadLayer for static dimensions. Also see PostfixInTimeLayer.

Parameters:

axis (Dim|str)
out_dim (Dim|None)
prefix (float|str) – either some constant or another layer
repeat (int|LayerBase) – how often to repeat the prefix
size_base (LayerBase|None) – copy seq-lens from here

layer_class: Optional[str] = 'prefix_in_time'[source]¶

recurrent = True[source]¶

get_dep_layers()[source]¶

Return type:: list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]¶

Parameters:

d (dict[str]) – will modify inplace
network (returnn.tf.network.TFNetwork)
get_layer (((str) -> LayerBase)) – function to get or construct another layer

classmethod get_out_data_from_opts(name, sources, axis='T', out_dim=None, size_base=None, repeat=1, **kwargs)[source]¶

Parameters:

name (str)
sources (list[LayerBase])
axis (Dim|str)
out_dim (Dim|None)
size_base (LayerBase|None)
repeat (LayerBase|int)

Return type:

Data

class returnn.tf.layers.basic.PostfixInTimeLayer(axis='T', out_dim=None, postfix=0.0, repeat=1, **kwargs)[source]¶

Adds some postfix in time dimension. Also see PrefixInTimeLayer.

Parameters:

axis (Dim|str)
out_dim (Dim|None)
postfix (float|int|LayerBase) – constant or other layer without time axis to use as postfix
repeat (int) – how often to repeat the postfix
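
A plain-Python sketch of appending a postfix to a padded batch. It assumes the postfix should follow the last valid frame of each sequence rather than the padding; this behaviour, the helper name postfix_in_time and the pad_value argument are assumptions for illustration.

```python
from typing import List, Sequence, Tuple


def postfix_in_time(
    batch: Sequence[Sequence[float]],
    seq_lens: Sequence[int],
    postfix: float = 0.0,
    repeat: int = 1,
    pad_value: float = 0.0,
) -> Tuple[List[List[float]], List[int]]:
    """Append `repeat` copies of `postfix` after the valid frames of each sequence."""
    new_lens = [n + repeat for n in seq_lens]
    max_len = max(new_lens)
    out = []
    for seq, n in zip(batch, seq_lens):
        row = list(seq[:n]) + [postfix] * repeat
        row += [pad_value] * (max_len - len(row))  # re-pad to the new max length
        out.append(row)
    return out, new_lens


rows, lens_out = postfix_in_time([[1, 2, 3], [4, 0, 0]], [3, 1], postfix=-1, repeat=2)
```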

layer_class: Optional[str] = 'postfix_in_time'[source]¶

recurrent = True[source]¶

classmethod get_out_data_from_opts(name, sources, axis='T', out_dim=None, postfix=0.0, repeat=1, **kwargs)[source]¶

Parameters:

axis (Dim|str)
out_dim (Dim|None)
name (str)
sources (list[LayerBase])
postfix (float|int|LayerBase) – constant or other layer without time axis to use as postfix
repeat (int)

Return type:

Data

classmethod transform_config_dict(d, network, get_layer)[source]¶

Parameters:

d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer

get_dep_layers()[source]¶

Return type:: list[LayerBase]

class returnn.tf.layers.basic.TimeChunkingLayer(chunk_size, chunk_step, axis='T', out_dim=None, **kwargs)[source]¶

Performs chunking in time. See returnn.tf.native_op.chunk(). See also WindowLayer and TimeUnChunkingLayer. It’s very similar to WindowLayer, but we have this case more optimized, and also it modifies the batch dim. The output is of shape (chunk_size, n_batch * n_chunks, …).

Parameters:

chunk_size (int) – chunk size or window size
chunk_step (int) – chunk step or striding
axis (Dim|str)
out_dim (Dim|None)
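
How chunk_size and chunk_step interact can be sketched in plain Python for a single sequence. Zero-padding the last chunk is an assumption of this sketch, and the helper name chunk_in_time is hypothetical; the layer itself works on whole batches and merges the chunks into the batch dim.

```python
from typing import List, Sequence


def chunk_in_time(seq: Sequence[float], chunk_size: int, chunk_step: int, pad_value: float = 0.0) -> List[List[float]]:
    """Cut a sequence into windows of chunk_size, starting every chunk_step frames."""
    chunks = []
    start = 0
    while True:
        chunk = list(seq[start:start + chunk_size])
        chunk += [pad_value] * (chunk_size - len(chunk))  # pad the last chunk
        chunks.append(chunk)
        if start + chunk_size >= len(seq):
            break  # this chunk already covers the end of the sequence
        start += chunk_step
    return chunks


chunks = chunk_in_time([1, 2, 3, 4, 5, 6, 7], chunk_size=4, chunk_step=2)
```

With chunk_step smaller than chunk_size the chunks overlap, as in this example where each chunk shares two frames with the next one.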

layer_class: Optional[str] = 'time_chunking'[source]¶

recurrent = True[source]¶

classmethod get_out_data_from_opts(name, sources, axis='T', out_dim=None, **kwargs)[source]¶

Parameters:

name (str)
sources (list[LayerBase])
axis (Dim|str)
out_dim (Dim|None)

Return type:

Data

class returnn.tf.layers.basic.TimeUnChunkingLayer(chunking_layer, **kwargs)[source]¶

Reverses the chunking in time performed by TimeChunkingLayer. See returnn.tf.native_op.chunk() for the chunking itself. See TimeChunkingLayer.

Parameters:: chunking_layer (TimeChunkingLayer)

layer_class: Optional[str] = 'time_unchunking'[source]¶

recurrent = True[source]¶

get_dep_layers()[source]¶

Return type:: list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]¶

Parameters:

d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer

classmethod get_out_data_from_opts(name, sources, chunking_layer, **kwargs)[source]¶

Parameters:

name (str)
sources (list[LayerBase])
chunking_layer (LayerBase)

Return type:

Data

class returnn.tf.layers.basic.DotLayer(reduce=<class 'returnn.util.basic.NotSpecified'>, red1=<class 'returnn.util.basic.NotSpecified'>, red2=<class 'returnn.util.basic.NotSpecified'>, var1=<class 'returnn.util.basic.NotSpecified'>, var2=<class 'returnn.util.basic.NotSpecified'>, add_var2_if_empty=<class 'returnn.util.basic.NotSpecified'>, use_mask: bool = True, debug=False, **kwargs)[source]¶

This performs a dot-product of two sources. The underlying matmul expects shapes (shared…, I, J) * (shared…, J, K) -> (shared…, I, K). We say that J is the axis to be reduced, I is the var-dim of source 1, and K is the var-dim of source 2. I, J, K can also be multiple axes from the sources. The var-dims don’t need to exist. All other axes (shared…) are expected to match.

You should try to avoid having the same dims in both sources when they are not reduced such that you would end up having some dim twice in the output, e.g. (shared…, I, I). You should avoid this because the dim order should never matter (https://github.com/rwth-i6/returnn/wiki/RETURNN-principles). If you need to perform such an operation, you can use ReinterpretDataLayer to introduce a new dim tag.

The reduce dim can also be the sparse dim of one of the sources. In this case, it behaves like GatherLayer.

Parameters:

reduce (str|Dim|tuple[str|Dim]|list[str|Dim]) – reduce axes of both sources
red1 (str|Dim|tuple[str|Dim]|list[str|Dim]) – reduce axes of first source
red2 (str|Dim|tuple[str|Dim]|list[str|Dim]) – reduce axes of second source
var1 (str|Dim|tuple[str|Dim]|list[str|Dim]|None) – var axes of first source
var2 (str|Dim|tuple[str|Dim]|list[str|Dim]|None) – var axes of second source
add_var2_if_empty (bool) – if var2=None, add dim=1 at the end
use_mask – If the reduction is over dynamic axes, masking must be applied to one of the inputs to get the correct sum reduction. This is done automatically. Set this flag to False to disable the masking.
debug (bool) – will print debug shapes, etc.

Earlier defaults: red1=-1, red2=-2, var1=-2, var2=-1, add_var2_if_empty=True. However, these are bad for multiple reasons, such as using integers, but also in general. See https://github.com/rwth-i6/returnn/issues/627 for details.
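
The shape convention (shared…, I, J) * (shared…, J, K) -> (shared…, I, K) can be illustrated with nested lists in plain Python, using one shared axis. The helper batched_dot is hypothetical and ignores masking and dim tags.

```python
from typing import List

Matrix = List[List[float]]


def batched_dot(a: List[Matrix], b: List[Matrix]) -> List[Matrix]:
    """a: [shared][I][J], b: [shared][J][K] -> [shared][I][K]."""
    out = []
    for a_mat, b_mat in zip(a, b):
        n_j = len(b_mat)  # J is the reduced axis
        n_k = len(b_mat[0])  # K is the var-dim of the second source
        out.append([
            [sum(row[j] * b_mat[j][k] for j in range(n_j)) for k in range(n_k)]
            for row in a_mat  # I is the var-dim of the first source
        ])
    return out


a = [[[1.0, 2.0], [3.0, 4.0]]]  # shared=1, I=2, J=2
b = [[[5.0], [6.0]]]  # shared=1, J=2, K=1
product = batched_dot(a, b)
```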

layer_class: Optional[str] = 'dot'[source]¶

classmethod transform_config_dict(d, network, get_layer)[source]¶

Parameters:

d (dict[str]) – will modify inplace
network (returnn.tf.network.TFNetwork)
get_layer (((str) -> LayerBase)) – function to get or construct another layer

classmethod get_out_data_from_opts(name, sources, reduce=<class 'returnn.util.basic.NotSpecified'>, red1=<class 'returnn.util.basic.NotSpecified'>, red2=<class 'returnn.util.basic.NotSpecified'>, var1=<class 'returnn.util.basic.NotSpecified'>, var2=<class 'returnn.util.basic.NotSpecified'>, add_var2_if_empty=<class 'returnn.util.basic.NotSpecified'>, **kwargs)[source]¶

Parameters:

name (str)
sources (list[LayerBase])
reduce (str|Dim|tuple[str|Dim]|list[str|Dim]) – reduce axes of both sources
red1 (str|Dim|tuple[str|Dim]|list[str|Dim]) – reduce axes of first source
red2 (str|Dim|tuple[str|Dim]|list[str|Dim]) – reduce axes of second source
var1 (str|Dim|tuple[str|Dim]|list[str|Dim]|None) – var axes of first source
var2 (str|Dim|tuple[str|Dim]|list[str|Dim]|None) – var axes of second source
add_var2_if_empty (bool)

Return type:

Data

class returnn.tf.layers.basic.ShiftAxisLayer(axis, amount, pad=True, pad_value=0, adjust_size_info=True, **kwargs)[source]¶

Shifts the dimensions in an axis around by slicing and optional padding. This layer may change the axis-dimension.

This name might be confusing. No axis will be shifted here. See SwapAxesLayer for that.

Also see SliceLayer.

Parameters:

axis (str|Dim|int) – single axis to shift
amount (int) – number of elements to shift (<0 for left-shift, >0 for right-shift)
pad (bool) – preserve shape by padding
pad_value (int|float|bool) – padding value
adjust_size_info (bool) – whether to adjust the size_placeholder
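The effect of amount and pad can be sketched on a plain Python list. This is only an illustration of the documented semantics under the assumption that a right-shift moves elements towards higher indices. The helper shift_axis is hypothetical and is not the layer implementation.

```python
def shift_axis(seq, amount, pad=True, pad_value=0):
    """Shift a 1-D sequence; amount > 0 shifts right, amount < 0 shifts left."""
    if amount == 0:
        return list(seq)
    if amount > 0:
        kept = list(seq[:max(len(seq) - amount, 0)])  # drop items at the end
        return [pad_value] * (len(seq) - len(kept)) + kept if pad else kept
    kept = list(seq[-amount:])  # drop items at the start
    return kept + [pad_value] * (len(seq) - len(kept)) if pad else kept


right = shift_axis([1, 2, 3, 4], 1)
left = shift_axis([1, 2, 3, 4], -1)
shorter = shift_axis([1, 2, 3, 4], 1, pad=False)
```

With pad=False the result is shorter than the input, which corresponds to the note that the layer may change the axis dimension.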

layer_class: Optional[str] = 'shift_axis'[source]¶

classmethod get_out_data_from_opts(name, sources, amount, axis, pad=True, adjust_size_info=True, **kwargs)[source]¶

Parameters:

name (str)
sources (list[LayerBase])
amount (int)
axis (str)
pad (bool)
adjust_size_info (bool)

Return type:

Data

class returnn.tf.layers.basic.ResizeLayer(factor, axis, out_dim=None, kind='nn', fill_value=None, fill_dropout=None, **kwargs)[source]¶

Resizes the input, i.e. upsampling or downsampling. Supports different kinds, such as linear interpolation or nearest-neighbor.

Parameters:

factor (int|float|LayerBase) – out_len = in_len * factor
axis (Dim|str) – the axis to resize
out_dim (Dim|None)
kind (str) – “linear”, “nn”/“nearest_neighbor”, “cubic”, “fill”
fill_value (None|int|float) – if kind==“fill”
fill_dropout (float|None) – if set, will dropout in the same axis
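As a rough sketch of out_len = in_len * factor: for an integer upsampling factor, nearest-neighbor resizing can be thought of as repeating each frame. The helper resize_nn and the layer names below are hypothetical, and the config fragment assumes the source has a time axis "T".

```python
def resize_nn(seq, factor):
    """Nearest-neighbor upsampling of a 1-D sequence by an integer factor."""
    return [x for x in seq for _ in range(factor)]


upsampled = resize_nn([1, 2, 3], 2)  # length 3 * 2

# Hypothetical config fragment using only the options documented above.
net_dict_fragment = {
    "upsampled": {"class": "resize", "from": "encoder", "axis": "T", "factor": 2, "kind": "nn"},
}
```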

layer_class: Optional[str] = 'resize'[source]¶

get_dep_layers()[source]¶

Return type:: list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]¶

Parameters:

d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer ((str)->LayerBase)

classmethod get_out_data_from_opts(factor, axis, sources, name, fill_dropout=None, out_dim=None, **kwargs)[source]¶

Parameters:

factor (int|float|LayerBase)
axis (Dim|str)
sources (list[LayerBase])
name (str)
fill_dropout (float|None)
out_dim (Dim|None)

Return type:

Data

class returnn.tf.layers.basic.CombineDimsLayer(**kwargs)[source]¶

Combines multiple dimensions. See also MergeDimsLayer. This is deprecated in favor of MergeDimsLayer.

Parameters:: axes (int|list[int]|str) – one or multiple axes to reduce. This is counted with the batch dim, which by default is axis 0 (see enforce_batch_dim_axis). It also accepts the special tokens “B”|“batch”, “spatial”, “spatial_except_time”, or “F”|“feature”

layer_class: Optional[str] = 'combine_dims'[source]¶

classmethod get_out_data_from_opts(**kwargs)[source]¶

Return type:: Data

class returnn.tf.layers.basic.RemoveLayer(symbol, axis='T', out_dim=None, **kwargs)[source]¶

Currently, assumes sparse data, and removes a specific symbol from the data.

It is recommended to use MaskedComputationLayer in combination with e.g. a CompareLayer instead, as this provides more flexibility.

Parameters:

symbol (int)
axis (Dim|str) – the axis to operate over, to potentially remove frames
out_dim (Dim|None) – derived from the dim of axis, the reduced new dim
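Conceptually, removing a symbol from sparse data shortens each sequence individually. The following sketch shows this on plain lists of label indices. The helper remove_symbol and the layer names are hypothetical.

```python
def remove_symbol(batch, symbol):
    """Remove `symbol` from each sequence; returns (sequences, new lengths)."""
    filtered = [[x for x in seq if x != symbol] for seq in batch]
    return filtered, [len(seq) for seq in filtered]


# Two sequences of label indices, where 0 is the symbol to remove.
sequences, lengths = remove_symbol([[3, 0, 4, 0], [0, 5, 0, 0]], symbol=0)

# Hypothetical config fragment; "labels" is an assumed sparse source layer.
net_dict_fragment = {
    "labels_no_blank": {"class": "remove", "from": "labels", "symbol": 0, "axis": "T"},
}
```

The new per-sequence lengths are what the reduced out_dim would describe.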

layer_class: Optional[str] = 'remove'[source]¶

classmethod get_out_data_from_opts(name, sources, axis='T', out_dim=None, **kwargs)[source]¶

Parameters:

name (str)
sources (list[LayerBase])
axis (Dim|str)
out_dim (Dim|None)

Return type:

Data

class returnn.tf.layers.basic.CombineLayer(kind, sources, allow_broadcast_all_sources=<class 'returnn.util.basic.NotSpecified'>, activation=None, with_bias=False, eval=None, eval_locals=None, eval_for_output_loss=False, **kwargs)[source]¶

Applies a binary operation, such as addition, to all sources while accumulating the partial results. In the first step, the binary operation is performed on the first two sources. After the first step, the previous result is always the left-hand operand.

Its basic working is similar to the reduce function used in functional programming. Also see ActivationLayer, or CompareLayer.

Parameters:

kind (str) – currently accepted values are average, add, sub, mul, truediv, floordiv, mod, pow, maximum, minimum, logical_and, logical_or, squared_difference, logaddexp, or eval, or any function in the tf.math or tf namespace.
sources (list[LayerBase])
allow_broadcast_all_sources (bool|NotSpecified) – allow broadcasting for all sources. e.g. shape [A] + [B] -> shape [A,B]. by default disabled, and there must be some source with all dims.
activation (str|None) – if provided, activation function to apply, e.g. “tanh” or “relu”
with_bias (bool) – if True, will add a trainable bias tensor
eval (str|callable) – for kind=“eval”, will evaluate this string or call this function. See _op_kind_eval()
eval_locals (dict[str]|None) – locals for eval
eval_for_output_loss (bool) – will do the same eval on layer.output_loss
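The accumulation order can be sketched with functools.reduce on scalars, covering only a few of the accepted kinds. The helper combine and the layer names are hypothetical. The real layer operates on tensors with broadcasting.

```python
from functools import reduce
import operator


def combine(kind, sources):
    """Left fold of a binary op over the sources, as in functional reduce."""
    ops = {
        "add": operator.add,
        "sub": operator.sub,
        "mul": operator.mul,
        "maximum": max,
        "minimum": min,
    }
    return reduce(ops[kind], sources)


difference = combine("sub", [10, 3, 2])  # (10 - 3) - 2
total = combine("add", [1, 2, 3])

# Hypothetical config fragment; "a", "b" and "c" are assumed existing layers.
net_dict_fragment = {
    "sum": {"class": "combine", "kind": "add", "from": ["a", "b", "c"]},
}
```

For non-commutative kinds such as sub, the order of the sources therefore matters.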

layer_class: Optional[str] = 'combine'[source]¶

recurrent = True[source]¶

classmethod get_out_data_from_opts(network, sources, eval_locals=None, n_out=<class 'returnn.util.basic.NotSpecified'>, out_type=None, allow_broadcast_all_sources=<class 'returnn.util.basic.NotSpecified'>, out_shape=None, **kwargs)[source]¶

Parameters:

network (returnn.tf.network.TFNetwork)
sources (list[LayerBase])
eval_locals (dict[str]|None) – locals for eval, will also be passed to out_type if out_type is a function
n_out (int|None|NotSpecified)
allow_broadcast_all_sources (bool|NotSpecified)
out_type (dict[str]|None|(()->Data))
out_shape (set[Dim|_MarkedDim]|tuple|list|None) – verifies the output shape (dim tags)

Return type:

Data

class returnn.tf.layers.basic.EvalLayer(eval, **kwargs)[source]¶

Evaluates some string. The CombineLayer provides this functionality, thus this is just a special case of it. Also see ActivationLayer, or CompareLayer.

The output type is defined as a broadcasted extension of all sources. You can overwrite it by (partially) specifying out_type. out_type can also be a generic Python function, returning a Data instance.

Parameters:: eval (str) – will eval this string. see _op_kind_eval()
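As a rough sketch of how such a string refers to its inputs, the hypothetical helper eval_sources below evaluates an expression with a source(i) function that returns the i-th input. It works on plain numbers only. The real layer evaluates the string on the tensors of its sources.

```python
def eval_sources(expr, sources):
    """Evaluate `expr`, where source(i) returns the i-th entry of `sources`."""
    def source(i):
        return sources[i]

    return eval(expr, {"source": source})


squared_error = eval_sources("(source(0) - source(1)) ** 2", [5.0, 3.0])

# Hypothetical config fragment; "output" and "data:classes" are assumed sources.
net_dict_fragment = {
    "se": {"class": "eval", "eval": "(source(0) - source(1)) ** 2", "from": ["output", "data:classes"]},
}
```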

layer_class: Optional[str] = 'eval'[source]¶

class returnn.tf.layers.basic.CompareLayer(kind='equal', value=None, allow_broadcast_all_sources=<class 'returnn.util.basic.NotSpecified'>, **kwargs)[source]¶

Compares element-wise the tokens of all input sequences among themselves and/or with a specified given value. The comparisons are performed in a chain according to the order in which they are listed.

Example:

{"class": "compare", "from": ["i1", "i2"], "value": val, "kind": "less"}

computes i1 < i2 < val and it is true only if the whole chain of operations is true. The final result is the logical “and” of all comparisons. Note that value is the last element to be compared to.

A common example usage is the end layer in a rec subnetwork to specify the stopping criterion, e.g. the last generated token is equal to the end-of-sentence token:

"output": {"class": "rec", "from": [], "unit": {
    .
    .
    .
    "end": {"class": "compare", "from": "output", "value": end_of_sentence_id}
}, "target": "classes0"}

Parameters:

kind (str) – which comparison operation to use, e.g. “equal”, “greater”, “less” or other supported TF comparison ops
value (float|int|None) – if specified, will also compare to this
allow_broadcast_all_sources (bool|NotSpecified) – allow broadcasting for all sources. e.g. shape [A] + [B] -> shape [A,B]. by default disabled, and there must be some source with all dims.
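The chained comparison can be sketched for scalars as follows, with value appended as the last element of the chain. The helper compare_chain is hypothetical and covers only three kinds. The real layer compares tensors element-wise.

```python
import operator


def compare_chain(kind, sources, value=None):
    """Logical "and" over comparisons of neighbouring items in the chain."""
    ops = {"equal": operator.eq, "less": operator.lt, "greater": operator.gt}
    items = list(sources) + ([value] if value is not None else [])
    return all(ops[kind](a, b) for a, b in zip(items, items[1:]))


in_order = compare_chain("less", [1, 2], value=3)  # 1 < 2 < 3
broken = compare_chain("less", [1, 5], value=3)  # 1 < 5, but not 5 < 3
is_eos = compare_chain("equal", [7], value=7)  # like the "end" layer above
```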

layer_class: Optional[str] = 'compare'[source]¶

classmethod get_out_data_from_opts(sources, allow_broadcast_all_sources=<class 'returnn.util.basic.NotSpecified'>, n_out=<class 'returnn.util.basic.NotSpecified'>, out_type=None, out_shape=None, **kwargs)[source]¶

Parameters:

sources (list[LayerBase])
allow_broadcast_all_sources (bool|NotSpecified)
n_out (int|None|NotSpecified)
out_type (dict[str]|None)
out_shape (dict[str]|None)

Return type:

Data

class returnn.tf.layers.basic.SwitchLayer(condition, true_from, false_from, **kwargs)[source]¶

Wrapper around tf.where() (or more generically returnn.tf.util.basic.where_bc()), or statically choose a single source if the condition is a callable (…)->bool. (tf.cond is not useful here, as the sources would have been already constructed and computed.)

This layer is also useful for applying any kind of generic masking to the frames. E.g. one could have a layer called “mask” computing a boolean mask for the values stored in another layer “input”. Then use this layer with condition=”mask”, true_from=”input”, false_from=mask_value, to mask out all frames where the mask is false with the mask_value.

See also CondLayer. See also SeqLenMaskLayer if you just want to mask using the sequence lengths.

Parameters:

condition (LayerBase|bool) – if callable, expected to be (…)->bool, and called in transform_config_dict
true_from (LayerBase|float|int|None)
false_from (LayerBase|float|int|None)
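The masking use case described above could look like the following sketch. The layer names "mask" and "input" are taken from the description. The helper switch is hypothetical and only mimics the element-wise selection on plain lists, with a constant as false_from.

```python
def switch(condition, true_values, false_value):
    """Element-wise select: keep the value where the condition holds."""
    return [t if c else false_value for c, t in zip(condition, true_values)]


masked = switch([True, False, True], [1.5, 2.5, 3.5], 0.0)

# Config fragment for the masking example; 0.0 is an assumed mask value.
net_dict_fragment = {
    "masked": {"class": "switch", "condition": "mask", "true_from": "input", "false_from": 0.0},
}
```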

layer_class: Optional[str] = 'switch'[source]¶

classmethod transform_config_dict(d, network, get_layer)[source]¶

Parameters:

d (dict[str]) – will modify inplace
network (returnn.tf.network.TFNetwork)
get_layer (((str) -> LayerBase)) – function to get or construct another layer

classmethod get_out_data_from_opts(name, condition, true_from, false_from, **kwargs)[source]¶

Parameters:

name (str)
condition (LayerBase|bool)
true_from (LayerBase|float|int|None)
false_from (LayerBase|float|int|None)

Return type:

Data

get_dep_layers()[source]¶

Return type:: list[LayerBase]

class returnn.tf.layers.basic.CondLayer(condition, true_layer, false_layer, _condition_network=None, _true_layer_network=None, _false_layer_network=None, _extra_out=None, **kwargs)[source]¶

See also SwitchLayer, which uses tf.where(). Here, we use tf.cond instead. I.e. the condition has to be a scalar bool, and only the corresponding true/false branch is computed.

true_layer/false_layer are layer dicts, which are in the same namescope as this layer, however, they are in the corresponding control flow context (tf.cond).

You can use SubnetworkLayer inside to embed any more complex logic.

There can be more than one output via sub-layers. Specifically, it will make all from get_available_sub_layer_names() available. For a SubnetworkLayer, these are all the output layers in the sub-network.

Parameters:

condition (LayerBase|dict[str])
true_layer (LayerBase|dict[str])
false_layer (LayerBase|dict[str])
_extra_out (dict[str,(Data, type, dict[str])])
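A config fragment using this layer might look like the sketch below. All layer names are hypothetical, and the choice of a train_flag layer dict as the scalar bool condition and of copy layers as branches is an assumption made for illustration only.

```python
# Hypothetical fragment: use "augmented" in training and "data" otherwise.
net_dict_fragment = {
    "features": {
        "class": "cond",
        "from": [],
        "condition": {"class": "train_flag"},  # scalar bool condition
        "true_layer": {"class": "copy", "from": "augmented"},
        "false_layer": {"class": "copy", "from": "data"},
    },
}

branch_keys = sorted(k for k in net_dict_fragment["features"] if k.endswith("_layer"))
```

Only the selected branch is computed, which is the difference to a switch layer, where both sources are already computed.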

layer_class: Optional[str] = 'cond'[source]¶

recurrent = True[source]¶

classmethod transform_config_dict(d, network, get_layer)[source]¶

Parameters:

d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer ((str)->LayerBase)

classmethod get_out_data_from_opts(true_layer, false_layer, name, network, **kwargs)[source]¶

Parameters:

true_layer (LayerBase|dict[str])
false_layer (LayerBase|dict[str])
name (str)
network (returnn.tf.network.TFNetwork)

Return type:

Data

get_sub_layer(layer_name)[source]¶

Parameters:: layer_name (str)
Return type:: LayerBase|None

classmethod get_available_sub_layer_names(parent_layer_kwargs)[source]¶

Parameters:: parent_layer_kwargs (dict[str])
Return type:: list[str]

classmethod get_sub_layer_out_data_from_opts(layer_name, parent_layer_kwargs)[source]¶

Parameters:

layer_name (str) – name of the sub_layer (right part of ‘/’ separated path)
parent_layer_kwargs (dict[str]) – kwargs for the parent layer (as kwargs in cls.get_out_data_from_opts())

Returns:

Data template, class type of sub-layer, layer opts (transformed)

Return type:

(Data, type, dict[str])|None

get_sub_layers()[source]¶

Return type:: list[LayerBase]

class returnn.tf.layers.basic.TopKLayer(axis, k, k_dim=None, sorted=True, **kwargs)[source]¶

Basically wraps tf.nn.top_k.

Directly returns the top_k values. The indices are accessible via the “indices” sub-layer.

For an input [B,D] with axis=D, the output and indices values are shape [B,K].

It’s somewhat similar to ReduceLayer with max and argmax. The axis dim is reduced and then a new dim for K is added.

Axis can also cover multiple axes, such as [beam,classes]. In that case, there is not a single “indices” sub-layer, but sub-layers “indices0” .. “indices{N-1}” corresponding to each axis, in the same order.

All other axes are treated as batch dims.

Parameters:

axis (Dim|str|list[Dim|str]) – the axis to do the top_k on, which is reduced
k (int|LayerBase) – the “K” in “TopK”
k_dim (Dim|None) – the output dim tag corresponding to k
sorted (bool)
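The relation between the returned values and the “indices” sub-layer can be sketched for a single vector. The helper top_k and the layer names are hypothetical. The sub-layer path "best/indices" assumes the usual ‘/’ separated sub-layer naming.

```python
def top_k(values, k):
    """Return the k largest values (descending) and their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    return [values[i] for i in order], order


best_values, best_indices = top_k([0.1, 0.7, 0.2, 0.5], k=2)

# Hypothetical config fragment; "scores" is an assumed layer with a feature dim.
net_dict_fragment = {
    "best": {"class": "top_k", "from": "scores", "axis": "F", "k": 2},
    "best_indices": {"class": "copy", "from": "best/indices"},
}
```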

layer_class: Optional[str] = 'top_k'[source]¶

classmethod transform_config_dict(d, network, get_layer)[source]¶

Parameters:

d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer

classmethod get_out_data_from_opts(name, network, sources, axis, k, k_dim, **kwargs)[source]¶

Parameters:

name (str)
network (returnn.tf.network.TFNetwork)
sources (list[LayerBase])
axis (Dim|str|list[Dim|str]) – the axis to do the top_k on, which is reduced
k (int|LayerBase) – the “K” in “TopK”
k_dim (Dim|None) – the output dim tag corresponding to k

Return type:

Data

get_sub_layer(layer_name)[source]¶

Parameters:: layer_name (str) – sub layer name
Return type:: LayerBase|None

classmethod get_available_sub_layer_names(parent_layer_kwargs)[source]¶

Parameters:: parent_layer_kwargs (dict[str])
Return type:: list[str]

classmethod get_sub_layer_out_data_from_opts(layer_name, parent_layer_kwargs)[source]¶

Parameters:

layer_name (str) – sub layer name
parent_layer_kwargs (dict[str])

Returns:

Data template, class type of sub-layer, layer opts (transformed)

Return type:

(Data, type, dict[str])|None

class returnn.tf.layers.basic.SearchSortedLayer(sorted_sequence, values, axis='T', side='left', **kwargs)[source]¶

Basically wraps tf.searchsorted().

Takes a tensor sorted_sequence that is sorted along one axis, and a tensor values. Will compute an output tensor with the same axes as values, where each entry is the index of the value within the sorted sequence. All (batch) axes of sorted_sequence except for the axis it is sorted along must be present in values.

Parameters:

sorted_sequence (LayerBase)
values (LayerBase) – search values
axis (str) – the axis along which sorted_sequence is sorted
side (str) – “left” or “right”. When one of the values exactly matches an element of the sorted_sequence, whether to choose the lower or higher index.
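The difference between side=“left” and side=“right” matches that between bisect_left and bisect_right in the Python standard library, as this small sketch on plain lists shows. The variable names are only for illustration.

```python
from bisect import bisect_left, bisect_right

sorted_sequence = [1, 3, 3, 5]
values = [0, 3, 4, 6]

# For each value, the index at which it would be inserted to keep the order.
left = [bisect_left(sorted_sequence, v) for v in values]
right = [bisect_right(sorted_sequence, v) for v in values]
```

The two results differ only for the value 3, which exactly matches elements of the sorted sequence.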

layer_class: Optional[str] = 'search_sorted'[source]¶

recurrent = True[source]¶

get_dep_layers()[source]¶

Return type:: list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]¶

Parameters:

d (dict[str]) – will modify inplace
network (returnn.tf.network.TFNetwork)
get_layer (((str) -> LayerBase)) – function to get or construct another layer

classmethod get_out_data_from_opts(sorted_sequence, values, axis, name, network, **kwargs)[source]¶

Parameters:

sorted_sequence (LayerBase)
values (LayerBase) – search values
axis (str) – the axis along which sorted_sequence is sorted
name (str)
network (returnn.tf.network.TFNetwork)

Return type:

Data

class returnn.tf.layers.basic.SubnetworkLayer(subnetwork, _subnet, _output, concat_sources=True, load_on_init=None, dropout=0, dropout_noise_shape=None, _parent_layer_cache=None, _from=None, **kwargs)[source]¶

You can define a whole subnetwork as a single layer by this class.

The subnetwork will be specified by a dict[str,dict[str]], just like a normal network is specified in the config.

The "output" layer of the subnetwork will be the output of this subnetwork-layer.

With concat_sources=True (default), the input to this layer will be represented as "data:data" or simply "data" in the subnetwork.
Otherwise, with concat_sources=False, each input to this layer will be represented as "data:input_layer_name" and also as "data:0" to "data:<n-1>" for n inputs in the subnetwork. The first input will also be available simply as "data:data" or "data".

Parameters:

subnetwork (dict[str,dict]) – subnetwork as dict (JSON content). must have an “output” layer
concat_sources (bool) – if we concatenate all sources into one, like it is standard for most other layers
load_on_init (str|dict[str]|None) – if provided, for parameter initialization, we will load the given model file. see CustomCheckpointLoader.
dropout (float) – will be applied if train_flag is set
dropout_noise_shape (tuple|list|dict|None)
_parent_layer_cache (dict[str,LayerBase]|None)
_subnet (returnn.tf.network.Subnetwork)
_output (LayerBase)
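A small config fragment with a subnetwork might look like this sketch. The layer names and sizes are hypothetical. It assumes the default concat_sources=True, so the input is available as "data" inside the subnetwork.

```python
# Hypothetical fragment: a two-layer feed-forward block as a single layer.
net_dict_fragment = {
    "ff_block": {
        "class": "subnetwork",
        "from": "data",
        "subnetwork": {
            # "data" here refers to the input of the subnetwork layer.
            "hidden": {"class": "linear", "activation": "relu", "n_out": 128, "from": "data"},
            # The "output" layer defines the output of "ff_block".
            "output": {"class": "linear", "activation": None, "n_out": 64, "from": "hidden"},
        },
    },
}

inner_layers = sorted(net_dict_fragment["ff_block"]["subnetwork"])
```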

layer_class: Optional[str] = 'subnetwork'[source]¶

recurrent = True[source]¶

update_params_from_subnet()[source]¶: Update self.params.

update_rec_vars_outputs()[source]¶: Update self.rec_vars_outputs.

update_load_on_init()[source]¶: Handle load_on_init.

classmethod get_out_data_from_opts(n_out=<class 'returnn.util.basic.NotSpecified'>, out_type=None, **kwargs)[source]¶

Parameters:

n_out (int|None|NotSpecified)
out_type (dict[str]|None)

Return type:

Data

classmethod transform_config_dict(d, network, get_layer)[source]¶

Parameters:

d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer

classmethod get_sub_layer_out_data_from_opts(layer_name, parent_layer_kwargs)[source]¶

Parameters:

layer_name (str) – name of the sub_layer (right part of ‘/’ separated path)
parent_layer_kwargs (dict[str]) – kwargs for the parent layer (as kwargs in cls.get_out_data_from_opts())

Returns:

Data template, class type of sub-layer, layer opts (transformed)

Return type:

(Data, type, dict[str])|None

classmethod cls_get_sub_network(name, network, layer_desc)[source]¶

Parameters:

name (str)
network (returnn.tf.network.TFNetwork)
layer_desc (dict[str])

Return type:

returnn.tf.network.Subnetwork|None

get_sub_layer(layer_name)[source]¶

Parameters:: layer_name (str) – name of the sub_layer (right part of ‘/’ separated path)
Returns:: the sub_layer addressed in layer_name or None if no sub_layer exists
Return type:: LayerBase|None

classmethod get_available_sub_layer_names(parent_layer_kwargs)[source]¶

Parameters:: parent_layer_kwargs (dict[str])
Return type:: list[str]

get_sub_networks()[source]¶

Return type:: list[returnn.tf.network.TFNetwork]

get_sub_layers()[source]¶

Return type:: list[LayerBase]

get_dep_layers()[source]¶

Returns:: list of layers this layer depends on. normally this is just self.sources but e.g. the attention layer in addition has a base, etc.
Return type:: list[LayerBase]

get_last_hidden_state(key)[source]¶

Parameters:: key (int|str|None) – also the special key “*”
Return type:: tf.Tensor|None

classmethod get_rec_initial_extra_outputs(batch_dim, rec_layer, encapsulate=False, **kwargs)[source]¶

Parameters:

batch_dim (tf.Tensor) – for this layer, might be with beam
rec_layer (returnn.tf.layers.rec.RecLayer)
encapsulate (bool)

Return type:

dict[str,tf.Tensor]

classmethod get_rec_initial_extra_outputs_shape_invariants(rec_layer, encapsulate=False, **kwargs)[source]¶

Parameters:

rec_layer (returnn.tf.layers.rec.RecLayer)
encapsulate (bool)

Returns:

optional shapes for the tensors by get_rec_initial_extra_outputs

Return type:

dict[str,tf.TensorShape]

class returnn.tf.layers.basic.TrainFlagLayer(**kwargs)[source]¶

Returns the train flag (bool scalar) of the current network.

Usually the arguments, when specified in the network dict, go through transform_config_dict() before they are passed here. See TFNetwork.construct_from_dict().

Parameters:

name (str)
network (returnn.tf.network.TFNetwork)
output (Data) – Set a specific output instead of using get_out_data_from_opts()
n_out (NotSpecified|None|int) – output dim
out_dim (returnn.tensor.Dim|None) – output feature dim tag
out_type (dict[str]) – kwargs for Data class. more explicit than n_out.
out_shape (set[returnn.tensor.Dim|returnn.tf.util.data._MarkedDim]|tuple|list|None) – verifies the output shape (dim tags). See Data.verify_out_shape().
sources (list[LayerBase]) – via self.transform_config_dict()
in_dim (returnn.tensor.Dim|None) – input feature dim tag
target (str|list[str]|None) – if some loss is set, this is the target data-key, i.e. network.extern_data.get_data(target). alternatively, this also can be a layer name.
_target_layers (dict[str,LayerBase]|None) – if target.startswith(“layer:”), then this is target -> layer
size_target (str|None) – like target but this is only used to set our output size in case of training
loss (Loss|None) – via transform_config_dict(). Every layer can have one loss (of type Loss), or no loss. In the net dict, it is specified as a string. In TFNetwork, all losses from all layers will be collected. That is what TFUpdater.Updater will use for training.
reuse_params (ReuseParams|None) – if given, will optionally reuse the params. See self.var_creation_scope(). See also the name_scope option as an alternative.
name_scope (str|None) – If set, uses this custom (relative) name scope. If it starts with a “/”, it will be the absolute name scope. It should not end with a “/”. It can be empty, in which case it will not consume a new name scope. This can also be used for parameter sharing. The default is the layer name in most cases, but this logic is in get_absolute_name_scope_prefix() and TFNetwork.layer_creation_scope().
param_device (str|None) – e.g. “CPU”, etc. any valid name for tf.device. see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/util/device_name_utils.h
L2 (float|None) – for constraints
darc1 (float|None) – for constraints. see Generalization in Deep Learning, https://arxiv.org/abs/1710.05468
spatial_smoothing (float|None) – see returnn.tf.util.basic.spatial_smoothing_energy()
param_variational_noise (float|None) – adds variational noise to the params during training
param_dropout (float|None) – dropout on params (weight dropout) during training
param_dropout_min_ndim (int|None) – if param dropout is enabled, only use it for params whose ndim >= this. E.g. it might make sense to disable it for bias params or scalars, so set param_dropout_min_ndim=2.
updater_opts (dict[str]|None) – accepts similar opts as TFUpdater, e.g. “optimizer”, “learning_rate”, …
is_output_layer (bool|None) – triggers the construction of this layer in the root net. Inside a RecLayer, it triggers the explicit accumulation of all frames. Also see the need_last option.
only_on_eval (bool) – if True, this layer will only be calculated in eval
only_on_search (bool) – if True, this layer will only be calculated when search is done
copy_output_loss_from_source_idx (int|None) – if set, will copy output_loss from this source
batch_norm (bool|dict) – see self.batch_norm()
initial_output (str|float) – used for recurrent layer, see self.get_rec_initial_output()
state – explicitly defines the rec state. initial_state would define the initial state (in the first frame)
need_last (bool) – Inside RecLayer, make sure that we can access the last frame. Similar to is_output_layer, but this is specifically about the last frame, i.e. it does not trigger accumulation.
rec_previous_layer (LayerBase|None) – via the recurrent layer, the layer (template) which represents our past. You would not explicitly set this in a config. It is set automatically and internally via RecLayer.
encapsulate (bool) – mostly relevant for SubnetworkLayer and similar. If True, all sub layers will be created and covered in functions like get_rec_initial_extra_outputs(), and the logic in cls_get_sub_network() will not be used. If False, the logic in cls_get_sub_network() will be used.
collocate_with (list[str]|None) – in the rec layer, collocate with the specified other layers
trainable (bool) – whether the parameters of this layer will be trained. Default is True. However, if this is inside a subnetwork, all the parent layers must be set to trainable, otherwise the parameters will not be trainable.
custom_param_importer (str|callable|None) – used by set_param_values_by_dict()
register_as_extern_data (str|None) – registers output in network.extern_data
control_dependencies_on_output (None|((LayerBase)->list[tf.Operation])) – This is mostly to perform some checks after the layer output has been computed, before the layer output is used anywhere else. There is also the IdentityLayer with the option control_dependencies.
debug_print_layer_output (None|bool|dict[str]) – same as global config option but per layer
_name (str) – just for internal construction, should be the same as name
_network (returnn.tf.network.TFNetwork) – just for internal construction, should be the same as network
_src_common_search_choices (None|SearchChoices) – set via SearchChoices.translate_to_common_search_beam()

layer_class: Optional[str] = 'train_flag'[source]¶

classmethod transform_config_dict(d, network, get_layer)[source]¶

Parameters:

d (dict[str]) – will modify inplace
network (returnn.tf.network.TFNetwork)
get_layer

classmethod get_out_data_from_opts(name, **kwargs)[source]¶

Parameters:: name (str)
Return type:: Data

class returnn.tf.layers.basic.GlobalTrainStepLayer(**kwargs)[source]¶

Returns the global train step (int64 scalar).

Usually the arguments, when specified in the network dict, go through transform_config_dict() before they are passed here. See TFNetwork.construct_from_dict().

Parameters:

name (str)
network (returnn.tf.network.TFNetwork)
output (Data) – Set a specific output instead of using get_out_data_from_opts()
n_out (NotSpecified|None|int) – output dim
out_dim (returnn.tensor.Dim|None) – output feature dim tag
out_type (dict[str]) – kwargs for Data class. more explicit than n_out.
out_shape (set[returnn.tensor.Dim|returnn.tf.util.data._MarkedDim]|tuple|list|None) – verifies the output shape (dim tags). See Data.verify_out_shape().
sources (list[LayerBase]) – via self.transform_config_dict()
in_dim (returnn.tensor.Dim|None) – input feature dim tag
target (str|list[str]|None) – if some loss is set, this is the target data-key, i.e. network.extern_data.get_data(target). alternatively, this also can be a layer name.
_target_layers (dict[str,LayerBase]|None) – if target.startswith(“layer:”), then this is target -> layer
size_target (str|None) – like target but this is only used to set our output size in case of training
loss (Loss|None) – via transform_config_dict(). Every layer can have one loss (of type Loss), or no loss. In the net dict, it is specified as a string. In TFNetwork, all losses from all layers will be collected. That is what TFUpdater.Updater will use for training.
reuse_params (ReuseParams|None) – if given, will optionally reuse the params. See self.var_creation_scope(). See also the name_scope option as an alternative.
name_scope (str|None) – If set, uses this custom (relative) name scope. If it starts with a “/”, it will be the absolute name scope. It should not end with a “/”. It can be empty, in which case it will not consume a new name scope. This can also be used for parameter sharing. The default is the layer name in most cases, but this logic is in get_absolute_name_scope_prefix() and TFNetwork.layer_creation_scope().
param_device (str|None) – e.g. “CPU”, etc. any valid name for tf.device. see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/util/device_name_utils.h
L2 (float|None) – for constraints
darc1 (float|None) – for constraints. see Generalization in Deep Learning, https://arxiv.org/abs/1710.05468
spatial_smoothing (float|None) – see returnn.tf.util.basic.spatial_smoothing_energy()
param_variational_noise (float|None) – adds variational noise to the params during training
param_dropout (float|None) – dropout on params (weight dropout) during training
param_dropout_min_ndim (int|None) – if param dropout is enabled, only use it for params whose ndim >= this. E.g. it might make sense to disable it for bias params or scalars, so set param_dropout_min_ndim=2.
updater_opts (dict[str]|None) – accepts similar opts as TFUpdater, e.g. “optimizer”, “learning_rate”, …
is_output_layer (bool|None) – triggers the construction of this layer in the root net. Inside a RecLayer, it triggers the explicit accumulation of all frames. Also see the need_last option.
only_on_eval (bool) – if True, this layer will only be calculated in eval
only_on_search (bool) – if True, this layer will only be calculated when search is done
copy_output_loss_from_source_idx (int|None) – if set, will copy output_loss from this source
batch_norm (bool|dict) – see self.batch_norm()
initial_output (str|float) – used for recurrent layer, see self.get_rec_initial_output()
state – explicitly defines the rec state. initial_state would define the initial state (in the first frame)
need_last (bool) – Inside RecLayer, make sure that we can access the last frame. Similar to is_output_layer, but this is specifically about the last frame, i.e. it does not trigger accumulation.
rec_previous_layer (LayerBase|None) – via the recurrent layer, the layer (template) which represents our past. You would not explicitly set this in a config. It is set automatically and internally via RecLayer.
encapsulate (bool) – mostly relevant for SubnetworkLayer and similar. If True, all sub layers will be created and covered in functions like get_rec_initial_extra_outputs(), and the logic in cls_get_sub_network() will not be used. If False, the logic in cls_get_sub_network() will be used.
collocate_with (list[str]|None) – in the rec layer, collocate with the specified other layers
trainable (bool) – whether the parameters of this layer will be trained. Default is True. However, if this is inside a subnetwork, all the parent layers must be set to trainable, otherwise the parameters will not be trainable.
custom_param_importer (str|callable|None) – used by set_param_values_by_dict()
register_as_extern_data (str|None) – registers output in network.extern_data
control_dependencies_on_output (None|((LayerBase)->list[tf.Operation])) – This is mostly to perform some checks after the layer output has been computed, before the layer output is used anywhere else. There is also the IdentityLayer with the option control_dependencies.
debug_print_layer_output (None|bool|dict[str]) – same as global config option but per layer
_name (str) – just for internal construction, should be the same as name
_network (returnn.tf.network.TFNetwork) – just for internal construction, should be the same as network
_src_common_search_choices (None|SearchChoices) – set via SearchChoices.translate_to_common_search_beam()

layer_class: Optional[str] = 'global_train_step'[source]¶

classmethod transform_config_dict(d, network, get_layer)[source]¶

Parameters:

d (dict[str]) – will modify inplace
network (returnn.tf.network.TFNetwork)
get_layer

classmethod get_out_data_from_opts(name, **kwargs)[source]¶

Parameters:: name (str)
Return type:: Data

class returnn.tf.layers.basic.AccumulateMeanLayer(exp_average, axes='bt', initial_value=None, is_prob_distribution=None, **kwargs)[source]¶

Accumulates the mean of the input (in training) (over batch-dim and time-dim by default). It’s similar to ReduceLayer.

Parameters:

exp_average (float) – momentum in exponential average calculation
axes (int|list[str]|str) – the axes to reduce. must contain batch and time.
initial_value (float) – how to initialize the variable which accumulates the mean
is_prob_distribution (bool) – if provided, better default for initial_value
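
As a rough illustration of what exp_average controls, the sketch below keeps a running mean with an exponential moving average. The update rule acc + exp_average * (batch_mean - acc) and the helper name update_running_mean are assumptions made for this example; they are not taken from the RETURNN source.

```python
def update_running_mean(acc, batch_values, exp_average):
    """Return the running mean after one batch (hypothetical helper)."""
    batch_mean = sum(batch_values) / len(batch_values)
    # Assumed update rule: move a fraction `exp_average` towards the batch mean.
    return acc + exp_average * (batch_mean - acc)


acc = 0.0  # plays the role of initial_value
for batch in ([1.0, 3.0], [2.0, 2.0], [0.0, 4.0]):
    acc = update_running_mean(acc, batch, exp_average=0.5)
```

With a small exp_average the accumulator changes slowly, so a sensible initial_value matters more.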

layer_class: Optional[str] = 'accumulate_mean'[source]¶

classmethod get_out_data_from_opts(axes='bt', **kwargs)[source]¶

Parameters:: axes (str)
Return type:: Data

class returnn.tf.layers.basic.LossLayer(loss_, target_=None, use_error=False, **kwargs)[source]¶

This layer wraps a Loss calculation as a layer, i.e. the loss will be calculated and returned by the layer. However, this loss will not be used as a loss by the updater. If you want to use it as a loss, you can use the AsIsLoss, i.e. write "loss": "as_is".

Note that the loss options for the wrapped loss need to be provided via loss_opts_, and it does not apply any reduce function.

Note

The LossLayer might be deprecated in the future in favor of implementing the losses as actual layers.

If you want to define a loss inside the network, it is recommended to define it explicitly. An example could be:

"se_loss": {"class": "eval", "eval": "(source(0) - source(1)) ** 2", "from": ["output", "data:classes"]}

Followed by an e.g. mean reduce if needed:

"mse_loss": {"class": "reduce", "mode": "mean", "axis": "F", "from": "se_loss"}

loss_ and related params have the postfix _ to distinguish them from the loss options, which are used by the network and updater for training. Some of these (e.g. loss_opts_) are handled in transform_config_dict().
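
A LossLayer entry in the network dict might look like the following sketch. The layer names, the choice of "ce", and the assumption that loss_ is given by its class_name and target_ by a layer name (both resolved in transform_config_dict()) are illustrative and should be checked against the RETURNN version in use.

```python
# Hypothetical network fragment; layer names and sizes are made up.
network = {
    "output": {"class": "softmax", "from": "data", "n_out": 10},
    "ce_value": {
        "class": "loss",
        "from": "output",
        "loss_": "ce",              # which Loss to wrap (assumed: given by class_name)
        "target_": "data:classes",  # assumed: resolved to a layer
    },
}
```

The output of "ce_value" is then an ordinary layer output; it does not contribute to training unless another layer uses it with "loss": "as_is".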

Parameters:

loss_ (Loss)
target_ (LayerBase|None)
use_error (bool) – whether to output the loss error instead of the loss value

layer_class: Optional[str] = 'loss'[source]¶

recurrent = True[source]¶

get_sub_layer(layer_name)[source]¶

Parameters:: layer_name (str) – sub layer name
Return type:: LayerBase|None

classmethod get_available_sub_layer_names(parent_layer_kwargs)[source]¶

Parameters:: parent_layer_kwargs (dict[str])
Return type:: list[str]

classmethod get_sub_layer_out_data_from_opts(layer_name, parent_layer_kwargs)[source]¶

Parameters:

layer_name (str) – sub layer name
parent_layer_kwargs (dict[str])

Returns:

Data template, class type of sub-layer, layer opts (transformed)

Return type:

(Data, type, dict[str])|None

get_dep_layers()[source]¶

Return type:: list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]¶

Parameters:

d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer

classmethod get_out_data_from_opts(name, sources, target_=None, **kwargs)[source]¶

Parameters:

name (str)
sources (list[LayerBase])
target_ (LayerBase|None)

Return type:

Data

class returnn.tf.layers.basic.ForcedAlignmentLayer(align_target, topology, input_type, blank_idx=-1, blank_included=False, **kwargs)[source]¶

Calculates a forced alignment via the Viterbi algorithm.

Parameters:

align_target (LayerBase)
topology (str) – e.g. “ctc” or “rna” (RNA is CTC without label loop)
input_type (str) – “log_prob” or “prob”
blank_idx (int) – vocab index of the blank symbol
blank_included (bool) – whether blank token of the align target is included in the vocabulary
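
A possible network-dict entry for this layer is sketched below. The layer names and output size are hypothetical, and it is an assumption that align_target is written as a layer name in the config and resolved to a layer by transform_config_dict().

```python
# Hypothetical fragment: align the softmax output against the target labels.
network = {
    # 10 labels plus one blank symbol (assumed), blank at the last index.
    "output": {"class": "softmax", "from": "data", "n_out": 11},
    "alignment": {
        "class": "forced_align",
        "from": "output",
        "align_target": "data:classes",
        "topology": "rna",      # or "ctc"
        "input_type": "prob",   # the softmax output is in probability space
    },
}
```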

layer_class: Optional[str] = 'forced_align'[source]¶

classmethod get_sub_layer_out_data_from_opts(layer_name, parent_layer_kwargs)[source]¶

Parameters:

layer_name (str) – sub layer name
parent_layer_kwargs (dict[str])

Returns:

Data template, class type of sub-layer, layer opts (transformed)

Return type:

(Data, type, dict[str])|None

get_sub_layer(layer_name)[source]¶

Parameters:: layer_name (str)
Return type:: LayerBase|None

classmethod get_available_sub_layer_names(parent_layer_kwargs)[source]¶

Parameters:: parent_layer_kwargs (dict[str])
Return type:: list[str]

get_dep_layers()[source]¶

Return type:: list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]¶

Parameters:

d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer

classmethod get_out_data_from_opts(name, sources, **kwargs)[source]¶

Parameters:

name (str)
sources (list[LayerBase])

Return type:

Data

class returnn.tf.layers.basic.SparseSoftmaxCrossEntropyWithLogitsLayer(logits, targets, axis=None, **kwargs)[source]¶

This is a simple wrapper for tf.nn.sparse_softmax_cross_entropy_with_logits.

Parameters:

logits (LayerBase)
targets (LayerBase)
axis (Dim|str|None) – feature dim by default
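
For reference, the quantity computed per frame is -log(softmax(logits)[target]). The following pure-Python sketch (function name chosen for this example) shows the computation for a single frame, without the batching and axis handling the layer performs.

```python
import math


def sparse_softmax_cross_entropy(logits, target):
    """-log(softmax(logits)[target]) for one frame; `target` is a class index."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_norm = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_norm - logits[target]


loss_uniform = sparse_softmax_cross_entropy([0.0, 0.0], target=0)
loss_confident = sparse_softmax_cross_entropy([10.0, 0.0], target=0)
```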

layer_class: Optional[str] = 'sparse_softmax_cross_entropy_with_logits'[source]¶

get_dep_layers()[source]¶

Return type:: list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]¶

Parameters:

d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer

classmethod get_out_data_from_opts(name, logits, axis=None, **kwargs)[source]¶

Parameters:

name (str)
logits (LayerBase)
axis (Dim|str|None) – feature dim by default

class returnn.tf.layers.basic.CtcLossLayer(logits, targets, logits_normalized=False, blank_index=-1, max_approx=False, label_loop: bool = True, **kwargs)[source]¶

Calculates the CTC loss.

Internally, this uses returnn.tf.native_op.ctc_loss() which is equivalent to tf.nn.ctc_loss but more efficient.

Output is of shape [B].
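
To make the effect of max_approx concrete, here is a small pure-Python sketch of the standard CTC forward recursion for a single sequence. It is a textbook version written for this example, not the RETURNN native op: it takes already normalized log-probabilities of shape [T][D] instead of batched logits, always allows label loops, and the name ctc_neg_log_score is hypothetical.

```python
import math

NEG_INF = float("-inf")


def logsumexp(a, b):
    """log(exp(a) + exp(b)), safe for -inf inputs."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def ctc_neg_log_score(log_probs, targets, blank, max_approx=False):
    """Negative log score of `targets` given per-frame log-probabilities."""
    combine = max if max_approx else logsumexp
    # Extended label sequence: blank, l1, blank, l2, ..., blank
    ext = [blank]
    for label in targets:
        ext += [label, blank]
    num_states = len(ext)
    # alpha[s]: log score of all (or the best) partial paths ending in state s
    alpha = [NEG_INF] * num_states
    alpha[0] = log_probs[0][ext[0]]
    if num_states > 1:
        alpha[1] = log_probs[0][ext[1]]
    for frame in log_probs[1:]:
        new_alpha = [NEG_INF] * num_states
        for s in range(num_states):
            score = alpha[s]  # stay in the same state
            if s >= 1:
                score = combine(score, alpha[s - 1])
            # Skip the blank between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                score = combine(score, alpha[s - 2])
            if score != NEG_INF:
                new_alpha[s] = score + frame[ext[s]]
        alpha = new_alpha
    total = alpha[-1]
    if num_states > 1:
        total = combine(total, alpha[-2])
    return -total


# Two frames, vocabulary {0: "a", 1: blank}, uniform probabilities.
frames = [[math.log(0.5), math.log(0.5)]] * 2
full_sum = ctc_neg_log_score(frames, [0], blank=1)
viterbi = ctc_neg_log_score(frames, [0], blank=1, max_approx=True)
```

In this toy case three alignments ("a a", "a blank", "blank a") each have probability 0.25, so the full sum scores 0.75 while the max approximation keeps only one path with 0.25.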

Parameters:

logits (LayerBase) – (before softmax). shape [B,T,D]
targets (LayerBase) – sparse. shape [B,T]
logits_normalized (bool) – whether the logits are already normalized (e.g. via log-softmax)
blank_index (int) – vocab index of the blank symbol
max_approx (bool) – if True, use max instead of sum over alignments (max approx, Viterbi)
label_loop (bool)

layer_class: Optional[str] = 'ctc_loss'[source]¶

recurrent = True[source]¶

get_dep_layers()[source]¶

Return type:: list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]¶

Parameters:

d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer

classmethod get_out_data_from_opts(name, **kwargs)[source]¶

Parameters:: name (str)

class returnn.tf.layers.basic.FastBaumWelchLayer(align_target, align_target_key=None, ctc_opts=None, sprint_opts=None, input_type='log_prob', tdp_scale=1.0, am_scale=1.0, min_prob=0.0, staircase_seq_len_source=None, **kwargs)[source]¶

Calls fast_baum_welch() or fast_baum_welch_by_sprint_automata(). The input is expected to be +log scores, e.g. the output of a log-softmax.

Parameters:

align_target (str) – e.g. “sprint”, “ctc” or “staircase”
align_target_key (str|None) – e.g. “classes”, used for e.g. align_target “ctc”
ctc_opts (dict[str]) – used for align_target “ctc”
sprint_opts (dict[str]) – used for Sprint (RASR) for align_target “sprint”
input_type (str) – “log_prob” or “prob”
tdp_scale (float)
am_scale (float)
min_prob (float) – clips the minimum prob (value in [0,1])
staircase_seq_len_source (LayerBase|None)

layer_class: Optional[str] = 'fast_bw'[source]¶

recurrent = True[source]¶

classmethod transform_config_dict(d, network, get_layer)[source]¶

Parameters:

d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer

classmethod get_out_data_from_opts(name, sources, **kwargs)[source]¶

Parameters:

name (str)
sources (list[LayerBase])

Return type:

Data

class returnn.tf.layers.basic.GradientLayer(y: LayerBase, x: LayerBase, **kwargs)[source]¶

Calculates the gradient of y w.r.t. x.
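
A network-dict sketch for this layer follows. The layer names are hypothetical, and it is an assumption that y and x are written as layer names in the config and resolved to layers by transform_config_dict().

```python
# Hypothetical fragment: gradient of a per-frame scalar w.r.t. a hidden layer.
network = {
    "hidden": {"class": "linear", "activation": "tanh", "n_out": 8, "from": "data"},
    "energy": {"class": "reduce", "mode": "sum", "axis": "F", "from": "hidden"},
    "energy_grad": {"class": "gradient", "y": "energy", "x": "hidden"},
}
```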

Parameters:

y (LayerBase)
x (LayerBase)

layer_class: Optional[str] = 'gradient'[source]¶

classmethod transform_config_dict(d, network, get_layer)[source]¶

Parameters:

d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer

classmethod get_out_data_from_opts(y: LayerBase, x: LayerBase, name: str, **kwargs)[source]¶

Parameters:

y (LayerBase)
x (LayerBase)
name (str)

Return type:

Data

class returnn.tf.layers.basic.SyntheticGradientLayer(gradient, meta_loss_scale=1.0, **kwargs)[source]¶

This is a generalized way to replace the true gradient with any kind of predicted gradient. This makes it possible to implement the idea from:

Decoupled Neural Interfaces using Synthetic Gradients, https://arxiv.org/abs/1608.05343

Parameters:

gradient (LayerBase)
meta_loss_scale (float)

layer_class: Optional[str] = 'synthetic_gradient'[source]¶

classmethod transform_config_dict(d, network, get_layer)[source]¶

Parameters:

d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer

classmethod get_out_data_from_opts(sources, name, **kwargs)[source]¶

Parameters:

sources (list[LayerBase])
name (str)

Return type:

Data

class returnn.tf.layers.basic.TikhonovRegularizationLayer(meta_loss_scale=1.0, **kwargs)[source]¶

Adds the Tikhonov regularization as a meta-loss (see returnn.tf.util.basic.MetaLosses).

Parameters:: meta_loss_scale (float)

layer_class: Optional[str] = 'tikhonov_regularization'[source]¶

class returnn.tf.layers.basic.FramewiseStatisticsLayer(sil_label_idx, histogram_num_bins=20, **kwargs)[source]¶

Collects various statistics (such as FER, etc) on the sources. The tensors will get stored in self.stats which will be collected by TFEngine.

Usually the arguments, when specified in the network dict, are going through transform_config_dict(), before they are passed to here. See TFNetwork.construct_from_dict().

Parameters:

name (str)
network (returnn.tf.network.TFNetwork)
output (Data) – Set a specific output instead of using get_out_data_from_opts()
n_out (NotSpecified|None|int) – output dim
out_dim (returnn.tensor.Dim|None) – output feature dim tag
out_type (dict[str]) – kwargs for Data class. more explicit than n_out.
out_shape (set[returnn.tensor.Dim|returnn.tf.util.data._MarkedDim]|tuple|list|None) – verifies the output shape (dim tags). See Data.verify_out_shape().
sources (list[LayerBase]) – via self.transform_config_dict()
in_dim (returnn.tensor.Dim|None) – input feature dim tag
target (str|list[str]|None) – if some loss is set, this is the target data-key, i.e. network.extern_data.get_data(target). alternatively, this also can be a layer name.
_target_layers (dict[str,LayerBase]|None) – if target.startswith(“layer:”), then this is target -> layer
size_target (str|None) – like target but this is only used to set our output size in case of training
loss (Loss|None) – via transform_config_dict(). Every layer can have one loss (of type Loss), or no loss. In the net dict, it is specified as a string. In TFNetwork, all losses from all layers will be collected. That is what TFUpdater.Updater will use for training.
reuse_params (ReuseParams|None) – if given, will optionally reuse the params. See self.var_creation_scope(). See also the name_scope option as an alternative.
name_scope (str|None) – If set, uses this custom (relative) name scope. If it starts with a “/”, it will be the absolute name scope. It should not end with a “/”. It can be empty, in which case it will not consume a new name scope. This can also be used for parameter sharing. The default is the layer name in most cases, but this logic is in get_absolute_name_scope_prefix() and TFNetwork.layer_creation_scope().
param_device (str|None) – e.g. “CPU”, etc. any valid name for tf.device. see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/util/device_name_utils.h
L2 (float|None) – for constraints
darc1 (float|None) – for constraints. see Generalization in Deep Learning, https://arxiv.org/abs/1710.05468
spatial_smoothing (float|None) – see returnn.tf.util.basic.spatial_smoothing_energy()
param_variational_noise (float|None) – adds variational noise to the params during training
param_dropout (float|None) – dropout on params (weight dropout) during training
param_dropout_min_ndim (int|None) – if param dropout is enabled, only use it for params whose ndim >= this. E.g. it might make sense to disable it for bias params or scalars, so set param_dropout_min_ndim=2.
updater_opts (dict[str]|None) – accepts similar opts as TFUpdater, e.g. “optimizer”, “learning_rate”, …
is_output_layer (bool|None) – triggers the construction of this layer in the root net. Inside a RecLayer, it triggers the explicit accumulation of all frames. Also see the need_last option.
only_on_eval (bool) – if True, this layer will only be calculated in eval
only_on_search (bool) – if True, this layer will only be calculated when search is done
copy_output_loss_from_source_idx (int|None) – if set, will copy output_loss from this source
batch_norm (bool|dict) – see self.batch_norm()
initial_output (str|float) – used for recurrent layer, see self.get_rec_initial_output()
state – explicitly defines the rec state. initial_state would define the initial state (in the first frame)
need_last (bool) – Inside RecLayer, make sure that we can access the last frame. Similar to is_output_layer, but this is specifically about the last frame, i.e. it does not trigger accumulation.
rec_previous_layer (LayerBase|None) – via the recurrent layer, layer (template) which represents the past of us. You would not explicitly set this in a config. This is set automatically, internally, via RecLayer.
encapsulate (bool) – mostly relevant for SubnetworkLayer and similar: If True, all sub layers will be created, and covered in functions like get_rec_initial_extra_outputs(), and the logic in cls_get_sub_network() will not be used. If False, the logic in cls_get_sub_network() will be used.
collocate_with (list[str]|None) – in the rec layer, collocate with the specified other layers
trainable (bool) – whether the parameters of this layer will be trained. Default is True. However, if this is inside a subnetwork, all the parent layers must be set to trainable, otherwise the parameters will not be trainable.
custom_param_importer (str|callable|None) – used by set_param_values_by_dict()
register_as_extern_data (str|None) – registers output in network.extern_data
control_dependencies_on_output (None|((LayerBase)->list[tf.Operation])) – This is mostly to perform some checks after the layer output has been computed, before the layer output is used anywhere else. There is also the IdentityLayer with the option control_dependencies.
debug_print_layer_output (None|bool|dict[str]) – same as global config option but per layer
_name (str) – just for internal construction, should be the same as name
_network (returnn.tf.network.TFNetwork) – just for internal construction, should be the same as network
_src_common_search_choices (None|SearchChoices) – set via SearchChoices.translate_to_common_search_beam()

layer_class: Optional[str] = 'framewise_statistics'[source]¶

classmethod get_out_data_from_opts(**kwargs)[source]¶

Return type:: Data

class returnn.tf.layers.basic.PrintLayer(summarize=99, extra_print_args=(), **kwargs)[source]¶

Prints the sources to console/log, via returnn.tf.util.basic.py_print().

Parameters:

summarize (int|None) – passed to py_print()
extra_print_args (list|tuple)

layer_class: Optional[str] = 'print'[source]¶

classmethod get_out_data_from_opts(name, sources, **kwargs)[source]¶

Parameters:

name (str)
sources (list[LayerBase])

Return type:

Data

class returnn.tf.layers.basic.HDFDumpLayer(filename, extra=None, dump_whole_batches=False, labels=None, extend_existing_file=False, dump_per_run=False, **kwargs)[source]¶

Dumps into an HDF file, compatible with HDFDataset.

If there was no error, the HDF file is written to disk under the specified filename, by default at graph reset, via TFNetwork.register_graph_reset_callback(). With dump_per_run, it is instead written after the dataset iteration run loop, via TFNetwork.register_run_finished_callback().

Common usage would be to add this to your network with “is_output_layer”: True, such that you don’t need to make other layers depend on it.

It currently uses SimpleHDFWriter internally.
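
The usage described above could look like this in a network dict. The layer names, output size and file name are placeholders chosen for this sketch.

```python
# Hypothetical fragment: dump the "output" layer without other layers depending on the dump.
network = {
    "output": {"class": "softmax", "from": "data", "n_out": 10},
    "dump_output": {
        "class": "hdf_dump",
        "from": "output",
        "filename": "output_dump.hdf",  # placeholder path
        "is_output_layer": True,        # forces construction of this layer
    },
}
```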

Parameters:

filename (str|(()->str))
extra (None|dict[str,LayerBase])
dump_whole_batches (bool) – dumps the whole batch as a single sequence into the HDF
labels (list[str]|None)
extend_existing_file (bool) – True also means we expect that it exists
dump_per_run (bool) – write via TFNetwork.register_run_finished_callback()

layer_class: Optional[str] = 'hdf_dump'[source]¶

classmethod get_out_data_from_opts(name, sources, **kwargs)[source]¶

Parameters:

name (str)
sources (list[LayerBase])

Return type:

Data

classmethod transform_config_dict(d, network, get_layer)[source]¶

Parameters:

d (dict[str]) – will modify inplace
network (returnn.tf.network.TFNetwork)
get_layer (((str) -> LayerBase)) – function to get or construct another layer

class returnn.tf.layers.basic.ImageSummaryLayer(max_outputs=3, **kwargs)[source]¶

Creates image summaries which can be viewed in TensorBoard. This layer expects the source to be in (T-decoder, T-encoder, B, 1).

Parameters:: max_outputs – number of images to generate per step

layer_class: Optional[str] = 'image_summary'[source]¶

classmethod transform_config_dict(d, network, get_layer)[source]¶

Parameters:

d (dict[str]) – will modify inplace, the loss_opts
network (returnn.tf.network.TFNetwork)
get_layer (((str) -> LayerBase)) – function to get or construct another layer

classmethod get_out_data_from_opts(**kwargs)[source]¶

Return type:: Data

class returnn.tf.layers.basic.CrossEntropyLoss(input_type='prob', focal_loss_factor=0.0, label_smoothing=0.0, label_smoothing_gaussian=False, debug_dump=False, safe_log_opts=None, use_fused=True, fake_upper_bound=None, **kwargs)[source]¶

Cross-Entropy loss. Basically -sum(target * log(output)).

Parameters:

input_type (str) – “prob” (default) or “logits”
focal_loss_factor (float) – see https://arxiv.org/abs/1708.02002. 0 means disabled
label_smoothing (float) – 0.1 is a common default. see returnn.tf.util.basic.smoothing_cross_entropy()
label_smoothing_gaussian (bool) – see returnn.tf.util.basic.smoothing_cross_entropy()
debug_dump (bool)
safe_log_opts (dict[str]) – passed to safe_log()
use_fused (bool) – if possible, use fused ops
fake_upper_bound (float|None) – uses returnn.tf.util.basic.minimum_with_identity_grad(). I.e. you will see a finite loss, but we use the original gradient (which should be safe).
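
The following single-frame sketch shows the basic cross entropy for a sparse target and the effect of focal_loss_factor. The weighting (1 - p) ** focal_loss_factor follows the cited focal loss paper; that RETURNN applies it in exactly this form is an assumption, and the function name is made up for this example.

```python
import math


def cross_entropy(probs, target, focal_loss_factor=0.0):
    """Cross entropy for one frame.

    probs: class probabilities (input_type "prob"); target: index of the correct class.
    """
    p = probs[target]
    loss = -math.log(p)
    # Focal loss down-weights frames that are already classified well.
    return (1.0 - p) ** focal_loss_factor * loss


plain = cross_entropy([0.5, 0.5], target=0)
focal = cross_entropy([0.5, 0.5], target=0, focal_loss_factor=2.0)
```

With focal_loss_factor=0 the weight is 1, which matches the "0 means disabled" note above.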

class_name: str = 'ce'[source]¶

need_target = True[source]¶

get_output_target_scores()[source]¶

Returns:: shape (time_flat,), type float32, std-prob space
Return type:: tf.Tensor

get_value()[source]¶

Return type:: tf.Tensor

class returnn.tf.layers.basic.BinaryCrossEntropyLoss(pos_weight=None, **kwargs)[source]¶

Binary cross entropy. We expect the output as logits, not in probability space! Per frame: -mean(target * log(sigmoid(output)) + (1 - target) * log(1 - sigmoid(output)))

Parameters:: pos_weight (float|None) – weight of positive labels, see tf.nn.weighted_cross_entropy_with_logits.
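
A minimal pure-Python sketch of the per-frame computation, working directly on logits in a numerically stable way. The pos_weight handling follows the formula documented for tf.nn.weighted_cross_entropy_with_logits; the helper names and the default pos_weight=1.0 (instead of None) are choices made for this example.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def binary_cross_entropy(logits, targets, pos_weight=1.0):
    """Mean (weighted) binary cross entropy over the given logits and targets."""
    total = 0.0
    for x, t in zip(logits, targets):
        # log(1 - sigmoid(x)) == log(sigmoid(-x))
        total -= pos_weight * t * log_sigmoid(x) + (1.0 - t) * log_sigmoid(-x)
    return total / len(logits)


unweighted = binary_cross_entropy([0.0, 0.0], [1.0, 0.0])
weighted = binary_cross_entropy([0.0], [1.0], pos_weight=2.0)
```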

class_name: str = 'bin_ce'[source]¶

get_value()[source]¶

Return type:: tf.Tensor

get_error()[source]¶

Returns:: frame error rate as a scalar value with the default self.reduce_func (see also self.get_value)
Return type:: tf.Tensor

class returnn.tf.layers.basic.GenericCELoss(**kwargs)[source]¶

Some generalization of cross entropy.

Parameters:

base_network (returnn.tf.network.TFNetwork)
use_flatten_frames (bool) – will use returnn.tf.util.basic.flatten_with_seq_len_mask()
use_normalized_loss (bool) – the loss used in optimization will be normalized
custom_norm_factor (float|function|None) – The standard norm factor is 1/sum(target_seq_len) if the target has a time-axis, or 1/sum(output_seq_len) if there is no target and the output has a time-axis, or 1 otherwise. (See Loss.init() for details.) This is used for proper normalization of accumulated loss/error per epoch and also proper normalization per batch for reporting, no matter if use_normalized_loss is True or False. If you want to change this norm factor, you can set this. As a function, it takes (self=self, output=output, layer=layer) and returns a float scalar.
custom_inv_norm_factor (LayerBase|None) – inverse of custom_norm_factor. Here we allow to pass a layer. Here we also allow to pass any shape and it will automatically be reduced via sum. So you could simply pass target_seq_len directly here. Basically, for all reporting, it uses sum(loss) * sum(custom_inv_norm_factor).
scale (float) – additional scale factor for the loss
_check_output_before_softmax (bool|None)

class_name: str = 'generic_ce'[source]¶

get_value()[source]¶

Return type:: tf.Tensor

class returnn.tf.layers.basic.CtcLoss(target_collapse_repeated=False, auto_clip_target_len=False, output_in_log_space=False, beam_width=100, ctc_opts=None, use_native=False, use_viterbi=False, **kwargs)[source]¶

Connectionist Temporal Classification (CTC) loss. Basically a wrapper around tf.nn.ctc_loss.

Parameters:

target_collapse_repeated (bool) – like preprocess_collapse_repeated option for CTC. used for sparse_labels().
auto_clip_target_len (bool) – see self._get_target_sparse_labels().
output_in_log_space (bool) – False -> output expected in prob space. see self.get_output_logits
beam_width (int) – used in eval
ctc_opts (dict[str]|None) – other kwargs used for tf.nn.ctc_loss
use_native (bool) – use our native implementation (TFNativeOp.ctc_loss())
use_viterbi (bool) – instead of full-sum, use only best path (via ctc_loss_viterbi())

class_name: str = 'ctc'[source]¶

recurrent = True[source]¶

init(**kwargs)[source]¶: See super.

get_output_logits()[source]¶

Returns:: outputs in log-space / logits
Return type:: tf.Tensor

get_soft_alignment()[source]¶

Also called the Baum-Welch-alignment. This is basically p_t(s|x_1^T,w_1^N), where s are the output labels (including blank), and w are the real target labels.

Returns:: shape (time, batch, dim)
Return type:: tf.Tensor

get_value()[source]¶

Return type:: tf.Tensor

get_error()[source]¶

Return type:: tf.Tensor

classmethod get_auto_output_layer_dim(target_dim)[source]¶

Parameters:: target_dim (returnn.tensor.Dim)
Return type:: returnn.tensor.Dim

class returnn.tf.layers.basic.EditDistanceLoss(debug_print=False, label_map=None, ctc_decode=False, output_in_log_space=False, **kwargs)[source]¶

Note that this loss is not differentiable, thus it’s only for keeping statistics.

Parameters:

debug_print (bool) – will tf.Print the sequence
label_map (dict[int,int]|None) – before calculating the edit-distance, will apply this map
ctc_decode (bool) – True -> expects dense output and does CTC decode, False -> expects sparse labels in output
output_in_log_space (bool) – False -> dense output expected in prob space. see self.get_output_logits
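
For readers unfamiliar with the metric, the edit distance (Levenshtein distance) between a hypothesis and a reference label sequence can be computed as in this pure-Python sketch. It only illustrates the metric itself; the TF implementation used by the loss is not shown here.

```python
def edit_distance(hyp, ref):
    """Levenshtein distance between two label sequences."""
    prev = list(range(len(ref) + 1))  # distances for the empty hypothesis prefix
    for i, h in enumerate(hyp, start=1):
        cur = [i]
        for j, r in enumerate(ref, start=1):
            cur.append(min(
                prev[j] + 1,             # delete h
                cur[j - 1] + 1,          # insert r
                prev[j - 1] + (h != r),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


distance = edit_distance([1, 2, 3], [1, 3])
```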

class_name: str = 'edit_distance'[source]¶

recurrent = True[source]¶

init(output, output_with_activation=None, target=None, **kwargs)[source]¶

Parameters:

output (Data) – generated output
output_with_activation (OutputWithActivation|None)
target (Data) – reference target from dataset

get_output_logits()[source]¶

Returns:: outputs in log-space / logits
Return type:: tf.Tensor

get_error()[source]¶

Return type:: tf.Tensor

get_value()[source]¶

Return type:: None

class returnn.tf.layers.basic.BleuLoss(**kwargs)[source]¶

Note that this loss is not differentiable, thus it’s only for keeping statistics. Also, BLEU is a score, i.e. the higher, the better. Thus, to interpret it as a loss or error, we take the negative value.

Parameters:

base_network (returnn.tf.network.TFNetwork)
use_flatten_frames (bool) – will use returnn.tf.util.basic.flatten_with_seq_len_mask()
use_normalized_loss (bool) – the loss used in optimization will be normalized
custom_norm_factor (float|function|None) – The standard norm factor is 1/sum(target_seq_len) if the target has a time-axis, or 1/sum(output_seq_len) if there is no target and the output has a time-axis, or 1 otherwise. (See Loss.init() for details.) This is used for proper normalization of accumulated loss/error per epoch and also proper normalization per batch for reporting, no matter if use_normalized_loss is True or False. If you want to change this norm factor, you can set this. As a function, it takes (self=self, output=output, layer=layer) and returns a float scalar.
custom_inv_norm_factor (LayerBase|None) – inverse of custom_norm_factor. Here we allow to pass a layer. Here we also allow to pass any shape and it will automatically be reduced via sum. So you could simply pass target_seq_len directly here. Basically, for all reporting, it uses sum(loss) * sum(custom_inv_norm_factor).
scale (float) – additional scale factor for the loss
_check_output_before_softmax (bool|None)

class_name: str = 'bleu'[source]¶

recurrent = True[source]¶

init(output, output_with_activation=None, target=None, **kwargs)[source]¶

Parameters:

output (Data) – generated output
output_with_activation (OutputWithActivation|None)
target (Data) – reference target from dataset

get_error()[source]¶

Return type:: tf.Tensor

get_value()[source]¶

Return type:: None

class returnn.tf.layers.basic.ExpectedLoss(loss, loss_kind, norm_scores=True, norm_scores_stop_gradient=True, divide_beam_size=True, subtract_average_loss=True, loss_correction_grad_only=False, **kwargs)[source]¶

Given the search beam scores, this loss takes the error or value of another loss and calculates the expected loss. Sometimes also called minimum Bayes risk.

Parameters:

loss (Loss)
loss_kind (str) – “error” or “value”. whether to use loss.get_error() or loss.get_value()
norm_scores (bool)
norm_scores_stop_gradient (bool)
divide_beam_size (bool)
subtract_average_loss (bool)
loss_correction_grad_only (bool)
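
The sketch below illustrates one plausible reading of these options for a single beam: the beam scores (assumed to be in +log space) are renormalized within the beam, the average loss is subtracted as a baseline, and the weighted sum is divided by the beam size. How RETURNN combines the options in detail is an assumption here, and the function is a hypothetical re-implementation, not library code.

```python
import math


def expected_loss(beam_log_scores, beam_losses, norm_scores=True,
                  subtract_average_loss=True, divide_beam_size=True):
    """Expected loss over one beam (illustrative only)."""
    scores = list(beam_log_scores)
    if norm_scores:
        # Renormalize so that the probabilities within the beam sum to 1.
        m = max(scores)
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        scores = [s - log_norm for s in scores]
    losses = list(beam_losses)
    if subtract_average_loss:
        # Subtracting the beam average acts as a baseline.
        avg = sum(losses) / len(losses)
        losses = [x - avg for x in losses]
    value = sum(math.exp(s) * x for s, x in zip(scores, losses))
    if divide_beam_size:
        value /= len(losses)
    return value


raw = expected_loss([0.0, 0.0], [1.0, 3.0],
                    subtract_average_loss=False, divide_beam_size=False)
centered = expected_loss([0.0, 0.0], [1.0, 3.0])
```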

class_name: str = 'expected_loss'[source]¶

recurrent = True[source]¶

classmethod transform_config_dict(d, network, get_layer)[source]¶

Parameters:

d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer

init(**kwargs)[source]¶: Overwrites super. Get search choices.

get_value()[source]¶

Return type:: tf.Tensor

get_error()[source]¶

Return type:: None

class returnn.tf.layers.basic.DeepClusteringLoss(embedding_dimension, nr_of_sources, **kwargs)[source]¶

Cost function used for deep clustering as described in [Hershey & Chen+, 2016]: “Deep clustering: Discriminative embeddings for segmentation and separation”

Parameters:

embedding_dimension (int)
nr_of_sources (int)

class_name: str = 'deep_clustering'[source]¶

get_error()[source]¶

Returns:: frame error rate as a scalar value
Return type:: tf.Tensor | None

get_value()[source]¶

Return type:: tf.Tensor

class returnn.tf.layers.basic.L1Loss(base_network, use_flatten_frames=True, use_normalized_loss=False, custom_norm_factor=None, custom_inv_norm_factor=None, scale=1.0, _check_output_before_softmax=None)[source]¶

L1-distance loss. sum(|target - output|).

Parameters:

base_network (returnn.tf.network.TFNetwork)
use_flatten_frames (bool) – will use returnn.tf.util.basic.flatten_with_seq_len_mask()
use_normalized_loss (bool) – the loss used in optimization will be normalized
custom_norm_factor (float|function|None) – The standard norm factor is 1/sum(target_seq_len) if the target has a time-axis, or 1/sum(output_seq_len) if there is no target and the output has a time-axis, or 1 otherwise. (See Loss.init() for details.) This is used for proper normalization of accumulated loss/error per epoch and also proper normalization per batch for reporting, no matter if use_normalized_loss is True or False. If you want to change this norm factor, you can set this. As a function, it takes (self=self, output=output, layer=layer) and returns a float scalar.
custom_inv_norm_factor (LayerBase|None) – inverse of custom_norm_factor. Here we allow to pass a layer. Here we also allow to pass any shape and it will automatically be reduced via sum. So you could simply pass target_seq_len directly here. Basically, for all reporting, it uses sum(loss) * sum(custom_inv_norm_factor).
scale (float) – additional scale factor for the loss
_check_output_before_softmax (bool|None)

class_name: str = 'l1'[source]¶

get_value()[source]¶

Return type:: tf.Tensor

class returnn.tf.layers.basic.MeanSquaredError(base_network, use_flatten_frames=True, use_normalized_loss=False, custom_norm_factor=None, custom_inv_norm_factor=None, scale=1.0, _check_output_before_softmax=None)[source]¶

The generic mean squared error loss function

Parameters:

base_network (returnn.tf.network.TFNetwork)
use_flatten_frames (bool) – will use returnn.tf.util.basic.flatten_with_seq_len_mask()
use_normalized_loss (bool) – the loss used in optimization will be normalized
custom_norm_factor (float|function|None) – The standard norm factor is 1/sum(target_seq_len) if the target has a time-axis, or 1/sum(output_seq_len) if there is no target and the output has a time-axis, or 1 otherwise. (See Loss.init() for details.) This is used for proper normalization of accumulated loss/error per epoch and also proper normalization per batch for reporting, no matter if use_normalized_loss is True or False. If you want to change this norm factor, you can set this. As a function, it takes (self=self, output=output, layer=layer) and returns a float scalar.
custom_inv_norm_factor (LayerBase|None) – inverse of custom_norm_factor. Here we allow to pass a layer. Here we also allow to pass any shape and it will automatically be reduced via sum. So you could simply pass target_seq_len directly here. Basically, for all reporting, it uses sum(loss) * sum(custom_inv_norm_factor).
scale (float) – additional scale factor for the loss
_check_output_before_softmax (bool|None)

class_name: str = 'mse'[source]¶

get_value()[source]¶

Return type:: tf.Tensor

class returnn.tf.layers.basic.MeanL1Loss(base_network, use_flatten_frames=True, use_normalized_loss=False, custom_norm_factor=None, custom_inv_norm_factor=None, scale=1.0, _check_output_before_softmax=None)[source]¶

Like MSE loss, but with absolute difference

Parameters:

base_network (returnn.tf.network.TFNetwork)
use_flatten_frames (bool) – will use returnn.tf.util.basic.flatten_with_seq_len_mask()
use_normalized_loss (bool) – the loss used in optimization will be normalized
custom_norm_factor (float|function|None) – The standard norm factor is 1/sum(target_seq_len) if the target has a time-axis, or 1/sum(output_seq_len) if there is no target and the output has a time-axis, or 1 otherwise. (See Loss.init() for details.) This is used for proper normalization of accumulated loss/error per epoch and also proper normalization per batch for reporting, no matter if use_normalized_loss is True or False. If you want to change this norm factor, you can set this. As a function, it takes (self=self, output=output, layer=layer) and returns a float scalar.
custom_inv_norm_factor (LayerBase|None) – inverse of custom_norm_factor. Here we allow to pass a layer. Here we also allow to pass any shape and it will automatically be reduced via sum. So you could simply pass target_seq_len directly here. Basically, for all reporting, it uses sum(loss) * sum(custom_inv_norm_factor).
scale (float) – additional scale factor for the loss
_check_output_before_softmax (bool|None)
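
To contrast L1Loss, MeanSquaredError and MeanL1Loss, here is a pure-Python sketch of the three quantities for a single frame. The function names are chosen for this example, and the reduction over frames and batch that the losses perform is left out.

```python
def l1(output, target):
    """Sum of absolute differences."""
    return sum(abs(t - o) for o, t in zip(output, target))


def mean_squared_error(output, target):
    """Mean of squared differences over the feature dim."""
    return sum((t - o) ** 2 for o, t in zip(output, target)) / len(output)


def mean_l1(output, target):
    """Like MSE, but with the absolute difference."""
    return l1(output, target) / len(output)


output = [1.0, 2.0, 4.0]
target = [1.0, 0.0, 5.0]
values = (l1(output, target), mean_squared_error(output, target), mean_l1(output, target))
```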

class_name: str = 'mean_l1'[source]¶

get_value()[source]¶

Return type:: tf.Tensor

class returnn.tf.layers.basic.ExternSprintLoss(sprint_opts, **kwargs)[source]¶

The loss is calculated by an extern Sprint instance.

Parameters:: sprint_opts (dict[str])

class_name: str = 'sprint'[source]¶

recurrent = True[source]¶

need_target = False[source]¶

get_value()[source]¶

Return type:: tf.Tensor

get_error()[source]¶

Return type:: tf.Tensor|None

class returnn.tf.layers.basic.FastBaumWelchLoss(sprint_opts, tdp_scale=1.0, **kwargs)[source]¶

The loss is calculated via fast_baum_welch(). The automata are created by an extern Sprint instance.

Parameters:: sprint_opts (dict[str])

class_name: str = 'fast_bw'[source]¶

recurrent = True[source]¶

need_target = False[source]¶

get_value()[source]¶

Return type:: tf.Tensor

get_error()[source]¶

Return type:: tf.Tensor|None

class returnn.tf.layers.basic.ViaLayerLoss(error_signal_layer=None, align_layer=None, loss_wrt_to_act_in=False, **kwargs)[source]¶

The loss error signal and loss value is defined as the output of another layer. That way, you can define any custom loss. This could e.g. be used together with the fast_bw layer.

This is a more custom variant of AsIsLoss, which simply takes the output of a layer as loss without redefining the error signal (gradient).

Parameters:

error_signal_layer (LayerBase)
align_layer (LayerBase)
loss_wrt_to_act_in (bool|str) – if True, we expect that the given output_with_activation is set, and the given error signal is w.r.t. the input of the specific activation function. A common example is the input to the softmax function, where the gradient is much more stable to define, e.g. y - z instead of y/z for cross entropy. If you specify a str, e.g. “softmax” or “log_softmax”, there is an additional check that the used activation function is really that one.
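
The remark about the softmax input can be checked numerically. The following is a plain NumPy sketch, not RETURNN code. It assumes y is the softmax output and z a one-hot target, and compares the error signal y - z (w.r.t. the softmax input) with a finite-difference estimate of the cross-entropy gradient. The logits are illustrative values.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def cross_entropy(x, z):
    """Cross entropy between softmax(x) and the target distribution z."""
    return -np.sum(z * np.log(softmax(x)))

x = np.array([2.0, -1.0, 0.5])  # input to the softmax (illustrative values)
z = np.array([0.0, 1.0, 0.0])   # one-hot target

y = softmax(x)
grad_wrt_input = y - z          # error signal w.r.t. the softmax input
grad_wrt_output = -z / y        # error signal w.r.t. the softmax output (ratio form)

# Central finite differences of the cross entropy w.r.t. each input component.
eps = 1e-6
eye = np.eye(len(x))
numeric = np.array([
    (cross_entropy(x + eps * eye[i], z) - cross_entropy(x - eps * eye[i], z)) / (2 * eps)
    for i in range(len(x))
])
```

The difference form stays within [-1, 1], while the ratio form grows without bound as the predicted probability of the target class approaches zero, which is the stability argument made above.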

class_name: str = 'via_layer'[source]¶

recurrent = True[source]¶

need_target = False[source]¶

classmethod transform_config_dict(d, network, get_layer)[source]¶

Parameters:

d (dict[str]) – will modify inplace, the loss_opts
network (returnn.tf.network.TFNetwork)
get_layer (((str) -> LayerBase)) – function to get or construct another layer

get_value()[source]¶

Return type:: tf.Tensor

get_error()[source]¶

Return type:: tf.Tensor|None

class returnn.tf.layers.basic.AsIsLoss(as_error=False, **kwargs)[source]¶

Use the output as-is as the loss.

Also see ViaLayerLoss, which additionally allows defining a custom error signal (gradient).

Parameters:: as_error (bool) – if True, use the output as error, otherwise (default) use the output as loss value. Error is purely for reporting, loss value is used for the optimizer as well (when scale != 0).
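
As a hedged illustration, a network config fragment using this loss might look as follows. The layer names ("hidden", "penalty") and the chosen layer options are hypothetical and not checked against a particular RETURNN version; only the loss name as_is and the options as_error and scale come from this page.

```python
# Sketch of a RETURNN-style network dict (plain Python, no RETURNN import needed).
network = {
    # Hypothetical hidden layer.
    "hidden": {"class": "linear", "activation": "tanh", "n_out": 8, "from": "data"},
    # Hypothetical layer whose output is taken as-is as a loss term.
    "penalty": {
        "class": "reduce", "mode": "mean", "axes": "F", "from": "hidden",
        "loss": "as_is",
        # as_error=False (default): the output is the loss value, used by the optimizer.
        "loss_opts": {"as_error": False, "scale": 0.1},
    },
}
```

Setting as_error to True would instead report the layer output as an error value only.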

class_name: str = 'as_is'[source]¶

need_target = False[source]¶

get_value()[source]¶

Return type:: tf.Tensor|None

get_error()[source]¶

Return type:: tf.Tensor|None

class returnn.tf.layers.basic.SearchScoreLoss(base_network, use_flatten_frames=True, use_normalized_loss=False, custom_norm_factor=None, custom_inv_norm_factor=None, scale=1.0, _check_output_before_softmax=None)[source]¶

Use the scores from SearchChoices.

Parameters:

base_network (returnn.tf.network.TFNetwork)
use_flatten_frames (bool) – will use returnn.tf.util.basic.flatten_with_seq_len_mask()
use_normalized_loss (bool) – the loss used in optimization will be normalized
custom_norm_factor (float|function|None) – The standard norm factor is 1/sum(target_seq_len) if the target has a time-axis, or 1/sum(output_seq_len) if there is no target and the output has a time-axis, or 1 otherwise. (See Loss.init() for details.) This is used for proper normalization of accumulated loss/error per epoch and also proper normalization per batch for reporting, no matter if use_normalized_loss is True or False. If you want to change this norm factor, you can set this. As a function, it takes (self=self, output=output, layer=layer) and returns a float scalar.
custom_inv_norm_factor (LayerBase|None) – inverse of custom_norm_factor. A layer can be passed here, with any shape; it is automatically reduced via sum, so you could simply pass target_seq_len directly. Basically, for all reporting, it uses sum(loss) / sum(custom_inv_norm_factor).
scale (float) – additional scale factor for the loss
_check_output_before_softmax (bool|None)

class_name: str = 'search_score'[source]¶

need_target = False[source]¶

reduce_to_batch(loss, normalize)[source]¶

Parameters:

loss (tf.Tensor) – (batch,)
normalize (bool) – reduce mean instead of reduce sum

Returns:

(batch,)

Return type:

tf.Tensor

get_value()[source]¶

Return type:: tf.Tensor

get_error()[source]¶

Return type:: None

class returnn.tf.layers.basic.SamplingBasedLoss(num_sampled=128, num_splits=1, sampler='log_uniform', nce_loss=False, use_full_softmax=False, remove_accidental_hits=None, sampler_args=None, nce_log_norm_term=0.0, **kwargs)[source]¶

Implements two sampling-based losses: sampled softmax (default) and noise contrastive estimation (NCE). See https://www.tensorflow.org/api_docs/python/tf/nn/sampled_softmax_loss and https://www.tensorflow.org/api_docs/python/tf/nn/nce_loss.

Must be used in an output linear layer with a weight matrix of shape (num_classes, dim). When using ‘log_uniform’ sampler (default), optimal performance is typically achieved with the vocabulary list sorted in decreasing order of frequency (https://www.tensorflow.org/api_docs/python/tf/random/log_uniform_candidate_sampler).
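
The advice about sorting the vocabulary follows from the shape of the log-uniform distribution. The sketch below evaluates the class probability given in the TensorFlow documentation for the log-uniform candidate sampler, P(class) = (log(class + 2) - log(class + 1)) / log(range_max + 1), in plain NumPy. The vocabulary size is a hypothetical value.

```python
import numpy as np

def log_uniform_prob(class_ids, range_max):
    """Sampling probability of each class id under the log-uniform (Zipfian) sampler."""
    class_ids = np.asarray(class_ids, dtype=np.float64)
    return (np.log(class_ids + 2.0) - np.log(class_ids + 1.0)) / np.log(range_max + 1.0)

num_classes = 10000  # hypothetical vocabulary size
probs = log_uniform_prob(np.arange(num_classes), num_classes)

# Share of the sampling mass that falls on the 100 lowest class ids.
head_mass = probs[:100].sum()
```

Low class ids are sampled far more often than high ones, so the sampler matches the data best when id 0 is the most frequent word and frequency decreases with the id.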

Parameters:

num_sampled (int) – Number of classes to be sampled. For sampled softmax, this is the number of classes to be used to estimate the sampled softmax. For noise contrastive estimation, this is the number of noise samples.
num_splits (int) – Number of different samples (each with ‘num_sampled’ classes) to be used per batch.
sampler (str) – Specify sampling distribution (“uniform”, “log_uniform”, “learned_unigram” or “fixed_unigram”).
nce_loss (bool) – If True, use noise contrastive estimation loss. Else (default), use the sampled softmax.
use_full_softmax (bool) – If True, compute the full softmax instead of sampling (can be used for evaluation).
remove_accidental_hits (bool|None) – If True, remove sampled classes that equal one of the target classes. If not specified (None), the value is determined based on the chosen objective. For sampled softmax this should be set to True; for NCE the default is False. Set this to True in case of NCE training and the objective is equal to sampled logistic loss.
sampler_args (dict[str]) – additional arguments for the candidate sampler. This is most relevant to the fixed_unigram sampler. See https://www.tensorflow.org/api_docs/python/tf/random/fixed_unigram_candidate_sampler for details.
nce_log_norm_term (float) – The logarithm of the constant normalization term for NCE.

class_name: str = 'sampling_loss'[source]¶

get_value()[source]¶

Return type:: tf.Tensor

class returnn.tf.layers.basic.TripletLoss(margin, multi_view_training=False, **kwargs)[source]¶

Triplet loss: loss = max(margin + d(x_a, x_s) - d(x_a, x_d), 0.0)

Triplet loss is used for metric learning in a siamese/triplet network. It should be used as a part of CopyLayer with 3 inputs corresponding to x_a, x_s and x_d in a loss.

Here we assume that x_a are anchor samples and that, at each position i in a minibatch, x_ai and x_si belong to the same class, while x_ai and x_di belong to different classes.

In this implementation the number of training examples is increased by extracting all possible same/different pairs within a minibatch.
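
The formula can be sketched in plain NumPy as follows. This is not the RETURNN implementation: the distance d is assumed to be the squared Euclidean distance, the embeddings are made-up values, and the extraction of all same/different pairs within a minibatch is not covered.

```python
import numpy as np

def triplet_loss(x_a, x_s, x_d, margin):
    """Per-example max(margin + d(x_a, x_s) - d(x_a, x_d), 0).

    d is the squared Euclidean distance here (an assumption for illustration).
    """
    d_same = np.sum((x_a - x_s) ** 2, axis=-1)
    d_diff = np.sum((x_a - x_d) ** 2, axis=-1)
    return np.maximum(margin + d_same - d_diff, 0.0)

# Two triplets with 2-dim embeddings (illustrative values).
x_a = np.array([[0.0, 0.0], [1.0, 1.0]])  # anchors
x_s = np.array([[0.1, 0.0], [1.0, 3.0]])  # same class as the anchor
x_d = np.array([[2.0, 0.0], [1.0, 2.0]])  # different class from the anchor

loss = triplet_loss(x_a, x_s, x_d, margin=1.0)
```

The first triplet already satisfies the margin and contributes zero loss. In the second triplet the same-class sample is farther from the anchor than the different-class sample, so the loss is positive.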

Parameters:

base_network (returnn.tf.network.TFNetwork)
use_flatten_frames (bool) – will use returnn.tf.util.basic.flatten_with_seq_len_mask()
use_normalized_loss (bool) – the loss used in optimization will be normalized
custom_norm_factor (float|function|None) – The standard norm factor is 1/sum(target_seq_len) if the target has a time-axis, or 1/sum(output_seq_len) if there is no target and the output has a time-axis, or 1 otherwise. (See Loss.init() for details.) This is used for proper normalization of accumulated loss/error per epoch and also proper normalization per batch for reporting, no matter if use_normalized_loss is True or False. If you want to change this norm factor, you can set this. As a function, it takes (self=self, output=output, layer=layer) and returns a float scalar.
custom_inv_norm_factor (LayerBase|None) – inverse of custom_norm_factor. A layer can be passed here, with any shape; it is automatically reduced via sum, so you could simply pass target_seq_len directly. Basically, for all reporting, it uses sum(loss) / sum(custom_inv_norm_factor).
scale (float) – additional scale factor for the loss
_check_output_before_softmax (bool|None)

class_name: str = 'triplet_loss'[source]¶

init(output, output_with_activation=None, target=None, **kwargs)[source]¶

Parameters:

output (Data) – generated output
output_with_activation (OutputWithActivation|None)
target (Data) – reference target from dataset

get_value()[source]¶

Return type:: tf.Tensor

get_error()[source]¶

Error is not defined for triplet_loss.

Returns:: None

returnn.tf.layers.basic.get_loss_class(loss)[source]¶

Parameters:: loss (str) – loss type such as “ce”
Return type:: (() -> Loss) | type[Loss] | Loss

returnn.tf.layers.basic.auto_register_layer_classes(vars_values)[source]¶

Example usage:

from returnn.tf.layers.basic import auto_register_layer_classes
auto_register_layer_classes('extern_private/your_stuff/CoolThingy.py')

Parameters:: vars_values (list|types.ModuleType|str) – e.g. use list(globals().values()). A str is treated as a module filename
Returns:: nothing

returnn.tf.layers.basic.register_layer_class(layer_class)[source]¶

Registers a layer class such that it can be used in network construction.

Parameters:: layer_class (type[LayerBase])
Returns:: nothing

returnn.tf.layers.basic.get_layer_class(name)[source]¶

Parameters:: name (str) – matches layer_class
Return type:: (() -> LayerBase) | type[LayerBase] | LayerBase

returnn.tf.layers.basic.get_layer_class_name_list()[source]¶

Return type:: list[str]
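
The relationship between the registration and lookup functions above can be pictured as a name-to-class registry keyed by the layer_class attribute. The following is a standalone toy sketch of that pattern, not RETURNN's actual implementation; all names prefixed with toy_ or Toy, as well as CoolThingyLayer, are hypothetical.

```python
# Toy registry mapping a layer_class name to the class itself (hypothetical).
_toy_layer_classes = {}

class ToyLayerBase:
    """Stand-in for a layer base class; layer_class is the name used in configs."""
    layer_class = None

def toy_register_layer_class(cls):
    """Register cls under its layer_class name."""
    if not (isinstance(cls, type) and issubclass(cls, ToyLayerBase)):
        raise TypeError("expected a ToyLayerBase subclass")
    if not cls.layer_class:
        raise ValueError("layer class needs a layer_class name")
    _toy_layer_classes[cls.layer_class] = cls

def toy_get_layer_class(name):
    """Look up a registered class by its layer_class name."""
    if name not in _toy_layer_classes:
        raise KeyError("unknown layer class %r" % name)
    return _toy_layer_classes[name]

def toy_get_layer_class_name_list():
    """Return all registered names in sorted order."""
    return sorted(_toy_layer_classes)

class CoolThingyLayer(ToyLayerBase):
    layer_class = "cool_thingy"

toy_register_layer_class(CoolThingyLayer)
```

After registration, a config entry with "class": "cool_thingy" could be resolved to CoolThingyLayer through the lookup function.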
