returnn.tf.layers.basic

Many canonical basic layers.

class returnn.tf.layers.basic.SourceLayer(network, data_key=None, sources=(), **kwargs)[source]

This gives access to some entry from network.extern_data (ExternData).

layer_class: Optional[str] = 'source'[source]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace

  • network (returnn.tf.network.TFNetwork)

  • get_layer (((str) -> LayerBase)) – function to get or construct another layer

classmethod get_out_data_from_opts(network, data_key=None, **kwargs)[source]
Return type:

Data

returnn.tf.layers.basic.concat_sources(src_layers, out_dim=None, allow_broadcast_all_sources=<class 'returnn.util.basic.NotSpecified'>)[source]
Returns:

data with placeholders set

Return type:

Data

returnn.tf.layers.basic.get_concat_sources_data_template(src_layers, out_dim=None, allow_broadcast_all_sources=<class 'returnn.util.basic.NotSpecified'>, name=None)[source]

This just creates a template Data instance, without creating any real TF tensors. concat_sources() (and related) are the equivalent functions which would create a Data together with the tensor.

Parameters:
  • src_layers (Sequence[LayerBase])

  • out_dim (Dim|None)

  • allow_broadcast_all_sources (bool|NotSpecified)

  • name (str|None) – name of the Data

Returns:

data with no placeholders set. It is always a copy or a new instance, so it is safe to manipulate

Return type:

Data

returnn.tf.layers.basic.concat_sources_with_opt_dropout(src_layers, out_dim=None, dropout=0, dropout_axis=None, dropout_noise_shape=None, dropout_on_forward=False, allow_broadcast_all_sources=<class 'returnn.util.basic.NotSpecified'>)[source]

Concatenates in the feature dim (see concat_sources()), and then optionally applies dropout.

Parameters:
  • src_layers (list[LayerBase])

  • out_dim (Dim|None)

  • dropout (float) – dropout rate that will be applied if train_flag is set or dropout_on_forward is enabled

  • dropout_axis (Dim|str|list[Dim|str]|None)

  • dropout_noise_shape (tuple|list|dict[Dim|str|list[Dim|str]|tuple[Dim|str],int|str|None]|None) – provide 1 for broadcasting or None otherwise for each axis. The default “None” will broadcast across all dynamic axes including the batch axis. Use {“*”: None} to disable broadcasting for all axes.

  • dropout_on_forward (bool) – apply dropout also during inference

  • allow_broadcast_all_sources (bool|NotSpecified)

Returns:

data with placeholders set

Return type:

Data
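As a rough illustration of the behaviour described above, the following pure-Python sketch concatenates per-frame feature vectors and applies dropout only when the train flag is set or dropout_on_forward is enabled. The function name, the explicit train_flag argument and the rescaling of kept values by 1/(1 - dropout) are assumptions made for this sketch, not the RETURNN implementation.

```python
import random


def concat_with_opt_dropout(sources, dropout=0.0, train_flag=False,
                            dropout_on_forward=False, rng=None):
    """Concatenate feature vectors of one frame, then optionally apply dropout.

    Hypothetical stand-in: `sources` is a list of plain lists of floats.
    """
    out = [v for src in sources for v in src]  # concat in the feature dim
    if dropout and (train_flag or dropout_on_forward):
        rng = rng or random.Random()
        keep = 1.0 - dropout
        # Assumption: inverted dropout, kept values are rescaled by 1/keep.
        out = [v / keep if rng.random() < keep else 0.0 for v in out]
    return out


# Not training and dropout_on_forward=False: the input passes through unchanged.
print(concat_with_opt_dropout([[1.0, 2.0], [3.0]], dropout=0.5))  # [1.0, 2.0, 3.0]
```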

class returnn.tf.layers.basic.CopyLayer(in_dim=None, out_dim=None, extra_deps=(), **kwargs)[source]

This layer does nothing, it copies its input. This is not even a tf.identity. It refers to the same TF tensor. If multiple sources are provided, they are concatenated in the feature-dim.

Parameters:
  • in_dim (Dim|None) – just for checking. If provided, it also sets the feature_dim to this.

  • out_dim (Dim|None) – alternative to in_dim. see in_dim doc.

  • extra_deps (list[LayerBase]) – Just add as an additional dependency, without really using it. This can have an effect though on the search beam, via SelectSearchSourcesLayer. We only have this here for the CopyLayer because the get_out_data_from_opts() must know about it and define the right beam. Also see the option collocate_with, which is different in that it does not add a dependency. Note that this will not be real TF control dependencies, but it simply sets the dependency on the layer. If you want to have a real TF control dependency, use IdentityLayer.

layer_class: Optional[str] = 'copy'[source]
get_dep_layers()[source]
Return type:

list[LayerBase]

classmethod get_out_data_from_opts(name, sources=(), extra_deps=(), out_type=None, in_dim=None, out_dim=None, n_out=<class 'returnn.util.basic.NotSpecified'>, out_shape=None, **kwargs)[source]
Parameters:
  • name (str)

  • sources (list[LayerBase])

  • extra_deps (list[LayerBase])

  • out_type (dict[str]|None)

  • in_dim (Dim|None)

  • out_dim (Dim|None)

  • n_out (int|None|NotSpecified)

  • out_shape (set[Dim|returnn.tf.util.data._MarkedDim]|tuple|list|None)

Return type:

Data

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace

  • network (returnn.tf.network.TFNetwork)

  • get_layer (((str) -> LayerBase)) – function to get or construct another layer
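A minimal net-dict fragment showing how a CopyLayer with several sources might be configured. The layer names ("fwd", "bwd", "both") and sizes are hypothetical; "from" is the net-dict key that lists the source layers.

```python
# Hypothetical layer names; "from" lists the source layers of each layer.
network = {
    "fwd": {"class": "linear", "activation": "tanh", "n_out": 4, "from": "data"},
    "bwd": {"class": "linear", "activation": "tanh", "n_out": 4, "from": "data"},
    # Two sources: "copy" concatenates them in the feature dim.
    "both": {"class": "copy", "from": ["fwd", "bwd"]},
}

# Feature dim expected for "both": the sum of the source feature dims.
expected_dim = sum(network[src]["n_out"] for src in network["both"]["from"])
print(expected_dim)  # 8
```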

class returnn.tf.layers.basic.IdentityLayer(sources: List[LayerBase], control_dependencies: Sequence[LayerBase] | None = None, **kwargs)[source]

Wraps tf.identity with potential control dependencies.

The difference to CopyLayer is that this creates a new TF op (tf.identity), which allows for potential control dependencies. This is the whole purpose of this layer.

Usually the arguments, when specified in the network dict, are going through transform_config_dict(), before they are passed to here. See TFNetwork.construct_from_dict().

Parameters:
  • name (str)

  • network (returnn.tf.network.TFNetwork)

  • output (Data) – Set a specific output instead of using get_out_data_from_opts()

  • n_out (NotSpecified|None|int) – output dim

  • out_dim (returnn.tensor.Dim|None) – output feature dim tag

  • out_type (dict[str]) – kwargs for Data class. more explicit than n_out.

  • out_shape (set[returnn.tensor.Dim|returnn.tf.util.data._MarkedDim]|tuple|list|None) – verifies the output shape (dim tags). See Data.verify_out_shape().

  • sources (list[LayerBase]) – via self.transform_config_dict()

  • in_dim (returnn.tensor.Dim|None) – input feature dim tag

  • target (str|list[str]|None) – if some loss is set, this is the target data-key, i.e. network.extern_data.get_data(target). alternatively, this also can be a layer name.

  • _target_layers (dict[str,LayerBase]|None) – if target.startswith(“layer:”), then this is target -> layer

  • size_target (str|None) – like target but this is only used to set our output size in case of training

  • loss (Loss|None) – via transform_config_dict(). Every layer can have one loss (of type Loss), or no loss. In the net dict, it is specified as a string. In TFNetwork, all losses from all layers will be collected. That is what TFUpdater.Updater will use for training.

  • reuse_params (ReuseParams|None) – if given, will optionally reuse the params. see self.var_creation_scope(). See also the name_scope option as an alternative.

  • name_scope (str|None) – If set, uses this custom (relative) name scope. If it starts with a “/”, it will be the absolute name scope. It should not end with a “/”. It can be empty, in which case it will not consume a new name scope. This can also be used for parameter sharing. The default is the layer name in most cases, but this logic is in get_absolute_name_scope_prefix() and TFNetwork.layer_creation_scope().

  • param_device (str|None) – e.g. “CPU”, etc. any valid name for tf.device. see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/util/device_name_utils.h

  • L2 (float|None) – for constraints

  • darc1 (float|None) – for constraints. see Generalization in Deep Learning, https://arxiv.org/abs/1710.05468

  • spatial_smoothing (float|None) – see returnn.tf.util.basic.spatial_smoothing_energy()

  • param_variational_noise (float|None) – adds variational noise to the params during training

  • param_dropout (float|None) – dropout on params (weight dropout) during training

  • param_dropout_min_ndim (int|None) – if param dropout is enabled, only use it for params whose ndim >= this. E.g. it might make sense to disable it for bias params or scalars, so set param_dropout_min_ndim=2.

  • updater_opts (dict[str]|None) – accepts similar opts as TFUpdater, e.g. “optimizer”, “learning_rate”, …

  • is_output_layer (bool|None) – triggers the construction of this layer in the root net. Inside a RecLayer, it triggers the explicit accumulation of all frames. Also see the need_last option.

  • only_on_eval (bool) – if True, this layer will only be calculated in eval

  • only_on_search (bool) – if True, this layer will only be calculated when search is done

  • copy_output_loss_from_source_idx (int|None) – if set, will copy output_loss from this source

  • batch_norm (bool|dict) – see self.batch_norm()

  • initial_output (str|float) – used for recurrent layer, see self.get_rec_initial_output()

  • state – explicitly defines the rec state. initial_state would define the initial state (in the first frame)

  • need_last (bool) – Inside RecLayer, make sure that we can access the last frame. Similar to is_output_layer, but this is specifically about the last frame, i.e. it does not trigger accumulation.

  • rec_previous_layer (LayerBase|None) – via the recurrent layer, layer (template) which represents the past of us. You would not explicitly set this in a config. It is set automatically and internally via RecLayer.

  • encapsulate (bool) –

    mostly relevant for SubnetworkLayer and similar: If True, all sub layers will be created,

    and covered in functions like get_rec_initial_extra_outputs(), and the logic in cls_get_sub_network() will not be used.

    If False, the logic in cls_get_sub_network() will be used.

  • collocate_with (list[str]|None) – in the rec layer, collocate with the specified other layers

  • trainable (bool) – whether the parameters of this layer will be trained. Default is True. However, if this is inside a subnetwork, all the parent layers must be set to trainable, otherwise the parameters will not be trainable.

  • custom_param_importer (str|callable|None) – used by set_param_values_by_dict()

  • register_as_extern_data (str|None) – registers output in network.extern_data

  • control_dependencies_on_output (None|((LayerBase)->list[tf.Operation])) – This is mostly to perform some checks after the layer output has been computed, before the layer output is used anywhere else. There is also the IdentityLayer with the option control_dependencies.

  • debug_print_layer_output (None|bool|dict[str]) – same as global config option but per layer

  • _name (str) – just for internal construction, should be the same as name

  • _network (returnn.tf.network.TFNetwork) – just for internal construction, should be the same as network

  • _src_common_search_choices (None|SearchChoices) – set via SearchChoices.translate_to_common_search_beam()

layer_class: Optional[str] = 'identity'[source]
get_dep_layers() List[LayerBase][source]

deps

classmethod get_out_data_from_opts(name: str, sources: List[LayerBase], **kwargs)[source]

out

classmethod transform_config_dict(d, network, get_layer)[source]

transform

class returnn.tf.layers.basic.ConcatLayer(sources, allow_broadcast=False, out_dim=None, **kwargs)[source]

Concatenates the inputs in specified axes. This generalizes CopyLayer which concatenates in the feature dim.

Parameters:
  • sources (list[(LayerBase,str|Dim)])

  • allow_broadcast (bool)

  • out_dim (Dim|None)

layer_class: Optional[str] = 'concat'[source]
classmethod get_out_data_from_opts(name, sources, out_dim=None, **kwargs)[source]
Return type:

Data

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace

  • network (returnn.tf.network.TFNetwork)

  • get_layer (((str) -> LayerBase)) – function to get or construct another layer

class returnn.tf.layers.basic.DropoutLayer(in_dim=None, out_dim=None, extra_deps=(), **kwargs)[source]

Just the same as CopyLayer, because that one already supports dropout.

Parameters:
  • in_dim (Dim|None) – just for checking. If provided, it also sets the feature_dim to this.

  • out_dim (Dim|None) – alternative to in_dim. see in_dim doc.

  • extra_deps (list[LayerBase]) – Just add as an additional dependency, without really using it. This can have an effect though on the search beam, via SelectSearchSourcesLayer. We only have this here for the CopyLayer because the get_out_data_from_opts() must know about it and define the right beam. Also see the option collocate_with, which is different in that it does not add a dependency. Note that this will not be real TF control dependencies, but it simply sets the dependency on the layer. If you want to have a real TF control dependency, use IdentityLayer.

layer_class: Optional[str] = 'dropout'[source]
class returnn.tf.layers.basic.ScaledGradientLayer(scale, shift=None, scale_shift_by_sum_over_axis=None, clip_max_axis=None, **kwargs)[source]

Just tf.identity() in the forward pass. Scales the gradient by some factor in backprop. Can be used as gradient reversal layer (with negative factor). Uses returnn.tf.util.basic.scaled_gradient(), or tf.stop_gradient()

Parameters:
  • scale (float|LayerBase) – if 0. and no shift, will use tf.stop_gradient

  • shift (float|LayerBase|None)

  • scale_shift_by_sum_over_axis (Dim|str|None) – if given, calculates the sum over this axis (absolute values) and multiplies the shift value by this sum.

  • clip_max_axis (Dim|str|None) – if given, clips the gradient to the max value in this axis before the transformation, for all values in the axis

layer_class: Optional[str] = 'scaled_grad'[source]
get_dep_layers()[source]
Return type:

list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace

  • network (returnn.tf.network.TFNetwork)

  • get_layer (((str) -> LayerBase)) – function to get or construct another layer
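The forward/backward behaviour of ScaledGradientLayer can be sketched without TF as follows. The two helper functions and the layer names in the net-dict fragment are hypothetical, and the sketch ignores shift, scale_shift_by_sum_over_axis and clip_max_axis.

```python
def scaled_grad_forward(x):
    """Forward pass: the values pass through unchanged (identity)."""
    return list(x)


def scaled_grad_backward(grad_out, scale):
    """Backward pass: the incoming gradient is multiplied by `scale`."""
    return [scale * g for g in grad_out]


# A negative scale reverses the gradient (gradient reversal layer).
print(scaled_grad_forward([0.5, -1.0]))              # [0.5, -1.0]
print(scaled_grad_backward([1.0, -2.0], scale=-1.0))  # [-1.0, 2.0]

# Corresponding net-dict entry (hypothetical layer names).
network = {"grad_rev": {"class": "scaled_grad", "from": "encoder", "scale": -1.0}}
```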

class returnn.tf.layers.basic.SelectSearchSourcesLayer(search_choices_layer, sources, **kwargs)[source]

Selects the corresponding search beams from the source, given current search choices (determined by a layer). Like InternalLayer, only for internal purpose at the moment.

classmethod select_if_needed(layer, search_choices)[source]
Return type:

LayerBase

get_dep_layers()[source]
Return type:

list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]
classmethod get_out_data_from_opts(name, sources, search_choices, **kwargs)[source]
Return type:

Data

class returnn.tf.layers.basic.ActivationLayer(activation, opts=None, **kwargs)[source]

This layer just applies an activation function. See returnn.tf.util.basic.get_activation_function() about supported functions. Also see EvalLayer and CombineLayer for similar layers.

Parameters:
  • activation (str) – e.g. “relu”, “tanh”, etc

  • opts (dict[str]|None) – for activation function, e.g. eps for safe_log

layer_class: Optional[str] = 'activation'[source]
classmethod get_out_data_from_opts(activation, **kwargs)[source]
Parameters:

activation (str)

Return type:

Data

class returnn.tf.layers.basic.BatchNormLayer(in_dim=None, use_shift=<class 'returnn.util.basic.NotSpecified'>, use_std=<class 'returnn.util.basic.NotSpecified'>, use_sample=<class 'returnn.util.basic.NotSpecified'>, force_sample=<class 'returnn.util.basic.NotSpecified'>, momentum=<class 'returnn.util.basic.NotSpecified'>, epsilon=<class 'returnn.util.basic.NotSpecified'>, update_sample_only_in_training=<class 'returnn.util.basic.NotSpecified'>, delay_sample_update=<class 'returnn.util.basic.NotSpecified'>, param_version=<class 'returnn.util.basic.NotSpecified'>, gamma_init=<class 'returnn.util.basic.NotSpecified'>, beta_init=<class 'returnn.util.basic.NotSpecified'>, masked_time=<class 'returnn.util.basic.NotSpecified'>, **kwargs)[source]

Implements batch-normalization (https://arxiv.org/abs/1502.03167) as a separate layer.

Also see NormLayer.

Parameters:
  • in_dim (returnn.tensor.Dim|None)

  • use_shift (bool)

  • use_std (bool)

  • use_sample (float) – defaults to 0.0 which is used in training

  • force_sample (bool) – even in eval, use the use_sample factor

  • momentum (float) – for the running average of sample_mean and sample_std

  • update_sample_only_in_training (bool)

  • delay_sample_update (bool)

  • param_version (int) – 0 or 1 or 2

  • epsilon (float)

  • gamma_init (str|float) – see returnn.tf.util.basic.get_initializer(), for the scale

  • beta_init (str|float) – see returnn.tf.util.basic.get_initializer(), for the mean

  • masked_time (bool) – flatten and mask input tensor

The default settings for these variables are set in the function batch_norm() of LayerBase. If you do not want to change them you can leave them undefined here. With our default settings:

  • In training: use_sample=0, i.e. not using running average, using current batch mean/var.

  • Not in training (e.g. eval): use_sample=1, i.e. using running average, not using current batch mean/var.

  • The running average includes the statistics of the current batch.

  • The running average is also updated when not training.
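The role of use_sample in the defaults listed above can be illustrated with a small sketch. It assumes that use_sample acts as a linear interpolation factor between the current batch statistics and the running average; the float type and the 0/1 settings above suggest this, but the text does not state it. The function name is hypothetical.

```python
def batch_norm_stats(batch_mean, batch_var, running_mean, running_var, use_sample):
    """Mix current-batch statistics with the running average.

    Assumption: use_sample=0 selects the batch statistics,
    use_sample=1 selects the running average, values in between interpolate.
    """
    mean = (1.0 - use_sample) * batch_mean + use_sample * running_mean
    var = (1.0 - use_sample) * batch_var + use_sample * running_var
    return mean, var


print(batch_norm_stats(2.0, 1.0, 5.0, 4.0, use_sample=0.0))  # (2.0, 1.0), training default
print(batch_norm_stats(2.0, 1.0, 5.0, 4.0, use_sample=1.0))  # (5.0, 4.0), eval default
```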

layer_class: Optional[str] = 'batch_norm'[source]
class returnn.tf.layers.basic.LayerNormLayer(in_dim=None, out_dim=None, epsilon=1e-06, **kwargs)[source]

Applies layer-normalization.

Note that we just normalize over the feature-dim axis here. This is consistent with the default behavior of tf.keras.layers.LayerNormalization and also how it is commonly used in many models, including Transformer.

However, there are cases where it would be common to normalize over all axes except batch-dim, or all axes except batch and time. For a more generic variant, see NormLayer.

Parameters:
  • in_dim (Dim|None) – axis to normalize over. feature-dim by default

  • out_dim (Dim|None) – just the same as in_dim

  • epsilon (float)

layer_class: Optional[str] = 'layer_norm'[source]
classmethod get_out_data_from_opts(sources, name, **kwargs)[source]
Return type:

Data
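A minimal sketch of normalization over the feature axis for a single frame, as described for LayerNormLayer. The trainable scale and bias are omitted, and adding epsilon to the variance inside the square root is an assumption of this sketch.

```python
import math


def layer_norm(features, epsilon=1e-6):
    """Normalize one feature vector to zero mean and (almost) unit variance."""
    mean = sum(features) / len(features)
    var = sum((v - mean) ** 2 for v in features) / len(features)
    # Assumption: epsilon is added to the variance before the square root.
    return [(v - mean) / math.sqrt(var + epsilon) for v in features]


out = layer_norm([1.0, 2.0, 3.0])
print(out)  # symmetric around 0, middle element is 0
```

Normalizing over more axes (as NormLayer allows) would compute the mean and variance over all entries of those axes instead of one feature vector.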

class returnn.tf.layers.basic.NormLayer(axis=<class 'returnn.util.basic.NotSpecified'>, axes=<class 'returnn.util.basic.NotSpecified'>, param_shape=<class 'returnn.util.basic.NotSpecified'>, scale=True, bias=True, epsilon=1e-06, **kwargs)[source]

Normalize over specified axes, e.g. time and/or feature axis.

Note: For calculating a norm, see MathNormLayer instead.

In case of just feature (axes="F"), this corresponds to layer normalization (see LayerNormLayer). In case of time and feature (axes="TF") for a 3D input, or more general all except batch (axes="except_batch"), this corresponds to group normalization with G=1, or non-standard layer normalization. (The definition of layer-normalization is not clear on what axes should be normalized over. In many other frameworks, the default axis is just the last axis, which is usually the feature axis. However, in certain implementations and models, it is also common to normalize over all axes except batch.)

The statistics are calculated just on the input. There are no running statistics (in contrast to batch normalization, see BatchNormLayer).

For some discussion on the definition of layer-norm vs group-norm, also see here and here.

Parameters:
  • axis (Dim|str|list[Dim|str]) – axis or axes over which the mean and variance are computed, e.g. “F” or “TF”

  • axes (Dim|str|list[Dim|str]) – axis or axes over which the mean and variance are computed, e.g. “F” or “TF”

  • param_shape (Dim|str|list[Dim|str]|tuple[Dim|str]) – shape of the scale and bias parameters. You can also refer to (static) axes of the input, such as the feature-dim. This is also the default, i.e. a param-shape of [F], independent of the axes to normalize over.

  • scale (bool) – add trainable scale parameters

  • bias (bool) – add trainable bias parameters

  • epsilon (float) – epsilon for numerical stability

layer_class: Optional[str] = 'norm'[source]
classmethod get_out_data_from_opts(sources, name, **kwargs)[source]
Return type:

Data

class returnn.tf.layers.basic.MathNormLayer(p, axis=<class 'returnn.util.basic.NotSpecified'>, axes=<class 'returnn.util.basic.NotSpecified'>, keep_dims=False, **kwargs)[source]

Calculates sum(abs(x) ** p) ** (1./p).

Parameters:
  • p (int|float)

  • axis (Dim|str|list[Dim|str])

  • axes (Dim|str|list[Dim|str])

  • keep_dims (bool)

layer_class: Optional[str] = 'math_norm'[source]
classmethod get_out_data_from_opts(name, sources, axis=<class 'returnn.util.basic.NotSpecified'>, axes=<class 'returnn.util.basic.NotSpecified'>, keep_dims=False, **kwargs)[source]
Parameters:
  • name (str)

  • sources (list[LayerBase])

  • axis (Dim|str|list[Dim|str])

  • axes (Dim|str|list[Dim|str])

  • keep_dims (bool)

Return type:

Data
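The formula sum(abs(x) ** p) ** (1./p) computed by MathNormLayer, written out for a single axis in plain Python. The function name is hypothetical.

```python
def math_norm(values, p):
    """p-norm over one axis: sum(abs(x) ** p) ** (1 / p)."""
    return sum(abs(v) ** p for v in values) ** (1.0 / p)


print(math_norm([3.0, -4.0], p=1))  # 7.0 (sum of absolute values)
print(math_norm([3.0, -4.0], p=2))  # Euclidean norm, 5 up to float rounding
```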

class returnn.tf.layers.basic.SliceLayer(axis, slice_start=None, slice_end=None, slice_step=None, out_dim=None, **kwargs)[source]

Slicing on the input, i.e. x[start:end:step] in some axis. See also SliceNdLayer, for variable start. See also GatherLayer, for one single position.

Note that __getitem__ on a TF tensor (or also Numpy ND array) is more generic, and supports slices in multiple axes, as well as adding new dimensions, etc. It even accepts boolean values, in which case it applies a boolean mask. See TF _slice_helper (== tf.Tensor.__getitem__) for a generic implementation, which calls tf.strided_slice. If we ever need such more generic support, we might consider adding a new layer, like GenericSliceLayer, which gets a splice_spec, just like _slice_helper (argument to __getitem__). But any such slice can already be constructed with multiple individual layers, which perform individual slices (per axis).

We just support slicing in a single axis here, with optional striding (slice_step).

Parameters:
  • axis (Dim|str)

  • axis_kind (str|None) – “T” for time, “B” for batch, “F” for feature

  • slice_start (int|None)

  • slice_end (int|None)

  • slice_step (int|None)

  • out_dim (Dim|None)

layer_class: Optional[str] = 'slice'[source]
classmethod get_out_data_from_opts(name, axis, sources=(), slice_start=None, slice_end=None, slice_step=None, out_dim=None, **kwargs)[source]
Parameters:
  • name (str)

  • axis (Dim|str)

  • sources (list[LayerBase])

  • slice_start (int|None)

  • slice_end (int|None)

  • slice_step (int|None)

  • out_dim (Dim|None)

Return type:

Data
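The SliceLayer options map directly onto Python slice semantics, x[start:end:step] in one axis. The layer options below form a hypothetical net-dict entry, applied here to a plain list standing in for the time axis.

```python
# Hypothetical net-dict entry: every second frame, starting at frame 1.
layer_opts = {"class": "slice", "axis": "T",
              "slice_start": 1, "slice_end": None, "slice_step": 2}

frames = [0, 1, 2, 3, 4, 5, 6, 7]  # stand-in for the time axis
sl = slice(layer_opts["slice_start"], layer_opts["slice_end"], layer_opts["slice_step"])
print(frames[sl])  # [1, 3, 5, 7]
```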

class returnn.tf.layers.basic.SliceNdLayer(size, start=None, min_size=None, axis='T', out_spatial_dim=None, **kwargs)[source]

This takes out a slice-range from the time axis, e.g. x[start:start + size]. If the input is of shape (B,T,F) and start is of shape (B,), then the output will be of shape (B,size,F). If the input is of shape (B,T,F) and start is of shape (B,T), then the output will be of shape (B,T,size,F). This layer allows a different start slice point for each batch, in contrast to SliceLayer, and the start is variable. See also GatherNdLayer. PrefixInTimeLayer can recover the original shape (by zero-padding).

Parameters:
  • start (int|LayerBase|None) – (B,…)

  • size (int|LayerBase|Dim|None) – We assume that this is >=0. If this might not be the case, use min_size=0. If None, it uses the max possible size, and it becomes a dynamic axis.

  • min_size (int|None) – if size is None, but we want to have a min-size

  • axis (Dim|str)

  • out_spatial_dim (Dim|None)

layer_class: Optional[str] = 'slice_nd'[source]
recurrent = True[source]
get_dep_layers()[source]
Return type:

list[LayerBase]

classmethod get_out_data_from_opts(name, sources=(), start=None, size=None, axis='T', out_spatial_dim=None, **kwargs)[source]
Return type:

Data
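The (B,T,F) input with (B,) start case of SliceNdLayer can be sketched with nested lists. The function name is hypothetical, and the sketch assumes that every slice stays inside the sequence (no padding or sequence-length handling).

```python
def slice_nd(x, start, size):
    """x: (B, T, F) nested lists, start: (B,) ints -> (B, size, F).

    Each batch entry gets its own start position: x[b][start[b]:start[b] + size].
    """
    return [seq[s:s + size] for seq, s in zip(x, start)]


x = [[[0], [1], [2], [3]],
     [[10], [11], [12], [13]]]
print(slice_nd(x, start=[0, 2], size=2))  # [[[0], [1]], [[12], [13]]]
```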

classmethod transform_config_dict(d, network, get_layer)[source]
class returnn.tf.layers.basic.GatherLayer(position: LayerBase | int, axis: Dim | str, clip_to_valid: bool = False, **kwargs)[source]

Gathers slices on a specified axis from the input layer using indices from a position layer. If the input is a layer of the shape [B,D,F1], and position of shape [B,F2], this will yield output of the shape [B,F2,F1] where

output[b,f2,f1] = input[b,position[b,f2],f1]

(if D is the axis to gather from). In general, all shared axes of the input and the positions will be considered as batch-axes.

The position argument can also be an int. In this case, this simply gives input[position] on the specified axis.

It’s basically a wrapper around tf.gather. It provides the same functionality as the deprecated GatherNdLayer, but is more generic. See also GatherNdLayer.

Parameters:
  • position – indices used to select the slices of the input from. If another layer, must be of type int32 or int64. Can also specify a constant int.

  • axis – The axis from which the slices are gathered, i.e. the axis that position indexes into

  • clip_to_valid – if True, the indices will be clipped to the valid range of the input, also taking seq lengths into account.

layer_class: Optional[str] = 'gather'[source]
get_dep_layers()[source]
Return type:

list[LayerBase]

classmethod get_out_data_from_opts(name, sources, position, axis, **kwargs)[source]
Return type:

Data
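The indexing rule output[b,f2,f1] = input[b,position[b,f2],f1] of GatherLayer, written out for nested lists with B as the only shared (batch) axis. The function name is hypothetical.

```python
def gather(inp, position):
    """inp: (B, D, F1), position: (B, F2) -> out: (B, F2, F1).

    out[b][f2][f1] == inp[b][position[b][f2]][f1]
    """
    return [[list(inp_b[p]) for p in pos_b] for inp_b, pos_b in zip(inp, position)]


inp = [[[1, 2], [3, 4], [5, 6]]]  # B=1, D=3, F1=2
position = [[2, 0]]               # B=1, F2=2
print(gather(inp, position))      # [[[5, 6], [1, 2]]]
```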

classmethod transform_config_dict(d, network, get_layer)[source]
class returnn.tf.layers.basic.GatherNdLayer(position, **kwargs)[source]

Warning: This layer is deprecated, use the more general GatherLayer instead. GatherLayer should be equivalent, but is more general (supports multiple batch dimensions, can specify gather axis) and its name is less misleading.

This takes out a position from some axis, e.g. x[pos]. This layer allows a different position for each batch. It’s basically a wrapper around tf.gather (the name of this layer is misleading). See also GatherLayer instead, which will replace this layer in the future. See also SliceNdLayer. See also ScatterNdLayer, which is the inverse operation.

Parameters:

position (LayerBase) – indices into first axis (excluding batch) of the input

layer_class: Optional[str] = 'gather_nd'[source]
get_dep_layers()[source]
Return type:

list[LayerBase]

classmethod get_out_data_from_opts(name, sources, position, **kwargs)[source]
Return type:

Data

classmethod transform_config_dict(d, network, get_layer)[source]
class returnn.tf.layers.basic.ScatterNdLayer(position, position_axis, output_dim_via_time_from=None, out_spatial_dim=None, filter_invalid_indices=False, **kwargs)[source]

The inverse of GatherNdLayer. Mostly a wrapper for tf.scatter_nd.

Note that “nd” is maybe a bit misleading. While we operate on N-D tensors, the indices (position) are into a single new dimension.

The input to the layer are the updates, the indices are via the position argument. The indices are into the newly constructed output dimension. The output shape is constructed via the common shape of the input, the position, and the unique common axis (if not unique, we would need to introduce an option to specify it) is replaced by the given output dimension (currently via output_dim_via_time_from).

Examples:

position (indices): (B,eTs)
input (updates): (eTs,D) or (B,eTs,D) -> expanded to (B,eTs,D)
output shape: (B,eT,D)

position (indices): (B,dT,eTs)
input (updates): (eTs,D) -> expanded to (B,dT,eTs,D)
output shape: (B,dT,eT,D)

position (indices): (dT,eTs)
input (updates): (eTs,D) -> expanded to (dT,eTs,D)
output shape: (dT,eT,D)

position (indices): (dT,eTs)
input (updates): (B,eTs,D) -> expanded to (dT,eTs,B,D)
output shape: (dT,eT,B,D)

In all these examples, output_dim_via_time_from is (B,eT,F), and eTs gets replaced by eT.

Parameters:
  • position (LayerBase) – indices into first axis (excluding batch) of the output

  • position_axis (Dim|str) – axis in position to replace by the output-dim

  • output_dim_via_time_from (LayerBase|None) – use the time-dim from this layer as the output-dim

  • out_spatial_dim (Dim|None)

  • filter_invalid_indices (bool) – allow for indices <0 or >= output_dim, which will be discarded in the output

layer_class: Optional[str] = 'scatter_nd'[source]
get_dep_layers()[source]
Return type:

list[LayerBase]

classmethod get_out_data_from_opts(name, sources, position, position_axis, output_dim_via_time_from=None, out_spatial_dim=None, **kwargs)[source]
Parameters:
  • name (str)

  • sources (list[LayerBase])

  • position (LayerBase)

  • position_axis (Dim|str) – axis in position to replace by the output-dim

  • output_dim_via_time_from (LayerBase|None) – use the time-dim from this layer as the output-dim

  • out_spatial_dim (Dim|None)

Return type:

Data
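The first example shape above (position (B,eTs), updates (B,eTs,D), output (B,eT,D)) can be sketched with nested lists. The function name and the out_len argument (standing in for the output dim eT) are hypothetical. Duplicate indices accumulate, as in tf.scatter_nd; raising on out-of-range indices when filter_invalid_indices is False is a choice of this sketch.

```python
def scatter_nd(updates, position, out_len, filter_invalid_indices=False):
    """updates: (B, eTs, D), position: (B, eTs) -> out: (B, eT, D), eT == out_len."""
    out = []
    for upd_b, pos_b in zip(updates, position):
        dim = len(upd_b[0])
        out_b = [[0.0] * dim for _ in range(out_len)]
        for vec, idx in zip(upd_b, pos_b):
            if not 0 <= idx < out_len:
                if filter_invalid_indices:
                    continue  # index <0 or >= output dim: discarded
                raise IndexError("position %d out of range" % idx)
            for d, v in enumerate(vec):
                out_b[idx][d] += v  # duplicate indices are summed
        out.append(out_b)
    return out


updates = [[[1.0], [2.0], [3.0]]]  # B=1, eTs=3, D=1
position = [[2, 0, 2]]
print(scatter_nd(updates, position, out_len=4))  # [[[2.0], [0.0], [4.0], [0.0]]]
```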

classmethod transform_config_dict(d, network, get_layer)[source]
class returnn.tf.layers.basic.LinearLayer(activation=None, with_bias=True, grad_filter=None, forward_weights_init='glorot_uniform', bias_init=0.0, use_transposed_weights=False, **kwargs)[source]

Linear/forward/fully-connected/1x1-conv layer. Does a linear transformation on the feature-dimension of the input with an optional bias term and an optional activation function. See also DotLayer, ElemwiseProdLayer, WeightedSumLayer.

layer_class: Optional[str] = 'linear'[source]
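The linear transformation on the feature dimension, with optional bias and activation, sketched for a single frame. The function name and the (in, out) weight layout are assumptions of this sketch.

```python
def linear(x, weights, bias=None, activation=None):
    """x: (n_in,), weights: (n_in, n_out), bias: (n_out,) -> (n_out,)."""
    n_out = len(weights[0])
    y = [sum(x[i] * weights[i][j] for i in range(len(x))) for j in range(n_out)]
    if bias is not None:
        y = [v + b for v, b in zip(y, bias)]
    if activation is not None:
        y = [activation(v) for v in y]
    return y


def relu(v):
    return max(v, 0.0)


print(linear([1.0, 2.0], [[1.0, -1.0], [0.5, 0.0]], bias=[0.0, 0.5], activation=relu))
# [2.0, 0.0]
```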
class returnn.tf.layers.basic.SoftmaxLayer(**kwargs)[source]

Just a LinearLayer with activation="softmax" by default.

layer_class: Optional[str] = 'softmax'[source]
class returnn.tf.layers.basic.LengthLayer(axis='T', add_time_axis=False, dtype='int32', sparse=False, **kwargs)[source]

Returns the length of sources as (B,), via input size_placeholder.

Parameters:
  • axis (str|Dim)

  • add_time_axis (bool) – should not be used

  • dtype (str)

  • sparse (bool)

layer_class: Optional[str] = 'length'[source]
classmethod fixup_dim(dim, sources)[source]
Return type:

Dim

classmethod get_out_data_from_opts(name, sources, axis='T', add_time_axis=False, dtype='int32', sparse=False, **kwargs)[source]
Parameters:
  • name (str)

  • sources (list[LayerBase])

  • axis (str|Dim)

  • add_time_axis (bool)

  • dtype (str)

  • sparse (bool)

Return type:

Data

class returnn.tf.layers.basic.SoftmaxOverSpatialLayer(axis=None, energy_factor=None, start=None, window_start=None, window_size=None, use_time_mask=None, log_space=False, **kwargs)[source]

This applies a softmax over spatial axis/axes (currently only time axis supported). E.g. when the input is of shape (B,T,dim), the output will be (B,T,dim). It automatically masks the frames outside the seq defined by the seq-len. In contrast to SoftmaxLayer, this will not do a linear transformation. See SeqLenMaskLayer if you just want to apply a masking.

Parameters:
  • axis (Dim|str|None) – which axis to do the softmax over. “T” by default

  • energy_factor (float|None) – the energy will be scaled by this factor. This is like a temperature for the softmax. In Attention-is-all-you-need, this is set to 1/sqrt(base_ctx.dim).

  • start (LayerBase|None) – Tensor of shape (B,) indicating the start frame

  • window_start (LayerBase|int|None) – Layer with output of shape (B,) or (constant) int value indicating the window start.

  • window_size (LayerBase|int|None) – Layer with output of shape (B,) or (constant) int value indicating the window size.

  • use_time_mask (bool) – if True, assumes dyn seq len, and use it for masking. By default, if dyn seq len exists, it uses it.

  • log_space (bool) – if True, returns in log space (i.e. uses log_softmax)

layer_class: Optional[str] = 'softmax_over_spatial'[source]
recurrent = True[source]
get_dep_layers()[source]
Return type:

list[LayerBase]

classmethod get_out_data_from_opts(name, sources, axis=None, start=None, window_start=None, window_size=None, **kwargs)[source]
Return type:

Data
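A sketch of a softmax over the time axis with sequence-length masking, for one sequence and one feature. The function name is hypothetical. Frames at or beyond seq_len get probability 0, and energy_factor multiplies the energies before the softmax.

```python
import math


def softmax_over_time(energies, seq_len, energy_factor=None):
    """energies: (T,) floats; frames t >= seq_len are masked out."""
    if energy_factor is not None:
        energies = [e * energy_factor for e in energies]
    valid = energies[:seq_len]
    max_e = max(valid)  # subtract the max for numerical stability
    exps = [math.exp(e - max_e) for e in valid]
    total = sum(exps)
    return [v / total for v in exps] + [0.0] * (len(energies) - seq_len)


# The third frame is outside the sequence, so its large energy is ignored.
print(softmax_over_time([0.0, 0.0, 5.0], seq_len=2))  # [0.5, 0.5, 0.0]
```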

classmethod transform_config_dict(d, network, get_layer)[source]
class returnn.tf.layers.basic.SeqLenMaskLayer(mask_value, axis='T', seq_len_source=None, start=None, window_start=None, window_size=None, **kwargs)[source]

Masks some values away given the seq_len_source with mask_value. Also see SoftmaxOverSpatialLayer. Also see SwitchLayer, which can be used to apply a generic mask.

Parameters:
  • seq_len_source (LayerBase|None) – if not given, uses source

  • axis (Dim|str)

  • mask_value (float)

  • start (LayerBase|None) – Tensor of shape (B,) indicating the start frame

  • window_start (LayerBase|None) – Tensor of shape (B,) indicating the window start

  • window_size (LayerBase|int|None)

layer_class: Optional[str] = 'seq_len_mask'[source]
classmethod build_mask(x, axis='T', axis_allow_int=<class 'returnn.util.basic.NotSpecified'>, seq_len_source=None, start=None, window_start=None, window_size=None)[source]
Parameters:
  • x (Data)

  • axis (Dim|str|int)

  • axis_allow_int (bool|NotSpecified) – Some callers of this function would pass in an int for axis directly. In that case, explicitly set this to True.

  • seq_len_source (Data|None)

  • start (Data|None)

  • window_start (Data|None)

  • window_size (Data|int|None)

Returns:

mask which is broadcastable to energy_data, thus you can e.g. use returnn.tf.util.basic.where_bc()

Return type:

tf.Tensor

get_dep_layers()[source]
Return type:

list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
classmethod get_out_data_from_opts(name, sources, start=None, window_start=None, window_size=None, **kwargs)[source]
Parameters:
Return type:

Data

class returnn.tf.layers.basic.BooleanMaskLayer(*, mask: LayerBase, dims: Sequence[Dim], out_dim: Dim | None = None, **kwargs)[source]

Wrapper around tf.boolean_mask.

Parameters:
  • mask

  • dims

  • out_dim

layer_class: Optional[str] = 'boolean_mask'[source]
get_dep_layers() List[LayerBase][source]

dep layers

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
classmethod get_out_data_from_opts(*, name: str, sources: Sequence[LayerBase], mask: LayerBase, out_dim: Dim | None = None, **kwargs) Tensor[source]
Parameters:
  • name

  • sources

  • mask

  • out_dim

class returnn.tf.layers.basic.RandomStateInitLayer(algorithm=None, seed=None, out_dim=None, **kwargs)[source]

This calculates the initial state value for the state var of RandomLayer. This depends on the algorithm and seed.

Parameters:
  • algorithm (str|tf.random.Algorithm|None) – “philox”, “three-fry”, “auto-select”. By default “philox”; we take the default from tf.random.Generator. See tf.random.stateless_uniform() for some documentation. “auto-select” will automatically select the optimal algorithm based on the device, so it might select a different algorithm depending on the device. Note that the state shape is dependent on the device, so if you want checkpoints to be compatible across devices, do not use “auto-select”.

  • seed (int|Sequence[int]|numpy.ndarray|None) – if given, the state will deterministically depend on this (and the algorithm) and nothing else. If you have multiple random generators (state vars), make sure that you have different seeds for each! If None (default), the seed will be deterministically taken from the network random generator at construction time, which is usually a good idea. You still can change the global network seed.

  • out_dim (Dim|None) – new dim tag for random state dim

layer_class: Optional[str] = 'random_state_init'[source]
classmethod select_algorithm(algorithm)[source]
Parameters:

algorithm (str|int|tf.random.Algorithm|None)

Return type:

int

classmethod get_out_data_from_opts(name, algorithm=None, out_dim=None, **kwargs)[source]
Parameters:
  • name (str)

  • algorithm (str|None)

  • out_dim (Dim|None)

Return type:

Data

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
class returnn.tf.layers.basic.RandomLayer(shape, distribution, mean=None, stddev=None, bound=None, minval=None, maxval=None, dtype='float32', sparse_dim=None, feature_dim=None, seed=None, algorithm=None, explicit_state=None, auto_update_state=None, static=None, shape_deps=(), stop_grad: bool = False, **kwargs)[source]

Generates random numbers from uniform or normal or truncated normal distribution.

This uses the TensorFlow stateless random ops internally, i.e. all the state handling is explicit. The state var can be explicitly provided and initialized via RandomStateInitLayer, or when not provided it will be automatically created.

There are two possible distinct use cases:

  • For any randomness in the model, e.g. dropout. So each session.run step will produce a new random number and advance the random state.

  • To initialize parameters via the config, using VariableLayer with the init_by_layer option. This will only be called once when initializing the parameters. For this use case, we do not want to keep a random state var. You can just pass static=True. Alternatively you could also pass the output of a RandomStateInitLayer as explicit_state.

Parameters:
  • shape (Sequence[Dim|int])

  • distribution (str) – “uniform”, “normal” or “truncated_normal”

  • mean (int|float|LayerBase|None)

  • stddev (int|float|LayerBase|None)

  • bound (int|float|LayerBase|None) – for uniform, defining the range [-bound, bound)

  • minval (int|float|LayerBase|None) – for uniform

  • maxval (int|float|LayerBase|None) – for uniform

  • dtype (str)

  • sparse_dim (Dim|None)

  • feature_dim (Dim|None)

  • seed (int|list[int]|numpy.ndarray|None) – If not given, uses self.network.random.randint, i.e. then it is controlled by the global seed setting, and every layer would get its own seed. If you specify it explicitly, make sure every RandomLayer uses a different seed, otherwise you would get the same random numbers everywhere.

  • algorithm (str|tf.random.Algorithm|None) – see RandomStateInitLayer

  • explicit_state (LayerBase|None) – You can pass the state explicitly here. If not given, will be created automatically, and updated automatically. You could pass a VariableLayer with initial value via RandomStateInitLayer, or directly a RandomStateInitLayer. If auto_update_state is True, it must be a variable, and every time a new random number is created, this variable is updated. Otherwise (default) it will not be updated automatically.

  • auto_update_state (bool|None) – only used when you pass an explicit state

  • static (bool|None) – if no state at all should be used. it just relies on the seed then.

  • shape_deps (list[LayerBase]) – for dyn dim tags in shape

  • stop_grad (bool) – if True, will stop the gradient to mean,stddev,bound,minval,maxval
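The two use cases could look roughly like this in a network config. This is a sketch, not a tested config: the layer names, shapes and the seed are made up, and the `"variable"` class name for VariableLayer is assumed from the usual RETURNN naming. Per the description of the `static` parameter, `static=True` is what disables the state var.

```python
network = {
    # Use case 1: fresh random numbers in every step.
    # No explicit_state is given, so the state var is created automatically.
    "noise": {
        "class": "random",
        "shape": [5],
        "distribution": "normal",
        "mean": 0.0,
        "stddev": 1.0,
    },
    # Use case 2: one-time parameter initialization.
    # static=True means no state var is used; the result relies on the seed only.
    "w_init": {
        "class": "random",
        "shape": [5, 3],
        "distribution": "uniform",
        "bound": 0.1,  # range [-0.1, 0.1)
        "static": True,
        "seed": 42,  # hypothetical value; use a different seed per RandomLayer
    },
    # Assumption: VariableLayer is registered as class "variable".
    "w": {"class": "variable", "shape": [5, 3], "init_by_layer": "w_init"},
}
```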

layer_class: Optional[str] = 'random'[source]
get_dep_layers()[source]
Return type:

list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
classmethod get_out_data_from_opts(name, shape, dtype='float32', sparse_dim=None, feature_dim=None, shape_deps=(), **kwargs)[source]
Parameters:
  • name (str)

  • shape (Sequence[Dim|int])

  • dtype (str)

  • sparse_dim (Dim|None)

  • feature_dim (Dim|None)

  • shape_deps (list[LayerBase]) – for dyn dim tags in shape

Return type:

Data

class returnn.tf.layers.basic.RandIntLayer(shape, maxval, minval=0, dtype='int32', sparse_dim=None, seed=None, **kwargs)[source]

Generates random integer numbers using tf.random.uniform. It is recommended to use RandomLayer instead.

Parameters:
  • shape (tuple[Dim|int]|list[Dim|int]) – desired shape of output tensor

  • maxval (int|LayerBase) – upper bound (exclusive) on range of random values

  • minval (int|LayerBase) – lower bound (inclusive) on range of random values

  • dtype (str) – type of the output. For random ints, int32 and int64 make sense, but could also be floats

  • sparse_dim (Dim|None)

  • seed (int|None) – random seed

layer_class: Optional[str] = 'rand_int'[source]
get_dep_layers()[source]
Return type:

list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
classmethod get_out_data_from_opts(name, network, shape, maxval, minval=0, dtype='int32', sparse_dim=None, **kwargs)[source]
Parameters:
  • name (str)

  • network (returnn.tf.network.TFNetwork)

  • shape (tuple[Dim|int]|list[Dim|int]) – desired shape of output tensor

  • maxval (int|LayerBase) – upper bound (exclusive) on range of random values

  • minval (int|LayerBase) – lower bound (inclusive) on range of random values

  • dtype (str) – type of the output. For random ints, int32 and int64 make sense, but could also be floats

  • sparse_dim (Dim|None)

Return type:

Data

class returnn.tf.layers.basic.RangeLayer(limit, start=0, delta=1, dtype=None, sparse=False, out_spatial_dim=None, **kwargs)[source]

Generic wrapper around tf.range. See also RangeInAxisLayer.

Parameters:
  • limit (int|float)

  • start (int|float)

  • delta (int|float)

  • dtype (str|None)

  • sparse (bool)

  • out_spatial_dim (Dim|None)

layer_class: Optional[str] = 'range'[source]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
classmethod get_out_data_from_opts(name, limit, start=0, delta=1, dtype=None, sparse=False, out_spatial_dim=None, **kwargs)[source]
Parameters:
  • name (str)

  • limit (int|float)

  • start (int|float)

  • delta (int|float)

  • dtype (str|None)

  • sparse (bool)

  • out_spatial_dim (Dim|None)

Return type:

Data

class returnn.tf.layers.basic.RangeInAxisLayer(axis, dtype='int32', unbroadcast=False, keepdims=False, sparse=False, **kwargs)[source]

Assuming that the input is e.g. (B,T,D) and you specify axis="T", you will get an output of shape (T,), filled with tf.range over that axis. See also RangeLayer.

Parameters:
  • axis (str|Dim)

  • dtype (str)

  • unbroadcast (bool) – DEPRECATED, unsupported, and not needed

  • keepdims (bool) – DEPRECATED, unsupported, and not needed

  • sparse (bool)

layer_class: Optional[str] = 'range_in_axis'[source]
recurrent = True[source]
classmethod get_out_data_from_opts(name, sources, axis, dtype='int32', sparse=False, **kwargs)[source]
Parameters:
  • name (str)

  • sources (list[LayerBase])

  • axis (str|Dim)

  • dtype (str)

  • sparse (bool)

class returnn.tf.layers.basic.RangeFromLengthLayer(dtype='int32', sparse=False, out_spatial_dim=None, **kwargs)[source]

Given some dynamic sequence lengths as input, this creates a tf.range over the implied dimension. As a side effect, this can create a new dyn dim tag for the given sequence lengths. This side effect can be the main functionality in certain use cases. See also RangeInAxisLayer.

Consider the example:

y: {class: range_in_axis, from: x, axis: T}

This is basically equivalent to:

x_len: {class: length, from: x}
y: {class: range_from_length, from: x_len}
Parameters:
  • axis (str)

  • dtype (str)

  • sparse (bool)

  • out_spatial_dim (Dim|None)

layer_class: Optional[str] = 'range_from_length'[source]
recurrent = True[source]
classmethod get_out_data_from_opts(name, sources, dtype='int32', sparse=False, out_spatial_dim=None, **kwargs)[source]
Parameters:
  • name (str)

  • sources (list[LayerBase])

  • dtype (str)

  • sparse (bool)

  • out_spatial_dim (Dim|None)

class returnn.tf.layers.basic.BatchSoftmaxLayer(**kwargs)[source]

Softmax over the spatial and feature axes.

Parameters:
  • in_dim (Dim|None)

  • out_shape (set[Dim|returnn.tf.util.data._MarkedDim]|tuple|list|None)

  • dropout (float) – 0.0 means to apply no dropout. dropout will only be applied during training

  • dropout_axis (Dim|str|list[Dim|str]|None)

  • dropout_noise_shape (dict[Dim|str|list[Dim|str]|tuple[Dim|str],int|str|None]|None) – see Data.get_bc_shape()

  • dropout_on_forward (bool) – apply dropout during inference

  • mask (str|None) – “dropout” or “unity” or None. this is obsolete and only here for historical reasons

layer_class: Optional[str] = 'batch_softmax'[source]
classmethod get_out_data_from_opts(name, sources, **kwargs)[source]
Parameters:
Return type:

Data

class returnn.tf.layers.basic.ConstantLayer(sources, value=0.0, shape=None, dtype=None, with_batch_dim=False, sparse_dim=None, feature_dim=None, shape_deps=(), **kwargs)[source]

Output is a constant value.

Parameters:
  • sources (list[LayerBase])

  • value (int|float|bool|numpy.ndarray)

  • shape (tuple[Dim|int]|list[Dim|int]) – for verification, and defining dim tags

  • dtype (str|None)

  • with_batch_dim (bool)

  • sparse_dim (Dim|None)

  • feature_dim (Dim|None)

  • shape_deps (list[LayerBase]) – for dyn dim tags in shape

layer_class: Optional[str] = 'constant'[source]
get_dep_layers()[source]
Return type:

list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace

  • network (returnn.tf.network.TFNetwork)

  • get_layer (((str) -> LayerBase)) – function to get or construct another layer

classmethod get_out_data_from_opts(name, value=0.0, shape=None, dtype=None, with_batch_dim=False, sparse_dim=None, feature_dim=<class 'returnn.util.basic.NotSpecified'>, shape_deps=(), **kwargs)[source]
Parameters:
  • name (str)

  • value (int|float|bool)

  • shape (tuple[Dim|int]|list[Dim|int]) – for verification, and defining dim tags

  • dtype (str|None)

  • with_batch_dim (bool)

  • sparse_dim (Dim|None)

  • feature_dim (Dim|None|NotSpecified)

  • shape_deps (list[LayerBase]) – for dyn dim tags in shape

Return type:

Data

class returnn.tf.layers.basic.GatingLayer(activation, gate_activation='sigmoid', out_dim=None, **kwargs)[source]

Splits the input into two equal parts, applies the gate_activation (sigmoid by default) to one part and some other activation (e.g. tanh) to the other part, and then multiplies them element-wise. Thus, the output dimension is input-dimension / 2.

Parameters:
  • activation (str)

  • gate_activation (str)

  • out_dim (Dim|None)
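A minimal pure-Python sketch of the gating computation on a single feature vector follows. Which half receives which activation is an assumption here (first half: activation, second half: gate); the text above does not specify it.

```python
import math


def sigmoid(v):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-v))


def gating(x, activation=math.tanh, gate_activation=sigmoid):
    """Gate one feature vector x of even length; returns a vector of half the length."""
    assert len(x) % 2 == 0, "feature dimension must be even"
    half = len(x) // 2
    # Assumption: first half -> activation, second half -> gate.
    a, b = x[:half], x[half:]
    return [activation(u) * gate_activation(v) for u, v in zip(a, b)]


y = gating([0.0, 1.0, 0.0, 0.0])  # input dim 4 -> output dim 2
```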

layer_class: Optional[str] = 'gating'[source]
classmethod get_out_data_from_opts(name, sources, n_out=<class 'returnn.util.basic.NotSpecified'>, out_dim=None, **kwargs)[source]
Parameters:
Return type:

Data

class returnn.tf.layers.basic.WindowLayer(window_size=None, window_dim=None, window_left=None, window_right=None, axis='T', out_spatial_dim=None, padding='same', stride=1, _use_opt_dim_order=None, **kwargs)[source]

Adds a window dimension. By default, uses the time axis and goes over it with a sliding window. The new axis for the window is created right after the time axis. In PyTorch, this is called unfold. We sometimes call this “chunking”. There is also the similar TimeChunkingLayer.

E.g. if the input is (batch, time, dim), the output is (batch, time, window_size, dim). If you want to merge the (window_size, dim) together to (window_size * dim,), you can use the MergeDimsLayer, e.g. {"class": "merge_dims", "axes": "except_time"}.

Use stride==window_size and window_right=window_size - 1 in combination with a MergeDimsLayer to achieve feature stacking with right-hand zero padding.

This is not to take out a single window from the time-dimension. See SliceLayer or SliceNdLayer.

The inverse layer is FoldLayer.

Parameters:
  • window_size (int|None)

  • window_dim (Dim|None)

  • window_left (int|None)

  • window_right (int|None)

  • axis (Dim|str) – see Data.get_axis_from_description()

  • out_spatial_dim (Dim|None)

  • padding (str) – “same” or “valid”

  • stride (int) – return only each Nth window

  • _use_opt_dim_order (bool|None)
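The sliding-window behavior, including the feature-stacking recipe mentioned above, can be sketched in pure Python for a 1-D sequence. The helper `window` is hypothetical and takes `window_left` and `window_right` explicitly; how the layer derives them from `window_size` alone is not shown here.

```python
def window(x, window_left, window_right, stride=1, pad_value=0):
    """Sliding window over a 1-D sequence x with padding at the borders.

    Returns one window of size window_left + 1 + window_right per kept frame.
    """
    out = []
    for t in range(0, len(x), stride):
        out.append([
            x[i] if 0 <= i < len(x) else pad_value
            for i in range(t - window_left, t + window_right + 1)
        ])
    return out


x = [1, 2, 3, 4, 5]
# Centered window of size 3: one window per frame.
centered = window(x, window_left=1, window_right=1)
# Feature stacking: stride == window_size and window_right == window_size - 1.
stacked = window(x, window_left=0, window_right=2, stride=3)
```

In the stacked variant, the last window is filled up with zeros on the right, which corresponds to the right-hand zero padding mentioned above.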

layer_class: Optional[str] = 'window'[source]
recurrent = True[source]
classmethod get_out_data_from_opts(name, network, sources, window_size=None, window_dim=None, axis='T', out_spatial_dim=None, padding='same', stride=1, _use_opt_dim_order=None, **kwargs)[source]
Parameters:
Return type:

Data

classmethod get_rec_initial_extra_outputs(network, batch_dim, rec_layer, window_size=None, window_dim=None, axis='T', sources=(), **kwargs)[source]
Parameters:
Return type:

dict[str,tf.Tensor]

class returnn.tf.layers.basic.FoldLayer(mode: str, in_spatial_dim: Dim | str, window_dim: Dim | str, out_spatial_dim: Dim | None = None, padding: str = 'same', window_left: int | None = None, window_right: int | None = None, stride: int = 1, **kwargs)[source]

The inverse of WindowLayer. We sometimes call this “unchunking”. The TimeUnChunkingLayer is similar.

Input (in_spatial_dim, window_dim, other_dims…) -> output (out_spatial_dim, other_dims…).

The window_dim is folded into the out_spatial_dim. This is similar to the PyTorch fold operation (with mode="sum").

Parameters:
  • mode – “sum” or “mean” (average), for overlapping frames

  • in_spatial_dim

  • window_dim

  • out_spatial_dim

  • padding

  • window_left

  • window_right

  • stride
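To make the difference between the two modes concrete, here is a pure-Python sketch for stride 1 on a 1-D sequence. The helper `fold` and its `out_len` argument are hypothetical; the real layer works on dim tags.

```python
def fold(windows, window_left, out_len, mode="sum"):
    """Fold windows (one per frame, stride 1) back into a sequence of length out_len."""
    sums = [0.0] * out_len
    counts = [0] * out_len
    for t, win in enumerate(windows):
        for j, value in enumerate(win):
            pos = t - window_left + j  # original position of this window entry
            if 0 <= pos < out_len:  # entries from the border padding are dropped
                sums[pos] += value
                counts[pos] += 1
    if mode == "mean":
        # Average over all windows that overlap at this position.
        return [s / c for s, c in zip(sums, counts)]
    return sums


# Windows of size 3 (window_left=1, window_right=1) over the sequence [1, 2, 3].
windows = [[0, 1, 2], [1, 2, 3], [2, 3, 0]]
summed = fold(windows, window_left=1, out_len=3, mode="sum")
averaged = fold(windows, window_left=1, out_len=3, mode="mean")
```

With mode "mean", folding the windows of a sequence gives back the sequence itself, whereas "sum" scales each frame by the number of windows covering it.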

layer_class: Optional[str] = 'fold'[source]
recurrent = True[source]
classmethod get_out_data_from_opts(name: str, sources: List[LayerBase], in_spatial_dim: Dim | str, window_dim: Dim | str, out_spatial_dim: Dim | None = None, padding: str = 'same', window_left: int | None = None, window_right: int | None = None, stride: int = 1, **kwargs) Tensor[source]

out data

class returnn.tf.layers.basic.CumsumLayer(axis='T', additional_left_summand_per_element=None, reverse=False, **kwargs)[source]

Basically wraps tf.cumsum. Also supports that in the RecLayer.

Parameters:
  • axis (str) – see Data.get_axis_from_description()

  • additional_left_summand_per_element (str|int|float|None) – the order matters for tf.string

  • reverse (bool)

layer_class: Optional[str] = 'cumsum'[source]
recurrent = True[source]
classmethod get_out_data_from_opts(name, sources, axis='T', **kwargs)[source]
Parameters:
  • name (str)

  • sources (list[LayerBase])

  • axis (str)

Return type:

Data

classmethod get_rec_initial_extra_outputs(network, batch_dim, rec_layer, axis='T', sources=(), **kwargs)[source]
Parameters:
Return type:

dict[str,tf.Tensor]

class returnn.tf.layers.basic.PadLayer(*, axes: Dim | str | Sequence[Dim | str], padding: int | Tuple[int, int] | Sequence[Tuple[int, int]], out_dims: Dim | Sequence[Dim] | None = None, handle_dynamic_dims: bool | None = None, value: int | float = 0, mode: str = 'constant', **kwargs)[source]

Adds (e.g. zero) padding in some axis or axes. Also see PrefixInTimeLayer for dynamic dims.

Parameters:
  • axes – e.g. “F” etc. see Data.get_axes_from_description().

  • padding – how much to pad left/right in each axis

  • out_dims

  • handle_dynamic_dims – True: when doing right padding on a dynamic dim, value will be added after the seq end, not at the end of the dimension. False: value will be added at the end of the dimension. By default, in behavior version >=21, this is True, in older versions, this is False.

  • value – what constant value to pad, with mode==”constant”

  • mode – “constant”, “reflect”, “symmetric” and “replication”
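The effect of handle_dynamic_dims on right padding can be sketched in pure Python on a padded batch of shape (B,T). The helper `pad_right` is hypothetical and only covers constant right padding.

```python
def pad_right(batch, seq_lens, amount, value=0, handle_dynamic_dims=True):
    """Right-pad a padded batch (B, T) of sequences by `amount` frames with `value`."""
    out = []
    for row, seq_len in zip(batch, seq_lens):
        if handle_dynamic_dims:
            # Insert the pad value directly after the sequence end.
            out.append(row[:seq_len] + [value] * amount + row[seq_len:])
        else:
            # Append the pad value at the end of the (padded) dimension.
            out.append(row + [value] * amount)
    return out


batch = [[1, 2, 3], [4, 0, 0]]  # seq lens 3 and 1; the zeros are batch padding
seq_lens = [3, 1]
after_seq_end = pad_right(batch, seq_lens, amount=1, value=9, handle_dynamic_dims=True)
at_dim_end = pad_right(batch, seq_lens, amount=1, value=9, handle_dynamic_dims=False)
```

For the shorter sequence, the two variants place the pad value at different positions; only the first one keeps it adjacent to the actual data.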

layer_class: Optional[str] = 'pad'[source]
classmethod get_out_data_from_opts(name, sources, axes, padding, out_dims=None, **kwargs)[source]
Parameters:
  • name (str)

  • sources (list[LayerBase])

  • axes (Dim|str|list[Dim|str])

  • padding (list[(int,int)]|(int,int)|int)

  • out_dims (Dim|list[Dim]|None)

Return type:

Data

class returnn.tf.layers.basic.MergeDimsLayer(axes, keep_order=<class 'returnn.util.basic.NotSpecified'>, n_out=None, out_dim=None, **kwargs)[source]

Merges a list of axes into a single one. (Flatten the dims.) E.g. input is (batch, width, height, dim) and axes=(1,2), then we get (batch, width*height, dim). Or input is (batch, time, height, dim) and axes="except_time", then we get (batch, time, height*dim). See also CombineDimsLayer. When batch and time got merged, SplitBatchTimeLayer can undo this. When you want to merge batch and time, but remove the padding efficiently, i.e. flatten it, see FlattenBatchLayer.

Parameters:
  • axes (Sequence[Dim|str]) – see Data.get_axis_from_description()

  • keep_order (bool|NotSpecified) – The old default was: the axes are sorted, and then merged. Thus, the order of incoming axes will influence the result. E.g. inputs [B,S,F] and [B,F,S], with axes=["S","F"], will get different results, although the output shape is [B,S*F] in both cases. This is bad: In general, other layers in RETURNN might reorder the axes for various reasons, and all layers should behave in the same way, no matter the order. It is recommended to set keep_order=True, such that the order defined in axes defines the behavior, and not the incoming axis order. Since behavior version 6, this is already the case.

  • n_out (int|None)

  • out_dim (Dim|None)
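The problem that keep_order addresses can be reproduced with plain nested lists. This is only an illustration of the ordering issue for a single batch entry, with hypothetical helper names.

```python
def merge_two_axes(x):
    """Flatten a nested list of shape (A, B) into (A*B,), with A as the outer axis."""
    return [v for row in x for v in row]


def transpose(x):
    """Swap the two axes of a nested list."""
    return [list(col) for col in zip(*x)]


x_sf = [[1, 2, 3], [4, 5, 6]]  # one batch entry, layout (S=2, F=3)
x_fs = transpose(x_sf)  # same data, layout (F=3, S=2)

# Old default: merge in the order of the incoming axes, so the layout leaks into the result.
merged_from_sf = merge_two_axes(x_sf)
merged_from_fs = merge_two_axes(x_fs)

# keep_order=True with axes=["S", "F"]: first bring the axes into the requested order.
merged_keep_order = merge_two_axes(transpose(x_fs))
```

Both inputs hold the same data and both results have shape (S*F,), yet the element order differs unless the axes are first brought into the order given by `axes`.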

layer_class: Optional[str] = 'merge_dims'[source]
classmethod get_out_data_from_opts(name, axes, keep_order=<class 'returnn.util.basic.NotSpecified'>, sources=(), n_out=<class 'returnn.util.basic.NotSpecified'>, out_type=None, out_dim=None, **kwargs)[source]
Parameters:
Return type:

Data

class returnn.tf.layers.basic.SplitLayer(axis=None, num_splits=None, size_splits=None, out_dims=None, **kwargs)[source]

Splits one axis into multiple parts, via tf.split. self.output is simply the input copied. Each part can be accessed via the sublayers “/%i”.

Parameters:
  • axis (str|None) – feature axis by default

  • num_splits (int|None)

  • size_splits (list[int]|None)

  • out_dims (list[Dim]|None)
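Accessing the parts through the "/%i" sublayers might look as follows in a network config. This is a sketch: the layer names are made up, and the `"copy"` class name for CopyLayer is assumed from the usual RETURNN naming.

```python
network = {
    # Split the feature axis (the default) of "x" into two equally sized parts.
    "split": {"class": "split", "from": "x", "num_splits": 2},
    # Each part is addressed as "<layer name>/<index>".
    "first_half": {"class": "copy", "from": "split/0"},
    "second_half": {"class": "copy", "from": "split/1"},
}
```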

layer_class: Optional[str] = 'split'[source]
get_sub_layer(layer_name)[source]
Parameters:

layer_name (str)

Return type:

LayerBase|None

classmethod get_available_sub_layer_names(parent_layer_kwargs)[source]
Parameters:

parent_layer_kwargs (dict[str])

Return type:

list[str]

classmethod get_out_data_from_opts(sources, **kwargs)[source]
Parameters:

sources (list[LayerBase])

Return type:

Data

classmethod get_sub_layer_out_data_from_opts(layer_name, parent_layer_kwargs)[source]
Parameters:
  • layer_name (str) – name of the sub_layer (right part of ‘/’ separated path)

  • parent_layer_kwargs (dict[str]) – kwargs for the parent layer (as kwargs in cls.get_out_data_from_opts())

Returns:

Data template, class type of sub-layer, layer opts (transformed)

Return type:

(Data, type, dict[str])|None

class returnn.tf.layers.basic.SplitDimsLayer(axis, dims, pad_to_multiples=None, pad_value=0, **kwargs)[source]

Splits one axis into multiple axes. E.g. if you know that your feature-dim is composed by a window, i.e. the input is (batch, time, window * feature), you can set axis="F", dims=(window, -1), and you will get the output (batch, time, window, feature).

If the split axis has a dynamic length, exactly one of the axes that we split into needs to have a dynamic length as well. You can e.g. use this to split the input dimension into smaller “chunks” of a fixed window size. E.g. you could have input (batch, time, feature) and set axis="T", dims=(-1, window), to get output (batch, split_time, window, feature). In this case, the exact sequence lengths are lost and everything is padded to multiples of the window size using the given padding value. Use ReinterpretDataLayer to recover the original sequence lengths after merging.

Also see SplitBatchTimeLayer. Also see MergeDimsLayer which can undo this operation.

Parameters:
  • axis (Dim|str) – e.g. “F”

  • dims (tuple[Dim|int]|list[Dim|int]) – what the axis should be split into. e.g. (window, -1)

  • pad_to_multiples (bool|None) – If true, input will be padded to the next multiple of the product of the static dims, such that splitting is actually possible. By default this is done iff the axis has a dynamic size

  • pad_value (int|float) – What pad value to use for pad_to_multiples
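Splitting a dynamic time axis into fixed-size chunks, with padding to the next multiple, can be sketched for a single sequence like this. The helper name `split_time_into_chunks` is hypothetical.

```python
def split_time_into_chunks(x, window, pad_value=0):
    """Split a 1-D sequence into chunks of size `window`, padding to the next multiple."""
    remainder = len(x) % window
    if remainder:
        x = x + [pad_value] * (window - remainder)
    return [x[i:i + window] for i in range(0, len(x), window)]


# 5 frames with window 2: padded to 6 frames, then split into 3 chunks.
chunks = split_time_into_chunks([1, 2, 3, 4, 5], window=2)
```

The output no longer records that the last chunk holds only one real frame, which is why the original sequence lengths are lost.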

layer_class: Optional[str] = 'split_dims'[source]
classmethod get_out_data_from_opts(name, axis, dims, pad_to_multiples=None, sources=(), **kwargs)[source]
Parameters:
  • name (str)

  • axis (Dim|str)

  • dims (list[Dim|int]|tuple[Dim|int])

  • pad_to_multiples (bool|None)

  • sources (list[LayerBase])

Return type:

Data

class returnn.tf.layers.basic.SplitBatchTimeLayer(base, **kwargs)[source]

A very specific layer which expects to get input of shape (batch * time, …) and converts it into (batch, time, …), where it recovers the seq-lens from some other layer. See SplitDimsLayer for a more generic layer.

Parameters:

base (LayerBase) – used to recover the seq-lens

layer_class: Optional[str] = 'split_batch_time'[source]
get_dep_layers()[source]
Return type:

list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
classmethod get_out_data_from_opts(name, base, sources=(), **kwargs)[source]
Parameters:
Return type:

Data

class returnn.tf.layers.basic.ReshapeLayer(in_dims, out_dims, extra_deps=(), **kwargs)[source]

Allows reshaping (…, in_dims, …) to (…, out_dims, …) as long as prod(in_dims) == prod(out_dims).

in_dims don’t need to be directly behind each other or in that order – internally it will permute it such that it is in the right order. out_dims should be defined.

This can be used for clever indexing, slicing, padding tricks. It can also be used as an alternative to SplitDimsLayer or MergeDimsLayer.

Parameters:
  • in_dims (Sequence[Dim|str])

  • out_dims (Sequence[Dim|str])

  • extra_deps (Sequence[LayerBase]) – Just add as an additional dependency, without really using it. This is to potentially define otherwise unknown out_dims.

layer_class: Optional[str] = 'reshape'[source]
get_dep_layers()[source]
Return type:

list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace

  • network (returnn.tf.network.TFNetwork)

  • get_layer (((str) -> LayerBase)) – function to get or construct another layer

classmethod get_out_data_from_opts(name, sources, in_dims, out_dims, **kwargs)[source]
Parameters:
  • name (str)

  • sources (list[LayerBase])

  • in_dims (Sequence[Dim|str])

  • out_dims (Sequence[Dim|str])

class returnn.tf.layers.basic.FlattenBatchLayer(axis='T', batch_major=True, **kwargs)[source]

Merges one axis into the batch axis. If the axis has dynamic lengths, this would use flattening, i.e. recalculate the padding, i.e. the size changes. This basically wraps flatten_with_seq_len_mask() or flatten_with_seq_len_mask_time_major(). See also MergeDimsLayer, which does not do flattening, i.e. the size stays the same.

Parameters:
  • axis (str)

  • batch_major (bool) – if False, will flatten in time-major manner
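Flattening with recalculated padding can be sketched on a padded batch of shape (B,T). This is an illustration only; the helper `flatten_batch` is hypothetical and the exact frame order of the real functions is assumed.

```python
def flatten_batch(batch, seq_lens, batch_major=True):
    """Merge the time axis of a padded batch (B, T) into the batch axis, dropping padding."""
    if batch_major:
        # All frames of sequence 0, then all frames of sequence 1, etc.
        return [row[t] for row, n in zip(batch, seq_lens) for t in range(n)]
    # Time-major: frame 0 of all sequences, then frame 1 of all sequences, etc.
    max_len = max(seq_lens)
    return [
        row[t]
        for t in range(max_len)
        for row, n in zip(batch, seq_lens)
        if t < n
    ]


batch = [[1, 2, 3], [4, 5, 0]]  # the trailing 0 is padding
flat_bm = flatten_batch(batch, seq_lens=[3, 2])
flat_tm = flatten_batch(batch, seq_lens=[3, 2], batch_major=False)
```

The result has 5 entries instead of B*T = 6, since the padded frame is removed. A plain merge of the two axes would keep all 6.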

layer_class: Optional[str] = 'flatten_batch'[source]
classmethod get_out_data_from_opts(sources, name, axis='T', batch_major=True, **kwargs)[source]
Parameters:
  • sources (list[LayerBase])

  • name (str)

  • axis (str)

  • batch_major (bool) – if False, will flatten in time-major manner

Return type:

Data

class returnn.tf.layers.basic.UnflattenBatchLayer(**kwargs)[source]

Inverse of FlattenBatchLayer, i.e. recovers an axis that was previously merged into the batch axis.

This basically wraps unflatten_with_seq_len_mask().

Parameters:
  • in_dim (Dim|None)

  • out_shape (set[Dim|returnn.tf.util.data._MarkedDim]|tuple|list|None)

  • dropout (float) – 0.0 means to apply no dropout. dropout will only be applied during training

  • dropout_axis (Dim|str|list[Dim|str]|None)

  • dropout_noise_shape (dict[Dim|str|list[Dim|str]|tuple[Dim|str],int|str|None]|None) – see Data.get_bc_shape()

  • dropout_on_forward (bool) – apply dropout during inference

  • mask (str|None) – “dropout” or “unity” or None. this is obsolete and only here for historical reasons

layer_class: Optional[str] = 'unflatten_batch'[source]
classmethod get_out_data_from_opts(sources, name, **kwargs)[source]
Parameters:
Return type:

Data

class returnn.tf.layers.basic.UnflattenNdLayer(sizes, num_axes, in_dim='T', out_dims=None, declare_same_sizes_as=None, **kwargs)[source]

This keeps the batch axis as-is, i.e. the flattening/unflattening did not happen on the batch axis.

Example:

Assumes that the input is of shape (B,T,<Ds>) which represents flattened images, where each image is of size width * height. We additionally provide these image sizes (shape (B,2)), i.e. (width,height) tuples. We return the unflattened images of shape (B,W,H,<Ds>), where W/H are the max width/height.

This basically wraps returnn.tf.util.basic.unflatten_nd().

Parameters:
  • sizes (LayerBase)

  • num_axes (int)

  • in_dim (Dim|str|None)

  • out_dims (list[Dim]|None)

  • declare_same_sizes_as (dict[int,LayerBase]|None)

layer_class: Optional[str] = 'unflatten_nd'[source]
recurrent = True[source]
get_dep_layers()[source]
Return type:

list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
classmethod get_out_data_from_opts(name, sources, num_axes, in_dim='T', out_dims=None, declare_same_sizes_as=None, **kwargs)[source]
Parameters:
  • name (str)

  • sources (list[LayerBase])

  • num_axes (int)

  • in_dim (Dim|str|None)

  • out_dims (list[Dim]|None)

  • declare_same_sizes_as (dict[int,LayerBase]|None)

Return type:

Data

class returnn.tf.layers.basic.ExpandDimsLayer(axis, dim=1, **kwargs)[source]

Adds some axis.

Parameters:
  • axis (str|int) – axis to add, e.g. “F”|”feature” or “spatial”|”time”|”T”. if this is an integer, the input data is first converted into batch-major mode, and then this is counted with batch-dim.

  • dim (int|Dim) – dimension of new axis (1 by default)

layer_class: Optional[str] = 'expand_dims'[source]
classmethod get_out_data_from_opts(name, axis, dim=1, sources=(), **kwargs)[source]
Parameters:
  • name (str)

  • axis (str|int)

  • dim (int|Dim)

  • sources (list[LayerBase])

Return type:

Data

class returnn.tf.layers.basic.RepeatLayer(repetitions, axis='T', out_dim=None, **kwargs)[source]

A wrapper around tf.repeat, but supports an additional batch axis for the durations. The sum of the repetitions has to be non-zero for each sequence in the batch.

This layer can only be used with TensorFlow 1.15.0 or newer.

Parameters:
  • repetitions (LayerBase|int) – number of repetitions for each sequence and position in target axis. Can be [B,T] or [T,B] or some subset of that shape

  • axis (Dim|str) – (dynamic) axis for repetition (currently only time axis is supported)

  • out_dim (Dim|None)
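Per-sequence repetitions along the time axis can be sketched in pure Python as follows. The helper `repeat` is hypothetical and returns ragged lists instead of a padded tensor.

```python
def repeat(x, repetitions):
    """Repeat each frame of a batch (B, T) by its own count; returns ragged lists."""
    out = []
    for row, reps in zip(x, repetitions):
        out.append([v for v, r in zip(row, reps) for _ in range(r)])
    return out


x = [["a", "b", "c"], ["d", "e", "f"]]
repetitions = [[2, 0, 1], [1, 1, 1]]  # shape (B, T); the sum per sequence must be non-zero
repeated = repeat(x, repetitions)
new_seq_lens = [len(row) for row in repeated]
```

A repetition count of 0 removes a frame. The new sequence length of each batch entry is the sum of its repetitions.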

layer_class: Optional[str] = 'repeat'[source]
get_dep_layers()[source]
Return type:

list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
classmethod get_out_data_from_opts(name, sources, axis, repetitions, out_dim=None, **kwargs)[source]
Parameters:
Return type:

Data

class returnn.tf.layers.basic.TileLayer(multiples, out_dims=None, **kwargs)[source]

A wrapper around tf.tile

Parameters:
  • multiples (dict[Dim|str, int]) – number of multiples per axis (axis provided as dim tag or str desc)

  • out_dims (dict[Dim|str, Dim]|None)

layer_class: Optional[str] = 'tile'[source]
classmethod get_out_data_from_opts(name, sources, multiples, out_dims=None, **kwargs)[source]
Parameters:
  • name (str)

  • sources (list[LayerBase])

  • multiples (dict[Dim|str, int])

  • out_dims (dict[Dim|str, Dim]|None)

Return type:

Data

class returnn.tf.layers.basic.CastLayer(dtype, output, **kwargs)[source]

Cast to some other dtype.

Parameters:
  • dtype (str)

  • output (Data)

layer_class: Optional[str] = 'cast'[source]
classmethod get_out_data_from_opts(dtype, **kwargs)[source]
Parameters:

dtype (str)

Return type:

Data

class returnn.tf.layers.basic.SwapAxesLayer(axis1, axis2, **kwargs)[source]

Swaps two axes. Basically a wrapper around returnn.tf.util.basic.swapaxes(). Note that usually, this should not be needed, and using it is not recommended, as it is unnecessarily inefficient. Normally, all RETURNN layers will automatically transpose the input data into whatever format they need.

All axes always have a special meaning (e.g. feature dim or time dim) or dimension tag (e.g. for time axes, including dyn seq lengths). If you need to change the meaning (and not actually transpose / swap axes), you need to use ReinterpretDataLayer.

See also TransposeLayer for a more generic variant.

See also ReinterpretDataLayer, which does not swap/transpose axes, but allows to reinterpret their meaning / dim tags.

Parameters:
  • axis1 (int|str)

  • axis2 (int|str)

layer_class: Optional[str] = 'swap_axes'[source]
classmethod get_out_data_from_opts(name, sources, axis1, axis2, **kwargs)[source]
Parameters:
  • name (str)

  • sources (list[LayerBase])

  • axis1 (int|str)

  • axis2 (int|str)

Return type:

Data

class returnn.tf.layers.basic.TransposeLayer(perm: Dict[Dim | str | int, Dim | str] | Sequence[Dim], **kwargs)[source]

Basically a wrapper around tf.transpose().

Note that usually, this should not be needed, and using it is not recommended, as it is unnecessarily inefficient. Normally, all RETURNN layers will automatically transpose the input data into whatever format they need.

All axes always have a special meaning (e.g. feature dim or time dim) or dimension tag (e.g. for time axes, including dyn seq lengths). If you need to change the meaning (and not actually transpose / swap axes), you need to use ReinterpretDataLayer.

See also ReinterpretDataLayer, which does not transpose axes but lets you reinterpret their meaning / dim tags.

One valid use case is to use this for the final output layer, to make sure the output is in the correct format.

Parameters:

perm – target axis -> source axis

layer_class: Optional[str] = 'transpose'[source]
classmethod transpose(input_data: Tensor, perm: Dict[Dim | str | int, Dim | str] | Sequence[Dim], name: str | None = None) Tensor[source]
Parameters:
  • input_data

  • perm

  • name

Returns:

transposed data

classmethod get_perm_int(input_data: Tensor, perm: Dict[Dim | str | int, Dim | str] | Sequence[Dim]) List[int][source]
Parameters:
  • input_data

  • perm

classmethod get_out_data_from_opts(name, sources, perm, **kwargs)[source]
Parameters:
  • name (str)

  • sources (list[LayerBase])

  • perm (dict[str,str]) – target axis -> source axis

Return type:

Data
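
The perm mapping (target axis -> source axis) can be illustrated on plain axis names. The following is only a sketch and not the RETURNN implementation: perm_to_int is a hypothetical helper, axes are given as strings only (real keys may also be Dim objects or ints), and it assumes that axes not listed in perm stay in place.

```python
def perm_to_int(axis_names, perm):
    """Return an integer permutation in the style of tf.transpose.

    axis_names: current axis order, e.g. ["B", "T", "F"].
    perm: target axis -> source axis. Axes not listed stay in place
        (an assumption of this sketch).
    """
    result = []
    for target in axis_names:
        source = perm.get(target, target)
        result.append(axis_names.index(source))
    if sorted(result) != list(range(len(axis_names))):
        raise ValueError("perm is not a valid permutation: %r" % (perm,))
    return result


# Swap time and feature axis of a (batch, time, feature) input.
int_perm = perm_to_int(["B", "T", "F"], {"T": "F", "F": "T"})
```

Here the output axis at the position of “T” takes its values from the input axis “F” and vice versa, while the batch axis is untouched.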

class returnn.tf.layers.basic.ReinterpretDataLayer(switch_axes=None, size_base=None, batch_dim_base=None, set_axes=None, set_dim_tags=None, enforce_batch_major=False, enforce_time_major=False, set_sparse=None, set_sparse_dim=<class 'returnn.util.basic.NotSpecified'>, increase_sparse_dim=None, **kwargs)[source]

Acts like the CopyLayer but reinterprets the role of some axes or data.

Parameters:
  • switch_axes (str|list[str]) – e.g. “bt” to switch batch and time axes

  • size_base (LayerBase|None) – copy the size_placeholder from the given layer

  • batch_dim_base (LayerBase|None) – copy the batch dim from this layer

  • set_axes (dict[str,Dim|str|None]) – This can be used to overwrite the special axes like time_dim_axis or feature_dim_axis. For that, use the keys “B”, “T” or “F”, and a value as accepted by Data.get_axis_from_description().

  • set_dim_tags (dict[str|Dim,Dim]|Sequence[Tuple[Dim,Dim]]|None) – axis -> new dim tag. assigns new dim tags. If the passed dim tag is yet undefined, this will not use same_dim_tags_as (declare_same_as) but create a new dim tag. This option is useful for generalized self attention (https://github.com/rwth-i6/returnn/issues/391).

  • enforce_batch_major (bool)

  • enforce_time_major (bool)

  • set_sparse (bool|None) – if bool, set sparse value to this

  • set_sparse_dim (Dim|int|None|NotSpecified) – set sparse dim to this. assumes that it is sparse

  • increase_sparse_dim (int|None) – add this to the dim. assumes that it is sparse

layer_class: Optional[str] = 'reinterpret_data'[source]
get_dep_layers()[source]
Return type:

list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
classmethod get_out_data_from_opts(name, sources, switch_axes=None, size_base=None, batch_dim_base=None, set_axes=None, set_dim_tags=None, enforce_batch_major=False, enforce_time_major=False, set_sparse=None, set_sparse_dim=<class 'returnn.util.basic.NotSpecified'>, increase_sparse_dim=None, **kwargs)[source]
Parameters:
  • name (str)

  • sources (list[LayerBase])

  • switch_axes (str|list[str]) – e.g. “bt” to switch batch and time axes

  • size_base (LayerBase|None) – similar to size_target

  • batch_dim_base (LayerBase|None)

  • set_axes (dict[str,Dim|str|None])

  • set_dim_tags (dict[str|Dim,Dim]|Sequence[Tuple[Dim,Dim]]|None)

  • enforce_batch_major (bool)

  • enforce_time_major (bool)

  • set_sparse (bool|None) – if bool, set sparse value to this

  • set_sparse_dim (Dim|int|None|NotSpecified) – set sparse dim to this. assumes that it is sparse

  • increase_sparse_dim (int|None) – add this to the dim. assumes that it is sparse

class returnn.tf.layers.basic.ConvLayer(filter_size, padding, strides=1, dilation_rate=1, groups=1, input_expand_dims=0, input_add_feature_dim=False, input_split_feature_dim=None, in_dim=None, in_spatial_dims=None, n_out=None, out_dim=None, out_spatial_dims=None, auto_use_channel_first=<class 'returnn.util.basic.NotSpecified'>, with_bias=<class 'returnn.util.basic.NotSpecified'>, activation=None, forward_weights_init='glorot_uniform', bias_init=0.0, filter=None, filter_perm=None, bias=None, use_time_mask=False, pad_seq_len_to_power=None, **kwargs)[source]

A generic convolution layer which supports 1D, 2D and 3D convolution. Pooling can be done in the separate “pool” layer.

Parameters:
  • filter_size (Sequence[Dim]|Sequence[int]) – (width,), (height,width) or (depth,height,width) for 1D/2D/3D conv. The input data ndim must match, or you can add dimensions via input_expand_dims or input_add_feature_dim. It will automatically swap the batch-dim to the first axis of the input data.

  • padding (str) – “same”, “valid” or “same_static”. “same_static” is calculated differently depending on whether an axis is static or dynamic. For static axes, “same_static” padding is the same as “same” padding, i.e. filter_size - 1 - (T + strides - 1) % strides. For dynamic axes, “same_static” calculates the total padding size as filter_size - 1, i.e. it is independent of the length T of the axis and the striding. For dynamic axes, to avoid skipping any frames on the right, we set left_padding = (filter_size - strides) // 2.

  • strides (int|Sequence[int]) – strides for the spatial dims, i.e. length of this tuple should be the same as filter_size, or a single int.

  • dilation_rate (int|Sequence[int]) – dilation for the spatial dims

  • groups (int) – grouped convolution

  • in_dim (Dim|None)

  • in_spatial_dims (Sequence[Dim|str]|None)

  • n_out (int|None) – number of outgoing features

  • out_dim (Dim|None)

  • out_spatial_dims (Sequence[Dim]|None)

  • input_expand_dims (int) – number of spatial dims to add to the input

  • input_add_feature_dim (bool) – will add a dim at the end and use input-feature-dim == 1, and use the original input feature-dim as a spatial dim.

  • input_split_feature_dim (None|int) – if set, like input_add_feature_dim it will add a new feature dim which is of value input_split_feature_dim, and the original input feature dim will be divided by input_split_feature_dim, thus it must be a multiple of that value.

  • auto_use_channel_first (bool|NotSpecified) – convert the input to NCHW or not

  • with_bias (bool|NotSpecified) – if True, will add a bias to the output features. True by default since behavior version 10.

  • activation (None|str) – if set, will apply this function at the end

  • filter (LayerBase|None) – if given, will not create an own parameter, but use this as the filter

  • filter_perm (dict[str,str]|None) – transposes the filter (input filter as layer)

  • bias (LayerBase|None) – if given, will not create an own parameter, but use this as the bias

  • use_time_mask (bool)

  • pad_seq_len_to_power (Optional[float]) – pad sequence length to power of given number to reduce number of different sequence lengths. See https://github.com/rwth-i6/returnn/issues/1450 and https://github.com/tensorflow/tensorflow/issues/62441.
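
The padding amounts described for padding=“same_static” can be written out for a single axis. This is a sketch of the formulas above, not the RETURNN code: same_static_padding is a hypothetical helper, and the left/right split for static axes (left = total // 2) is an assumption taken from the usual TensorFlow “same” convention. It also assumes filter_size >= strides so that no amount becomes negative.

```python
def same_static_padding(filter_size, stride, length=None):
    """Return (total, left, right) padding for one spatial axis.

    length: static axis length T, or None for a dynamic axis.
    """
    if length is not None:
        # Static axis: same total amount as "same" padding.
        total = filter_size - 1 - (length + stride - 1) % stride
        # Assumption: TF-style split, any odd extra frame goes to the right.
        left = total // 2
    else:
        # Dynamic axis: independent of the length T and of the striding.
        total = filter_size - 1
        left = (filter_size - stride) // 2
    return total, left, total - left


static_pad = same_static_padding(filter_size=3, stride=2, length=10)
dynamic_pad = same_static_padding(filter_size=3, stride=2)
```

For the dynamic case the total padding does not change with the sequence length, so every sequence in a batch is padded identically.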

layer_class: Optional[str] = 'conv'[source]
recurrent = True[source]
classmethod set_output_dim_tags(output, num_batch_dims, in_spatial_dims, out_spatial_dims, filter_size, strides, dilation_rate, padding)[source]
Parameters:
  • output (Data)

  • num_batch_dims (int)

  • in_spatial_dims (Sequence[Dim])

  • out_spatial_dims (Sequence[Dim]|None)

  • filter_size (Sequence[int|Dim])

  • strides (Sequence[int])

  • dilation_rate (Sequence[int])

  • padding (str)

classmethod transform_input(input_data, network, in_dim=None, in_spatial_dims=None, input_expand_dims=0, input_split_feature_dim=None, input_add_feature_dim=False, use_time_mask=False)[source]
Parameters:
  • input_data (Data)

  • network (returnn.tf.network.TFNetwork)

  • in_dim (Dim|None)

  • in_spatial_dims (list[Dim|str]|None)

  • input_expand_dims (int) – number of spatial dims to add to the input

  • input_split_feature_dim (None|int) – if set, like input_add_feature_dim it will add a new feature dim which is of value input_split_feature_dim, and the original input feature dim will be divided by input_split_feature_dim, thus it must be a multiple of that value.

  • input_add_feature_dim (bool) – will add a dim at the end and use input-feature-dim == 1, and use the original input feature-dim as a spatial dim.

  • use_time_mask (bool)

Returns:

(transformed input, num batch dims). all batch dims are at the front

Return type:

(Data, int)

classmethod get_input_placeholder_with_same_static_padding(input_data: Tensor, num_batch_dims: int, filter_size: Sequence[int], strides: Sequence[int], out_batch_feature_major: bool) Tensor[source]

Returns the placeholder of input_data with same_static padding applied to it.

Parameters:
  • input_data

  • num_batch_dims

  • filter_size

  • strides

  • out_batch_feature_major

classmethod calc_out_dim(in_dim, filter_size, stride, padding, dilation_rate=1)[source]
Parameters:
  • in_dim (T|int|tf.Tensor|Dim) – dimension in some axis

  • filter_size (int|Dim) – e.g. 2, for the corresponding axis

  • stride (int) – e.g. 1, for the corresponding axis

  • dilation_rate (int) – e.g. 1

  • padding (str) – “valid” or “same”

Returns:

the output dimension

Return type:

T
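
For integer inputs, the output dimension follows the usual TensorFlow convention for “valid” and “same” padding. The following sketch uses that convention and is not copied from RETURNN; conv_out_len is a hypothetical name, and only plain ints are handled (no tf.Tensor or Dim).

```python
def conv_out_len(in_len, filter_size, stride, padding, dilation_rate=1):
    """Output length of one convolved axis for plain integer sizes."""
    padding = padding.lower()
    if padding == "same":
        # ceil(in_len / stride)
        return -(-in_len // stride)
    if padding == "valid":
        # Extent covered by the (dilated) filter.
        effective = (filter_size - 1) * dilation_rate + 1
        # ceil((in_len - effective + 1) / stride), never negative.
        return max(-(-(in_len - effective + 1) // stride), 0)
    raise ValueError("unsupported padding %r" % padding)


valid_len = conv_out_len(10, filter_size=3, stride=2, padding="valid")
same_len = conv_out_len(10, filter_size=3, stride=2, padding="same")
```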

classmethod get_out_data_from_opts(name, sources, network, filter_size, padding, strides=1, dilation_rate=1, input_expand_dims=0, input_add_feature_dim=False, input_split_feature_dim=None, in_dim=None, in_spatial_dims=None, n_out=None, out_dim=None, out_spatial_dims=None, auto_use_channel_first=<class 'returnn.util.basic.NotSpecified'>, **kwargs)[source]
Parameters:
  • name (str)

  • sources (Sequence[LayerBase])

  • network (returnn.tf.network.TFNetwork)

  • filter_size (Sequence[int|Dim])

  • padding (str)

  • strides (int|Sequence[int])

  • dilation_rate (int|Sequence[int])

  • input_expand_dims (int) – number of spatial dims to add to the input

  • input_add_feature_dim (bool)

  • input_split_feature_dim (None|int)

  • in_dim (Dim|None)

  • in_spatial_dims (Sequence[Dim|str]|None)

  • n_out (int|None) – number of outgoing features

  • out_dim (Dim|None)

  • out_spatial_dims (Sequence[Dim]|None)

  • auto_use_channel_first (bool|NotSpecified)

get_dep_layers()[source]
Return type:

list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
class returnn.tf.layers.basic.PoolLayer(mode, pool_size, padding='VALID', dilation_rate=1, strides=None, in_dim=None, in_spatial_dims=None, out_dim=None, out_spatial_dims=None, use_channel_first=<class 'returnn.util.basic.NotSpecified'>, use_time_mask=False, **kwargs)[source]

A generic N-D pooling layer. This would usually be done after a convolution for down-sampling.

Parameters:
  • mode (str) – “max” or “avg”

  • pool_size (tuple[int]) – shape of the window of each reduce

  • padding (str) – “same”, “valid” or “same_static”. “same_static” is calculated differently depending on whether an axis is static or dynamic. For static axes, “same_static” padding is the same as “same” padding, i.e. filter_size - 1 - (T + strides - 1) % strides. For dynamic axes, “same_static” calculates the total padding size as filter_size - 1, i.e. it is independent of the length T of the axis and the striding. For dynamic axes, to avoid skipping any frames on the right, we set left_padding = (filter_size - strides) // 2.

  • dilation_rate (tuple[int]|int)

  • strides (tuple[int]|int|None) – in contrast to tf.nn.pool, the default (if it is None) will be set to pool_size

  • in_dim (Dim|None)

  • in_spatial_dims (list[Dim|str]|None)

  • out_dim (Dim|None)

  • out_spatial_dims (list[Dim]|None)

  • use_channel_first (bool|NotSpecified) – if set, will transform input to NCHW format

  • use_time_mask (bool)
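
A pooling layer typically follows a convolution in the network dict. The fragment below is a minimal sketch of such a config: the layer names “conv” and “pool”, the sizes, and the use of “data” as input are illustrative assumptions, while the option names are the ones documented above.

```python
network = {
    "conv": {
        "class": "conv",
        "from": "data",
        "filter_size": (3,),  # 1D convolution over one spatial axis
        "padding": "same",
        "n_out": 32,
    },
    "pool": {
        "class": "pool",
        "from": "conv",
        "mode": "max",
        "pool_size": (2,),
        "padding": "valid",
        # "strides" is omitted on purpose: it then defaults to pool_size.
    },
}
```

With strides left at its default, the pooling windows do not overlap, so this setup roughly halves the length of the pooled axis.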

layer_class: Optional[str] = 'pool'[source]
recurrent = True[source]
classmethod get_out_data_from_opts(name, sources, network, pool_size, strides=None, dilation_rate=1, padding='VALID', in_dim=None, in_spatial_dims=None, out_dim=None, out_spatial_dims=None, use_channel_first=<class 'returnn.util.basic.NotSpecified'>, **kwargs)[source]
Parameters:
  • name (str)

  • sources (list[LayerBase])

  • network (returnn.tf.network.TFNetwork)

  • pool_size (tuple[int]|list[int])

  • strides (tuple[int]|list[int]|int)

  • dilation_rate (int|tuple[int]|list[int])

  • padding (str)

  • in_dim (Dim|None)

  • in_spatial_dims (list[Dim|str]|None)

  • out_dim (Dim|None)

  • out_spatial_dims (list[Dim]|None)

  • use_channel_first (bool|NotSpecified)

Return type:

Data

class returnn.tf.layers.basic.DctLayer(type=2, n=None, norm=None, **kwargs)[source]

Layer to perform a DCT. Wraps tf.signal.dct(). For further documentation on the input arguments, refer to https://www.tensorflow.org/api_docs/python/tf/signal/dct

Parameters:
  • type (int) – DCT type to perform. Must be 1, 2, 3, or 4

  • n (int|None) – length of the transform

  • norm (str|None) – normalization to apply. Must be None or “ortho”

layer_class: Optional[str] = 'dct'[source]
recurrent = True[source]
classmethod get_out_data_from_opts(name, sources, **kwargs)[source]
Parameters:
Return type:

Data

class returnn.tf.layers.basic.TransposedConvLayer(filter_size, strides=None, padding='same', remove_padding=0, output_padding=None, in_dim=None, in_spatial_dims=None, out_dim=None, out_spatial_dims=None, with_bias=True, activation=None, forward_weights_init='glorot_uniform', bias_init=0.0, filter=None, filter_perm=None, bias=None, use_time_mask=False, **kwargs)[source]

Transposed convolution, sometimes also called deconvolution. See tf.nn.conv2d_transpose() (currently we support 1D/2D).

Parameters:
  • filter_size (list[int])

  • strides (list[int]|None) – specifies the upscaling. by default, same as filter_size

  • padding (str) – “same” or “valid”

  • remove_padding (list[int]|int)

  • output_padding (list[int|None]|int|None)

  • in_dim (Dim|None)

  • in_spatial_dims (list[Dim|str]|None)

  • out_dim (Dim|None)

  • out_spatial_dims (list[Dim]|None)

  • with_bias (bool) – whether to add a bias. enabled by default.

  • activation (str|None)

  • forward_weights_init

  • bias_init

  • filter (LayerBase|None) – if given, will not create an own parameter, but use this as the filter

  • filter_perm (dict[str,str]|None) – transposes the filter (input filter as layer)

  • bias (LayerBase|None) – if given, will not create an own parameter, but use this as the bias

  • use_time_mask (bool)

layer_class: Optional[str] = 'transposed_conv'[source]
recurrent = True[source]
static deconv_output_length(input_length, filter_size, padding, output_padding=None, stride=0, dilation=1, out_dim=None)[source]

Determines output length of a transposed convolution given input length. Copied from conv_utils.deconv_output_length, adapted with simplification.

Also see ConvLayer.calc_out_dim().

Parameters:
  • input_length (T|int|tf.Tensor|Dim)

  • filter_size (int)

  • padding (str) – one of “same”, “valid”, “full”.

  • output_padding (int|None) – amount of padding along the output dimension. Can be set to None in which case the output length is inferred.

  • stride (int)

  • dilation (int)

  • out_dim (Dim|None)

Returns:

The output length (integer)

Return type:

T
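
The length computation can be sketched for plain integers. The code below follows the formula of the Keras helper conv_utils.deconv_output_length as commonly documented; the RETURNN version is described above as an adapted and simplified copy, so details may differ. The function name and the stride=1 default are choices of this sketch.

```python
def deconv_output_length_sketch(
    input_length, filter_size, padding, output_padding=None, stride=1, dilation=1
):
    """Output length of a transposed convolution along one axis (ints only)."""
    if padding not in ("same", "valid", "full"):
        raise ValueError("unsupported padding %r" % padding)
    # Effective filter size when dilation is used.
    filter_size = filter_size + (filter_size - 1) * (dilation - 1)
    if output_padding is None:
        # Infer the output length from the padding mode.
        if padding == "valid":
            return input_length * stride + max(filter_size - stride, 0)
        if padding == "full":
            return input_length * stride - (stride + filter_size - 2)
        return input_length * stride  # "same"
    pad = {"same": filter_size // 2, "valid": 0, "full": filter_size - 1}[padding]
    return (input_length - 1) * stride + filter_size - 2 * pad + output_padding


same_len = deconv_output_length_sketch(5, filter_size=3, padding="same", stride=2)
valid_len = deconv_output_length_sketch(5, filter_size=3, padding="valid", stride=2)
```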

classmethod get_out_data_from_opts(name, sources, network, filter_size, strides=None, padding='same', remove_padding=0, output_padding=None, n_out=None, out_dim=None, out_spatial_dims=None, in_dim=None, in_spatial_dims=None, **kwargs)[source]
Parameters:
  • name (str)

  • sources (list[LayerBase])

  • network (returnn.tf.network.TFNetwork)

  • filter_size (list[int])

  • strides (list[int]|None)

  • padding (str)

  • remove_padding (list[int]|int)

  • output_padding (list[int|None]|int|None)

  • n_out (int|None) – number of outgoing features

  • out_dim (Dim|None)

  • out_spatial_dims (list[Dim]|None)

  • in_dim (Dim|None)

  • in_spatial_dims (list[Dim|str]|None)

Return type:

Data

get_dep_layers()[source]
Return type:

list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
class returnn.tf.layers.basic.ReduceLayer(mode, axes=None, axis=None, keep_dims=False, enforce_batch_dim_axis=None, use_time_mask=None, **kwargs)[source]

This reduces some axis by using e.g. “sum” or “max”. It’s basically a wrapper around tf.reduce_sum or tf.reduce_max.

Parameters:
  • mode (str) – one of “sum”, “max”, “argmin”, “min”, “argmax”, “mean”, “logsumexp”

  • axes (Sequence[Dim|str]) – One or multiple axes to reduce. It accepts the special tokens “B”|“batch”, “spatial”, “spatial_except_time”, or “F”|“feature”, and it is strongly recommended to use some of these symbolic names. See Data.get_axes_from_description().

  • axis (Dim|str) – for compatibility, can be used instead of axes

  • keep_dims (bool) – if dimensions should be kept (will be 1)

  • enforce_batch_dim_axis (int|None) – will swap the batch-dim-axis of the input with the given axis. e.g. 0: will convert the input into batch-major format if not already like that. Note that this is still not enough in some cases, e.g. when the other axes are also not as expected. The strong recommendation is to use a symbolic axis description.

  • use_time_mask (bool) – if we reduce over the time-dim axis, use the seq len info. By default, in that case, it will be True.
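
The effect of use_time_mask can be illustrated without TensorFlow. The sketch below reduces padded sequences over the time axis while ignoring frames beyond each sequence length; masked_reduce_time is a hypothetical helper, only three modes are covered, and every sequence length is assumed to be at least 1.

```python
def masked_reduce_time(batch, seq_lens, mode="mean"):
    """Reduce each padded sequence over time, using only the valid frames.

    batch: list of equally long (padded) lists of floats.
    seq_lens: number of valid frames per sequence (each >= 1).
    """
    out = []
    for seq, n in zip(batch, seq_lens):
        valid = seq[:n]  # frames after position n are padding
        if mode == "sum":
            out.append(sum(valid))
        elif mode == "max":
            out.append(max(valid))
        elif mode == "mean":
            out.append(sum(valid) / n)
        else:
            raise ValueError("unsupported mode %r" % mode)
    return out


padded = [[1.0, 2.0, 3.0], [4.0, 0.0, 0.0]]
means = masked_reduce_time(padded, seq_lens=[3, 1], mode="mean")
```

Without the mask, the mean of the second sequence would be pulled towards the padding value, and a “max” over all-negative frames would wrongly return the padding.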

layer_class: Optional[str] = 'reduce'[source]
classmethod reduce(input_data, mode, axes=None, keep_dims=False, enforce_batch_dim_axis=None, use_time_mask=None)[source]
Parameters:
  • input_data (Data)

  • mode (str) – one of “sum”, “max”, “argmin”, “min”, “argmax”, “mean”, “logsumexp”

  • axes (int|list[int]|str) – One or multiple axes to reduce. It accepts the special tokens “B”|“batch”, “spatial”, “spatial_except_time”, or “F”|“feature”, and it is strongly recommended to use some of these symbolic names. See Data.get_axes_from_description().

  • keep_dims (bool) – if dimensions should be kept (will be 1)

  • enforce_batch_dim_axis (int) – will swap the batch-dim-axis of the input with the given axis. e.g. 0: will convert the input into batch-major format if not already like that. Note that this is still not enough in some cases, e.g. when the other axes are also not as expected. The strong recommendation is to use a symbolic axis description.

  • use_time_mask (bool) – if we reduce over the time-dim axis, use the seq len info. By default, in that case, it will be True.

Return type:

tf.Tensor

classmethod need_enforce_batch_dim_axis(axes)[source]
Parameters:

axes (int|list[int]|str|Dim)

Returns:

whether any integer is in axes, in which case a fixed dimension layout is needed

Return type:

bool

classmethod get_axes(axis, input_data)[source]
Parameters:
  • axis – see self.__init__()

  • input_data (Data)

Returns:

list of axes

Return type:

list[int]

classmethod get_out_data_from_opts(name, sources, mode='', axes=None, axis=None, keep_dims=False, enforce_batch_dim_axis=None, **kwargs)[source]
Parameters:
  • name (str)

  • sources (list[LayerBase])

  • mode (str) – (default here “” because other code uses this function)

  • axes (str|list[str]|None)

  • axis (str|None)

  • keep_dims (bool)

  • enforce_batch_dim_axis (int|None)

Return type:

Data

class returnn.tf.layers.basic.ReduceOutLayer(mode, num_pieces, out_dim=None, **kwargs)[source]

Combination of SplitDimsLayer applied to the feature dim and ReduceLayer applied to the resulting feature dim. This can e.g. be used to do maxout.

Parameters:
  • mode (str) – “sum” or “max” or “mean”

  • num_pieces (int) – how many elements to reduce. The output dimension will be input.dim // num_pieces.

  • out_dim (Dim|None)

layer_class: Optional[str] = 'reduce_out'[source]
classmethod get_out_data_from_opts(num_pieces, sources, name, out_dim=None, **kwargs)[source]
Parameters:
  • num_pieces (int)

  • sources (list[LayerBase])

  • name (str)

  • out_dim (Dim|None)

Return type:

Data
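
Maxout via ReduceOutLayer can be sketched on a single feature vector. The code below is illustrative only: reduce_out is a hypothetical helper, and it assumes that consecutive features form one piece, which is one plausible reading of the split described above.

```python
def reduce_out(features, num_pieces, mode="max"):
    """Group consecutive features into pieces of num_pieces and reduce each."""
    if len(features) % num_pieces != 0:
        raise ValueError("feature dim must be a multiple of num_pieces")
    reducers = {
        "max": max,
        "sum": sum,
        "mean": lambda xs: sum(xs) / len(xs),
    }
    reduce_fn = reducers[mode]
    return [
        reduce_fn(features[i:i + num_pieces])
        for i in range(0, len(features), num_pieces)
    ]


maxout = reduce_out([1, 5, 2, 4, 9, 0], num_pieces=2, mode="max")
```

The output has input.dim // num_pieces entries, here 6 // 2 = 3.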

class returnn.tf.layers.basic.SqueezeLayer(axis, enforce_batch_dim_axis=None, allow_no_op=False, **kwargs)[source]

Removes an axis with dimension 1. This is basically a wrapper around tf.squeeze.

Parameters:
  • axis (Dim|int|list[int]|str) – one or multiple axes to squeeze. Integer axes are counted including the batch dim, which by default is axis 0 (see enforce_batch_dim_axis). It also accepts the special tokens “B”|“batch”, “spatial”, “spatial_except_time”, or “F”|“feature”.

  • enforce_batch_dim_axis (int|None)

  • allow_no_op (bool)

layer_class: Optional[str] = 'squeeze'[source]
classmethod get_out_data_from_opts(axis, enforce_batch_dim_axis=None, allow_no_op=False, sources=(), **kwargs)[source]
Parameters:
  • axis (Dim|int|list[int]|str)

  • enforce_batch_dim_axis (int|None)

  • allow_no_op (bool)

  • sources (list[LayerBase])

Return type:

Data

class returnn.tf.layers.basic.StackLayer(axis=None, out_spatial_dim=None, **kwargs)[source]

Stacks multiple inputs together using tf.stack(). This creates a new dimension for the stack.

For concatenation (in feature dimension), see CopyLayer.

Parameters:
  • axis (int|None) – new axis. If not given, will use Data.get_default_new_axis_for_dim_tag(<spatial>), i.e. some reasonable default for a new spatial axis.

  • out_spatial_dim (Dim|None)

layer_class: Optional[str] = 'stack'[source]
classmethod get_out_data_from_opts(name, sources, axis=None, out_spatial_dim=None, **kwargs)[source]
Parameters:
  • name (str)

  • sources (list[LayerBase])

  • axis (int|None)

  • out_spatial_dim (Dim|None)

Return type:

Data

class returnn.tf.layers.basic.WeightedSumLayer(axes, padding=None, size=None, keep_dims=None, **kwargs)[source]

Calculates a weighted sum, either over a complete axis of fixed dimension, or over some window. Can also do that for multiple axes. The weights are a trainable parameter matrix. A similar result can be achieved with ElemwiseProdLayer followed by ReduceLayer, or with a DotLayer together with a VariableLayer. See also LinearLayer.

Parameters:
  • axes (str|list[str]) – the axes to do the weighted-sum over

  • padding (str) – “valid” or “same”, in case of keep_dims=True

  • size (None|tuple[int]) – the kernel size. If omitted, the axes must be of fixed dimension, and keep_dims=False, padding=“valid” are used by default. If given, you must also provide padding, and keep_dims=True is the default.

  • keep_dims (bool) – if False, the axes will be squeezed away. see also size.

layer_class: Optional[str] = 'weighted_sum'[source]
classmethod get_out_data_from_opts(name, sources, axes, padding=None, size=None, keep_dims=None, **kwargs)[source]
Parameters:
  • name (str)

  • sources (list[LayerBase])

  • axes (str|list[str])

  • padding (str|None)

  • size (None|tuple[int])

  • keep_dims (bool|None)

Return type:

Data

class returnn.tf.layers.basic.ElemwiseProdLayer(axes, size=None, **kwargs)[source]

Element-wise product in some axes. Microsoft calls this “static attention”, in Deep Conv. NN with Layer-wise Context Expansion and Attention (LACE). The matrix/tensor to be used for the product is given as a trainable parameter. See also LinearLayer.

Parameters:
  • axes (str|list[str]) – e.g. “spatial”, but all those axes must be of fixed dimension

  • size (tuple[int]) – for double-checking, you can explicitly provide the size

layer_class: Optional[str] = 'elemwise_prod'[source]
classmethod get_out_data_from_opts(name, sources, **kwargs)[source]
Parameters:
Return type:

Data

class returnn.tf.layers.basic.PrefixInTimeLayer(axis='T', out_dim=None, prefix=0.0, repeat=1, size_base=None, **kwargs)[source]

Adds some prefix in the time dimension. This is roughly the reverse of what SliceNdLayer does. Also see PadLayer for static dimensions. Also see PostfixInTimeLayer.

Parameters:
  • axis (Dim|str)

  • out_dim (Dim|None)

  • prefix (float|str) – either some constant or another layer

  • repeat (int|LayerBase) – how often to repeat the prefix

  • size_base (LayerBase|None) – copy seq-lens from here

layer_class: Optional[str] = 'prefix_in_time'[source]
recurrent = True[source]
get_dep_layers()[source]
Return type:

list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace

  • network (returnn.tf.network.TFNetwork)

  • get_layer (((str) -> LayerBase)) – function to get or construct another layer

classmethod get_out_data_from_opts(name, sources, axis='T', out_dim=None, size_base=None, repeat=1, **kwargs)[source]
Parameters:
Return type:

Data
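
The effect of adding a prefix on padded sequences and their lengths can be sketched as follows. This is plain Python for illustration, not the layer code: prefix_in_time is a hypothetical helper, and only a constant prefix and an integer repeat are handled.

```python
def prefix_in_time(batch, seq_lens, prefix=0.0, repeat=1):
    """Prepend `repeat` copies of a constant to every sequence.

    Returns the new padded batch and the new sequence lengths.
    """
    new_batch = [[prefix] * repeat + list(seq) for seq in batch]
    new_lens = [n + repeat for n in seq_lens]
    return new_batch, new_lens


new_batch, new_lens = prefix_in_time(
    [[1.0, 2.0, 3.0], [4.0, 0.0, 0.0]], seq_lens=[3, 1], prefix=-1.0, repeat=2
)
```

Because the prefix is inserted at the front, the padding stays at the end and every sequence length grows by repeat.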

class returnn.tf.layers.basic.PostfixInTimeLayer(axis='T', out_dim=None, postfix=0.0, repeat=1, **kwargs)[source]

Adds some postfix in time dimension. Also see PrefixInTimeLayer.

Parameters:
  • axis (Dim|str)

  • out_dim (Dim|None)

  • postfix (float|int|LayerBase) – constant or other layer without time axis to use as postfix

  • repeat (int) – how often to repeat the postfix

layer_class: Optional[str] = 'postfix_in_time'[source]
recurrent = True[source]
classmethod get_out_data_from_opts(name, sources, axis='T', out_dim=None, postfix=0.0, repeat=1, **kwargs)[source]
Parameters:
  • axis (Dim|str)

  • out_dim (Dim|None)

  • name (str)

  • sources (list[LayerBase])

  • postfix (float|int|LayerBase) – constant or other layer without time axis to use as postfix

  • repeat (int)

Return type:

Data

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
get_dep_layers()[source]
Return type:

list[LayerBase]

class returnn.tf.layers.basic.TimeChunkingLayer(chunk_size, chunk_step, axis='T', out_dim=None, **kwargs)[source]

Performs chunking in time. See returnn.tf.native_op.chunk(). See also WindowLayer and TimeUnChunkingLayer. It is very similar to WindowLayer, but this case is more optimized, and it also modifies the batch dim. The output is of shape (chunk_size, n_batch * n_chunks, …).

Parameters:
  • chunk_size (int) – chunk size or window size

  • chunk_step (int) – chunk step or striding

  • axis (Dim|str)

  • out_dim (Dim|None)

layer_class: Optional[str] = 'time_chunking'[source]
recurrent = True[source]
classmethod get_out_data_from_opts(name, sources, axis='T', out_dim=None, **kwargs)[source]
Parameters:
Return type:

Data

class returnn.tf.layers.basic.TimeUnChunkingLayer(chunking_layer, **kwargs)[source]

Undoes the chunking in time performed by TimeChunkingLayer. See returnn.tf.native_op.chunk() and TimeChunkingLayer.

Parameters:

chunking_layer (TimeChunkingLayer)

layer_class: Optional[str] = 'time_unchunking'[source]
recurrent = True[source]
get_dep_layers()[source]
Return type:

list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
classmethod get_out_data_from_opts(name, sources, chunking_layer, **kwargs)[source]
Parameters:
Return type:

Data
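
Chunking and unchunking are usually used as a pair around some per-chunk computation. The fragment below is a sketch of such a network dict: the layer names, sizes, and the intermediate “linear” layer are illustrative assumptions, and chunking_layer is given as a layer name on the assumption that transform_config_dict resolves it.

```python
network = {
    "chunked": {
        "class": "time_chunking",
        "from": "data",
        "chunk_size": 50,
        "chunk_step": 25,  # consecutive chunks overlap by 25 frames
    },
    "hidden": {
        "class": "linear",
        "from": "chunked",
        "activation": "tanh",
        "n_out": 128,
    },
    "unchunked": {
        "class": "time_unchunking",
        "from": "hidden",
        "chunking_layer": "chunked",
    },
}
```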

class returnn.tf.layers.basic.DotLayer(reduce=<class 'returnn.util.basic.NotSpecified'>, red1=<class 'returnn.util.basic.NotSpecified'>, red2=<class 'returnn.util.basic.NotSpecified'>, var1=<class 'returnn.util.basic.NotSpecified'>, var2=<class 'returnn.util.basic.NotSpecified'>, add_var2_if_empty=<class 'returnn.util.basic.NotSpecified'>, use_mask: bool = True, debug=False, **kwargs)[source]

This performs a dot-product of two sources. The underlying matmul expects shapes (shared…, I, J) * (shared…, J, K) -> (shared…, I, K). We say that J is the axis to be reduced, I is the var-dim of source 1, and K is the var-dim of source 2. I, J, K can also be multiple axes from the sources. The var-dims don’t need to exist. All other axes (shared…) are expected to match.

You should try to avoid having the same dims in both sources when they are not reduced such that you would end up having some dim twice in the output, e.g. (shared…, I, I). You should avoid this because the dim order should never matter (https://github.com/rwth-i6/returnn/wiki/RETURNN-principles). If you need to perform such an operation, you can use ReinterpretDataLayer to introduce a new dim tag.

The reduce dim can also be the sparse dim of one of the sources. In this case, it behaves like GatherLayer.

Parameters:
  • reduce (str|Dim|tuple[str|Dim]|list[str|Dim]) – reduce axes of both sources

  • red1 (str|Dim|tuple[str|Dim]|list[str|Dim]) – reduce axes of first source

  • red2 (str|Dim|tuple[str|Dim]|list[str|Dim]) – reduce axes of second source

  • var1 (str|Dim|tuple[str|Dim]|list[str|Dim]|None) – var axes of first source

  • var2 (str|Dim|tuple[str|Dim]|list[str|Dim]|None) – var axes of second source

  • add_var2_if_empty (bool) – if var2=None, add dim=1 at the end

  • use_mask – If the reduction is over dynamic axes, masking must be applied to one of the inputs to get the correct sum reduction. This is done automatically; set this flag to False to disable the masking.

  • debug (bool) – will print debug shapes, etc.

Earlier defaults:

red1=-1, red2=-2, var1=-2, var2=-1, add_var2_if_empty=True.

However, these defaults are problematic for multiple reasons, for example because they rely on integer axes.

See https://github.com/rwth-i6/returnn/issues/627 for details.

layer_class: Optional[str] = 'dot'[source]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace

  • network (returnn.tf.network.TFNetwork)

  • get_layer (((str) -> LayerBase)) – function to get or construct another layer

classmethod get_out_data_from_opts(name, sources, reduce=<class 'returnn.util.basic.NotSpecified'>, red1=<class 'returnn.util.basic.NotSpecified'>, red2=<class 'returnn.util.basic.NotSpecified'>, var1=<class 'returnn.util.basic.NotSpecified'>, var2=<class 'returnn.util.basic.NotSpecified'>, add_var2_if_empty=<class 'returnn.util.basic.NotSpecified'>, **kwargs)[source]
Parameters:
  • name (str)

  • sources (list[LayerBase])

  • reduce (str|Dim|tuple[str|Dim]|list[str|Dim]) – reduce axes of both sources

  • red1 (str|Dim|tuple[str|Dim]|list[str|Dim]) – reduce axes of first source

  • red2 (str|Dim|tuple[str|Dim]|list[str|Dim]) – reduce axes of second source

  • var1 (str|Dim|tuple[str|Dim]|list[str|Dim]|None) – var axes of first source

  • var2 (str|Dim|tuple[str|Dim]|list[str|Dim]|None) – var axes of second source

  • add_var2_if_empty (bool)

Return type:

Data
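
The shape rule (shared…, I, J) * (shared…, J, K) -> (shared…, I, K) can be expressed on dim names. The sketch below only computes which dims end up in the output; dot_out_dims is a hypothetical helper, dims are plain strings instead of Dim objects, and the var dims are inferred as the dims unique to each source.

```python
def dot_out_dims(dims1, dims2, reduce):
    """Return the output dim names for a dot product of two sources.

    dims1, dims2: dim names of source 1 and source 2.
    reduce: dim names present in both sources that are summed over (J).
    """
    reduce = set(reduce)
    missing = sorted(d for d in reduce if d not in dims1 or d not in dims2)
    if missing:
        raise ValueError("reduce dims must exist in both sources: %r" % missing)
    shared = [d for d in dims1 if d in dims2 and d not in reduce]
    var1 = [d for d in dims1 if d not in dims2]  # I
    var2 = [d for d in dims2 if d not in dims1]  # K
    return shared + var1 + var2


# Attention energies: reduce over the key dim, keep both time dims.
energy_dims = dot_out_dims(
    ["B", "enc_T", "key"], ["B", "dec_T", "key"], reduce=["key"]
)
```

If both sources used the same time dim here, it would count as shared instead of appearing twice, which is why a new dim tag (e.g. via ReinterpretDataLayer) is needed for the (shared…, I, I) case.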

class returnn.tf.layers.basic.ShiftAxisLayer(axis, amount, pad=True, pad_value=0, adjust_size_info=True, **kwargs)[source]

Shifts the dimensions in an axis around by slicing and optional padding. This layer may change the axis-dimension.

The name might be confusing: no axis is moved here. Instead, the elements along the given axis are shifted. See SwapAxesLayer for moving axes.

Also see SliceLayer.

Parameters:
  • axis (str|Dim|int) – single axis to shift

  • amount (int) – number of elements to shift (<0 for left-shift, >0 for right-shift)

  • pad (bool) – preserve shape by padding

  • pad_value (int|float|bool) – padding value

  • adjust_size_info (bool) – whether to adjust the size_placeholder

layer_class: Optional[str] = 'shift_axis'[source]
classmethod get_out_data_from_opts(name, sources, amount, axis, pad=True, adjust_size_info=True, **kwargs)[source]
Parameters:
  • name (str)

  • sources (list[LayerBase])

  • amount (int)

  • axis (str)

  • pad (bool)

  • adjust_size_info (bool)

Return type:

Data
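
The slicing and padding behaviour can be sketched on a single list. This is an illustration under assumptions, not the layer code: shift_axis is a hypothetical helper, and it assumes that with pad=False the axis simply becomes shorter by the shifted amount.

```python
def shift_axis(values, amount, pad=True, pad_value=0):
    """Shift the elements of one axis; amount < 0 shifts left, > 0 shifts right."""
    values = list(values)
    n = min(abs(amount), len(values))
    if amount > 0:
        kept = values[:len(values) - n]  # drop the last n elements
        return [pad_value] * n + kept if pad else kept
    if amount < 0:
        kept = values[n:]  # drop the first n elements
        return kept + [pad_value] * n if pad else kept
    return values


right = shift_axis([1, 2, 3, 4], amount=1)
left_no_pad = shift_axis([1, 2, 3, 4], amount=-1, pad=False)
```

With pad=True the axis dimension is preserved; with pad=False it shrinks, which is the case where the axis dimension changes.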

class returnn.tf.layers.basic.ResizeLayer(factor, axis, out_dim=None, kind='nn', fill_value=None, fill_dropout=None, **kwargs)[source]

Resizes the input, i.e. upsampling or downsampling. Supports different kinds, such as linear interpolation or nearest-neighbor.

Parameters:
  • factor (int|float|LayerBase) – out_len = in_len * factor

  • axis (Dim|str) – the axis to resize

  • out_dim (Dim|None)

  • kind (str) – “linear”, “nn”/”nearest_neighbor”, “cubic”, “fill”

  • fill_value (None|int|float) – if kind==”fill”

  • fill_dropout (float|None) – if set, will dropout in the same axis

layer_class: Optional[str] = 'resize'[source]
get_dep_layers()[source]
Return type:

list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
classmethod get_out_data_from_opts(factor, axis, sources, name, fill_dropout=None, out_dim=None, **kwargs)[source]
Parameters:
  • factor (int|float|LayerBase)

  • axis (Dim|str)

  • sources (list[LayerBase])

  • name (str)

  • fill_dropout (float|None)

  • out_dim (Dim|None)

Return type:

Data

class returnn.tf.layers.basic.CombineDimsLayer(**kwargs)[source]

Combines multiple dimensions. See also MergeDimsLayer. This is deprecated in favor of MergeDimsLayer.

Parameters:

axes (int|list[int]|str) – one axis or multiple axes to reduce. this is counted with batch-dim, which by default is axis 0 (see enforce_batch_dim_axis). it also accepts the special tokens “B”|“batch”, “spatial”, “spatial_except_time”, or “F”|“feature”

layer_class: Optional[str] = 'combine_dims'[source]
classmethod get_out_data_from_opts(**kwargs)[source]
Return type:

Data

class returnn.tf.layers.basic.RemoveLayer(symbol, axis='T', out_dim=None, **kwargs)[source]

Currently, assumes sparse data, and removes a specific symbol from the data.

It is recommended to use MaskedComputationLayer in combination with e.g. a CompareLayer instead, as this provides more flexibility.

Parameters:
  • symbol (int)

  • axis (Dim|str) – the axis to operate over, to potentially remove frames

  • out_dim (Dim|None) – derived from the dim of axis, the reduced new dim

layer_class: Optional[str] = 'remove'[source]
classmethod get_out_data_from_opts(name, sources, axis='T', out_dim=None, **kwargs)[source]
Parameters:
Return type:

Data

class returnn.tf.layers.basic.CombineLayer(kind, sources, allow_broadcast_all_sources=<class 'returnn.util.basic.NotSpecified'>, activation=None, with_bias=False, eval=None, eval_locals=None, eval_for_output_loss=False, **kwargs)[source]

Applies a binary operation, such as addition, to all sources while accumulating the partial results. In the first step, the binary operation is performed on the first two sources. After the first step, the previous result is always the left-hand operand.

Its basic working is similar to the reduce function used in functional programming. Also see ActivationLayer, or CompareLayer.
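
The accumulation order matters for non-commutative kinds such as sub. The following sketch uses functools.reduce on plain Python numbers as a stand-in for the source tensors; the layer names "a", "b" and "c" in the net dict entry are assumptions.

```python
import functools
import operator

# Stand-ins for the values of three sources, in the order they are listed.
source_values = [10.0, 3.0, 2.0]

# Left fold: the previous partial result is always the left-hand operand,
# so this computes (10 - 3) - 2.
combined = functools.reduce(operator.sub, source_values)

# Corresponding net dict entry (source names are assumed).
combine_layer = {"class": "combine", "kind": "sub", "from": ["a", "b", "c"]}
```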

Parameters:
  • kind (str) – currently accepted values are average, add, sub, mul, truediv, floordiv, mod, pow, maximum, minimum, logical_and, logical_or, squared_difference, or eval, or any function in the tf.math or tf namespace.

  • sources (list[LayerBase])

  • allow_broadcast_all_sources (bool|NotSpecified) – allow broadcasting for all sources. e.g. shape [A] + [B] -> shape [A,B]. by default disabled, and there must be some source with all dims.

  • activation (str|None) – if provided, activation function to apply, e.g. “tanh” or “relu”

  • with_bias (bool) – if given, will add a trainable bias tensor

  • eval (str|callable) – for kind=“eval”, will eval this string, or call this function. see _op_kind_eval()

  • eval_locals (dict[str]|None) – locals for eval

  • eval_for_output_loss (bool) – will do the same eval on layer.output_loss

layer_class: Optional[str] = 'combine'[source]
recurrent = True[source]
classmethod get_out_data_from_opts(network, sources, eval_locals=None, n_out=<class 'returnn.util.basic.NotSpecified'>, out_type=None, allow_broadcast_all_sources=<class 'returnn.util.basic.NotSpecified'>, out_shape=None, **kwargs)[source]
Parameters:
  • network (returnn.tf.network.TFNetwork)

  • sources (list[LayerBase])

  • eval_locals (dict[str]|None) – locals for eval, will also pass to out_type if out_type is a function

  • n_out (int|None|NotSpecified)

  • allow_broadcast_all_sources (bool|NotSpecified)

  • out_type (dict[str]|None|(()->Data))

  • out_shape (set[Dim|_MarkedDim]|tuple|list|None) – verifies the output shape (dim tags)

Return type:

Data

class returnn.tf.layers.basic.EvalLayer(eval, **kwargs)[source]

Evaluates some string. The CombineLayer provides this functionality, thus this is just a special case of it. Also see ActivationLayer, or CompareLayer.

The output type is defined as a broadcasted extension of all sources. You can overwrite it by (partially) specifying out_type. out_type can also be a generic Python function, returning a Data instance.

Parameters:

eval (str) – will eval this string. see _op_kind_eval()

layer_class: Optional[str] = 'eval'[source]
class returnn.tf.layers.basic.CompareLayer(kind='equal', value=None, allow_broadcast_all_sources=<class 'returnn.util.basic.NotSpecified'>, **kwargs)[source]

Compares element-wise the tokens of all input sequences among themselves and/or with a specified given value. The comparisons are performed in a chain according to the order in which they are listed.

Example:

{"class": "compare", "from": ["i1", "i2"], "value": val, "kind": "less"}

computes i1 < i2 < val and it is true only if the whole chain of operations is true. The final result is the logical “and” of all comparisons. Note that value is the last element to be compared to.
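
The chained comparison can be sketched in pure Python as follows. The helper compare_chain is hypothetical and works on plain lists instead of tensors; it only illustrates the described semantics (adjacent comparisons, value last, combined with logical “and”).

```python
import operator


def compare_chain(sources, value=None, op=operator.lt):
    """Hypothetical helper: element-wise chained comparison over equal-length lists."""
    results = []
    for elems in zip(*sources):
        # value, if given, is the last element of the chain
        operands = list(elems) + ([value] if value is not None else [])
        # compare each operand with its right neighbor and "and" the results
        results.append(all(op(a, b) for a, b in zip(operands, operands[1:])))
    return results


i1 = [1, 5, 2]
i2 = [3, 4, 9]
less_chain = compare_chain([i1, i2], value=8)                 # i1 < i2 < 8
is_eos = compare_chain([[7, 3]], value=3, op=operator.eq)     # source == 3
```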

A common example usage is the end layer in a rec subnetwork to specify the stopping criterion, e.g. the last generated token is equal to the end-of-sentence token:

"output": {"class": "rec", "from": [], "unit": {
    .
    .
    .
    "end": {"class": "compare", "from": "output", "value": end_of_sentence_id}
}, "target": "classes0"}
Parameters:
  • kind (str) – which comparison operation to use, e.g. “equal”, “greater”, “less” or other supported TF comparison ops

  • value (float|int|None) – if specified, will also compare to this

  • allow_broadcast_all_sources (bool|NotSpecified) – allow broadcasting for all sources. e.g. shape [A] + [B] -> shape [A,B]. by default disabled, and there must be some source with all dims.

layer_class: Optional[str] = 'compare'[source]
classmethod get_out_data_from_opts(sources, allow_broadcast_all_sources=<class 'returnn.util.basic.NotSpecified'>, n_out=<class 'returnn.util.basic.NotSpecified'>, out_type=None, out_shape=None, **kwargs)[source]
Parameters:
Return type:

Data

class returnn.tf.layers.basic.SwitchLayer(condition, true_from, false_from, **kwargs)[source]

Wrapper around tf.where() (or more generically returnn.tf.util.basic.where_bc()), or statically choose a single source if the condition is a callable (…)->bool. (tf.cond is not useful here, as the sources would have been already constructed and computed.)

This layer is also useful for applying any kind of generic masking to the frames. E.g. one could have a layer called “mask” computing a boolean mask for the values stored in another layer “input”. Then use this layer with condition=”mask”, true_from=”input”, false_from=mask_value, to mask out all frames where the mask is false with the mask_value.
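
A minimal sketch of the masking use case, on plain Python lists instead of tensors. The helper switch_1d is hypothetical; the layer names “mask” and “input” are the ones from the paragraph above, and 0.0 is an arbitrary choice for mask_value.

```python
def switch_1d(condition, true_from, false_from):
    """Hypothetical helper: element-wise selection with a scalar fallback value."""
    return [t if c else false_from for c, t in zip(condition, true_from)]


mask = [True, False, True]
frames = [1.0, 2.0, 3.0]
masked = switch_1d(mask, frames, 0.0)

# Corresponding net dict entry, following the description above.
switch_layer = {"class": "switch", "condition": "mask", "true_from": "input", "false_from": 0.0}
```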

See also CondLayer. See also SeqLenMaskLayer if you just want to mask using the sequence lengths.

Parameters:
  • condition (LayerBase|bool) – if callable, expected to be (…)->bool, and called in transform_config_dict

  • true_from (LayerBase|float|int|None)

  • false_from (LayerBase|float|int|None)

layer_class: Optional[str] = 'switch'[source]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace

  • network (returnn.tf.network.TFNetwork)

  • get_layer (((str) -> LayerBase)) – function to get or construct another layer

classmethod get_out_data_from_opts(name, condition, true_from, false_from, **kwargs)[source]
Parameters:
Return type:

Data

get_dep_layers()[source]
Return type:

list[LayerBase]

class returnn.tf.layers.basic.CondLayer(condition, true_layer, false_layer, _condition_network=None, _true_layer_network=None, _false_layer_network=None, _extra_out=None, **kwargs)[source]

See also SwitchLayer, which uses tf.where(). Here, we use tf.cond instead. I.e. the condition has to be a scalar bool, and only the corresponding true/false branch is computed.

true_layer/false_layer are layer dicts, which are in the same namescope as this layer, however, they are in the corresponding control flow context (tf.cond).

You can use SubnetworkLayer inside to embed any more complex logic.

There can be more than one output via sub-layers. Specifically, it will make all from get_available_sub_layer_names() available. In SubnetworkLayer, these are all the output layers in the sub-network.
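
An untested sketch of how such a layer could look in a net dict. The choice of a train_flag layer as the scalar bool condition, the activation layer in the true branch, the copy layer in the false branch, and the names "cond_out" and "data" are all assumptions made for illustration.

```python
network = {
    "cond_out": {
        "class": "cond",
        "from": [],
        # scalar bool condition, given as a layer dict (assumed: train flag)
        "condition": {"class": "train_flag"},
        # only one of the two branches is computed, inside tf.cond
        "true_layer": {"class": "activation", "activation": "relu", "from": "data"},
        "false_layer": {"class": "copy", "from": "data"},
    },
}
```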

Parameters:
  • condition (LayerBase|dict[str])

  • true_layer (LayerBase|dict[str])

  • false_layer (LayerBase|dict[str])

  • _extra_out (dict[str,(Data, type, dict[str])])

layer_class: Optional[str] = 'cond'[source]
recurrent = True[source]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
classmethod get_out_data_from_opts(true_layer, false_layer, name, network, **kwargs)[source]
Parameters:
Return type:

Data

get_sub_layer(layer_name)[source]
Parameters:

layer_name (str)

Return type:

LayerBase|None

classmethod get_available_sub_layer_names(parent_layer_kwargs)[source]
Parameters:

parent_layer_kwargs (dict[str])

Return type:

list[str]

classmethod get_sub_layer_out_data_from_opts(layer_name, parent_layer_kwargs)[source]
Parameters:
  • layer_name (str) – name of the sub_layer (right part of ‘/’ separated path)

  • parent_layer_kwargs (dict[str]) – kwargs for the parent layer (as kwargs in cls.get_out_data_from_opts())

Returns:

Data template, class type of sub-layer, layer opts (transformed)

Return type:

(Data, type, dict[str])|None

get_sub_layers()[source]
Return type:

list[LayerBase]

class returnn.tf.layers.basic.TopKLayer(axis, k, k_dim=None, sorted=True, **kwargs)[source]

Basically wraps tf.nn.top_k.

Directly returns the top_k values. The indices are accessible via the “indices” sub-layer.

For an input [B,D] with axis=D, the output and indices values are shape [B,K].

It’s somewhat similar to ReduceLayer with max and argmax. The axis dim is reduced and then a new dim for K is added.

Axis can also cover multiple axes, such as [beam,classes]. In that case, there is not a single “indices” sub-layer, but sub-layers “indices0” .. “indices{N-1}” corresponding to each axis, in the same order.

All other axes are treated as batch dims.
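
For the single-axis case with input [B,D], the relation between the values and the “indices” sub-layer can be sketched in pure Python. The helper top_k_row is hypothetical and assumes sorted=True (largest value first).

```python
def top_k_row(row, k):
    """Hypothetical helper: top-k values and their indices for one batch entry."""
    order = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
    return [row[i] for i in order], order


batch = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.9]]  # shape [B=2, D=3]
pairs = [top_k_row(row, k=2) for row in batch]
values = [v for v, _ in pairs]    # shape [B, K], the main layer output
indices = [i for _, i in pairs]   # shape [B, K], the "indices" sub-layer
```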

Parameters:
  • axis (Dim|str|list[Dim|str]) – the axis to do the top_k on, which is reduced

  • k (int|LayerBase) – the “K” in “TopK”

  • k_dim (Dim|None) – the output dim tag corresponding to k

  • sorted (bool)

layer_class: Optional[str] = 'top_k'[source]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
classmethod get_out_data_from_opts(name, network, sources, axis, k, k_dim, **kwargs)[source]
Parameters:
Return type:

Data

get_sub_layer(layer_name)[source]
Parameters:

layer_name (str) – sub layer name

Return type:

LayerBase|None

classmethod get_available_sub_layer_names(parent_layer_kwargs)[source]
Parameters:

parent_layer_kwargs (dict[str])

Return type:

list[str]

classmethod get_sub_layer_out_data_from_opts(layer_name, parent_layer_kwargs)[source]
Parameters:
  • layer_name (str) – sub layer name

  • parent_layer_kwargs (dict[str])

Returns:

Data template, class type of sub-layer, layer opts (transformed)

Return type:

(Data, type, dict[str])|None

class returnn.tf.layers.basic.SearchSortedLayer(sorted_sequence, values, axis='T', side='left', **kwargs)[source]

Basically wraps tf.searchsorted().

Takes a tensor sorted_sequence that is sorted along one axis, and a tensor values. Will compute an output tensor with the same axes as values, where each entry is the index of the value within the sorted sequence. All (batch) axes of sorted_sequence except for the axis it is sorted along must be present in values.

Parameters:
  • sorted_sequence (LayerBase)

  • values (LayerBase) – search values

  • axis (str) – the axis along which sorted_sequence is sorted

  • side (str) – “left” or “right”. When one of the values exactly matches an element of the sorted_sequence, whether to choose the lower or higher index.
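
The effect of side can be illustrated with Python's bisect module on a single 1-D sorted sequence, where bisect_left and bisect_right play the role of “left” and “right”. This is only a stand-in for the tensor operation; batch axes are ignored here.

```python
import bisect

sorted_sequence = [1, 3, 3, 5]
values = [0, 3, 4, 6]

# "left": on an exact match, the lowest possible index is chosen.
left = [bisect.bisect_left(sorted_sequence, v) for v in values]
# "right": on an exact match, the index after the last equal element is chosen.
right = [bisect.bisect_right(sorted_sequence, v) for v in values]
```

Only the entry for the value 3, which occurs in sorted_sequence, differs between the two results.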

layer_class: Optional[str] = 'search_sorted'[source]
recurrent = True[source]
get_dep_layers()[source]
Return type:

list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace

  • network (returnn.tf.network.TFNetwork)

  • get_layer (((str) -> LayerBase)) – function to get or construct another layer

classmethod get_out_data_from_opts(sorted_sequence, values, axis, name, network, **kwargs)[source]
Parameters:
Return type:

Data

class returnn.tf.layers.basic.SubnetworkLayer(subnetwork, _subnet, _output, concat_sources=True, load_on_init=None, dropout=0, dropout_noise_shape=None, _parent_layer_cache=None, _from=None, **kwargs)[source]

You can define a whole subnetwork as a single layer by this class.

The subnetwork will be specified by a dict[str,dict[str]], just like a normal network is specified in the config.

The "output" layer of the subnetwork will be the output of this subnetwork-layer.

With concat_sources=True (default), the input to this layer will be represented as "data:data" or simply "data" in the subnetwork.

With concat_sources=False, each input will be represented as "data:input_layer_name" and also as "data:0" to "data:<n-1>" (for n inputs) in the subnetwork. The first input will also be available simply as "data:data" or "data".
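
A small sketch of a net dict using this layer with the default concat_sources=True. The layer names "encoder" and "hidden", the use of linear layers, and the n_out values are assumptions chosen for illustration.

```python
network = {
    "encoder": {
        "class": "subnetwork",
        "from": "data",
        "subnetwork": {
            # inside the subnetwork, "data" refers to the input of the subnetwork layer
            "hidden": {"class": "linear", "activation": "tanh", "n_out": 128, "from": "data"},
            # the "output" layer defines the output of the whole subnetwork layer
            "output": {"class": "linear", "activation": None, "n_out": 64, "from": "hidden"},
        },
    },
}
```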

Parameters:
  • subnetwork (dict[str,dict]) – subnetwork as dict (JSON content). must have an “output” layer

  • concat_sources (bool) – if we concatenate all sources into one, like it is standard for most other layers

  • load_on_init (str|dict[str]|None) – if provided, for parameter initialization, we will load the given model file. see CustomCheckpointLoader.

  • dropout (float) – will be applied if train_flag is set

  • dropout_noise_shape (tuple|list|dict|None)

  • _parent_layer_cache (dict[str,LayerBase]|None)

  • _subnet (returnn.tf.network.Subnetwork)

  • _output (LayerBase)

layer_class: Optional[str] = 'subnetwork'[source]
recurrent = True[source]
update_params_from_subnet()[source]

Update self.params.

update_rec_vars_outputs()[source]

Update self.rec_vars_outputs.

update_load_on_init()[source]

Handle load_on_init.

classmethod get_out_data_from_opts(n_out=<class 'returnn.util.basic.NotSpecified'>, out_type=None, **kwargs)[source]
Parameters:
Return type:

Data

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
classmethod get_sub_layer_out_data_from_opts(layer_name, parent_layer_kwargs)[source]
Parameters:
  • layer_name (str) – name of the sub_layer (right part of ‘/’ separated path)

  • parent_layer_kwargs (dict[str]) – kwargs for the parent layer (as kwargs in cls.get_out_data_from_opts())

Returns:

Data template, class type of sub-layer, layer opts (transformed)

Return type:

(Data, type, dict[str])|None

classmethod cls_get_sub_network(name, network, layer_desc)[source]
Parameters:
Return type:

returnn.tf.network.Subnetwork|None

get_sub_layer(layer_name)[source]
Parameters:

layer_name (str) – name of the sub_layer (right part of ‘/’ separated path)

Returns:

the sub_layer addressed in layer_name or None if no sub_layer exists

Return type:

LayerBase|None

classmethod get_available_sub_layer_names(parent_layer_kwargs)[source]
Parameters:

parent_layer_kwargs (dict[str])

Return type:

list[str]

get_sub_networks()[source]
Return type:

list[returnn.tf.network.TFNetwork]

get_sub_layers()[source]
Return type:

list[LayerBase]

get_dep_layers()[source]
Returns:

list of layers this layer depends on. normally this is just self.sources but e.g. the attention layer in addition has a base, etc.

Return type:

list[LayerBase]

get_last_hidden_state(key)[source]
Parameters:

key (int|str|None) – also the special key “*”

Return type:

tf.Tensor|None

classmethod get_rec_initial_extra_outputs(batch_dim, rec_layer, encapsulate=False, **kwargs)[source]
Parameters:
Return type:

dict[str,tf.Tensor]

classmethod get_rec_initial_extra_outputs_shape_invariants(rec_layer, encapsulate=False, **kwargs)[source]
Parameters:
Returns:

optional shapes for the tensors by get_rec_initial_extra_outputs

Return type:

dict[str,tf.TensorShape]

class returnn.tf.layers.basic.TrainFlagLayer(**kwargs)[source]

Returns the train flag (bool scalar) of the current network.

Usually the arguments, when specified in the network dict, are going through transform_config_dict(), before they are passed to here. See TFNetwork.construct_from_dict().

Parameters:
  • name (str)

  • network (returnn.tf.network.TFNetwork)

  • output (Data) – Set a specific output instead of using get_out_data_from_opts()

  • n_out (NotSpecified|None|int) – output dim

  • out_dim (returnn.tensor.Dim|None) – output feature dim tag

  • out_type (dict[str]) – kwargs for Data class. more explicit than n_out.

  • out_shape (set[returnn.tensor.Dim|returnn.tf.util.data._MarkedDim]|tuple|list|None) – verifies the output shape (dim tags). See Data.verify_out_shape().

  • sources (list[LayerBase]) – via self.transform_config_dict()

  • in_dim (returnn.tensor.Dim|None) – input feature dim tag

  • target (str|list[str]|None) – if some loss is set, this is the target data-key, i.e. network.extern_data.get_data(target). alternatively, this also can be a layer name.

  • _target_layers (dict[str,LayerBase]|None) – if target.startswith(“layer:”), then this is target -> layer

  • size_target (str|None) – like target but this is only used to set our output size in case of training

  • loss (Loss|None) – via transform_config_dict(). Every layer can have one loss (of type Loss), or no loss. In the net dict, it is specified as a string. In TFNetwork, all losses from all layers will be collected. That is what TFUpdater.Updater will use for training.

  • reuse_params (ReuseParams|None) – if given, will optionally reuse the params. see self.var_creation_scope(). See also the name_scope option as an alternative.

  • name_scope (str|None) – If set, uses this custom (relative) name scope. If it starts with a “/”, it will be the absolute name scope. It should not end with a “/”. It can be empty, in which case it will not consume a new name scope. This can also be used for parameter sharing. The default is the layer name in most cases, but this logic is in get_absolute_name_scope_prefix() and TFNetwork.layer_creation_scope().

  • param_device (str|None) – e.g. “CPU”, etc. any valid name for tf.device. see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/util/device_name_utils.h

  • L2 (float|None) – for constraints

  • darc1 (float|None) – for constraints. see Generalization in Deep Learning, https://arxiv.org/abs/1710.05468

  • spatial_smoothing (float|None) – see returnn.tf.util.basic.spatial_smoothing_energy()

  • param_variational_noise (float|None) – adds variational noise to the params during training

  • param_dropout (float|None) – dropout on params (weight dropout) during training

  • param_dropout_min_ndim (int|None) – if param dropout is enabled, only use it for params whose ndim >= this. E.g. it might make sense to disable it for bias params or scalars, so set param_dropout_min_ndim=2.

  • updater_opts (dict[str]|None) – accepts similar opts as TFUpdater, e.g. “optimizer”, “learning_rate”, …

  • is_output_layer (bool|None) – triggers the construction of this layer in the root net. Inside a RecLayer, it triggers the explicit accumulation of all frames. Also see the need_last option.

  • only_on_eval (bool) – if True, this layer will only be calculated in eval

  • only_on_search (bool) – if True, this layer will only be calculated when search is done

  • copy_output_loss_from_source_idx (int|None) – if set, will copy output_loss from this source

  • batch_norm (bool|dict) – see self.batch_norm()

  • initial_output (str|float) – used for recurrent layer, see self.get_rec_initial_output()

  • state – explicitly defines the rec state. initial_state would define the initial state (in the first frame)

  • need_last (bool) – Inside RecLayer, make sure that we can access the last frame. Similar to is_output_layer, but this is specifically about the last frame, i.e. it does not trigger accumulation.

  • rec_previous_layer (LayerBase|None) – via the recurrent layer, layer (template) which represents the past of us. You would not explicitly set this in a config. This is set automatically and internally via RecLayer.

  • encapsulate (bool) –

    mostly relevant for SubnetworkLayer and similar: If True, all sub layers will be created,

    and covered in functions like get_rec_initial_extra_outputs(), and the logic in cls_get_sub_network() will not be used.

    If False, the logic in cls_get_sub_network() will be used.

  • collocate_with (list[str]|None) – in the rec layer, collocate with the specified other layers

  • trainable (bool) – whether the parameters of this layer will be trained. Default is True. However, if this is inside a subnetwork, all the parent layers must be set to trainable, otherwise the parameters will not be trainable.

  • custom_param_importer (str|callable|None) – used by set_param_values_by_dict()

  • register_as_extern_data (str|None) – registers output in network.extern_data

  • control_dependencies_on_output (None|((LayerBase)->list[tf.Operation])) – This is mostly to perform some checks after the layer output has been computed, before the layer output is used anywhere else. There is also the IdentityLayer with the option control_dependencies.

  • debug_print_layer_output (None|bool|dict[str]) – same as global config option but per layer

  • _name (str) – just for internal construction, should be the same as name

  • _network (returnn.tf.network.TFNetwork) – just for internal construction, should be the same as network

  • _src_common_search_choices (None|SearchChoices) – set via SearchChoices.translate_to_common_search_beam()

layer_class: Optional[str] = 'train_flag'[source]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
classmethod get_out_data_from_opts(name, **kwargs)[source]
Parameters:

name (str)

Return type:

Data

class returnn.tf.layers.basic.GlobalTrainStepLayer(**kwargs)[source]

Returns the global train step (int64 scalar).

Usually the arguments, when specified in the network dict, are going through transform_config_dict(), before they are passed to here. See TFNetwork.construct_from_dict().

Parameters:
  • name (str)

  • network (returnn.tf.network.TFNetwork)

  • output (Data) – Set a specific output instead of using get_out_data_from_opts()

  • n_out (NotSpecified|None|int) – output dim

  • out_dim (returnn.tensor.Dim|None) – output feature dim tag

  • out_type (dict[str]) – kwargs for Data class. more explicit than n_out.

  • out_shape (set[returnn.tensor.Dim|returnn.tf.util.data._MarkedDim]|tuple|list|None) – verifies the output shape (dim tags). See Data.verify_out_shape().

  • sources (list[LayerBase]) – via self.transform_config_dict()

  • in_dim (returnn.tensor.Dim|None) – input feature dim tag

  • target (str|list[str]|None) – if some loss is set, this is the target data-key, i.e. network.extern_data.get_data(target). alternatively, this also can be a layer name.

  • _target_layers (dict[str,LayerBase]|None) – if target.startswith(“layer:”), then this is target -> layer

  • size_target (str|None) – like target but this is only used to set our output size in case of training

  • loss (Loss|None) – via transform_config_dict(). Every layer can have one loss (of type Loss), or no loss. In the net dict, it is specified as a string. In TFNetwork, all losses from all layers will be collected. That is what TFUpdater.Updater will use for training.

  • reuse_params (ReuseParams|None) – if given, will optionally reuse the params. see self.var_creation_scope(). See also the name_scope option as an alternative.

  • name_scope (str|None) – If set, uses this custom (relative) name scope. If it starts with a “/”, it will be the absolute name scope. It should not end with a “/”. It can be empty, in which case it will not consume a new name scope. This can also be used for parameter sharing. The default is the layer name in most cases, but this logic is in get_absolute_name_scope_prefix() and TFNetwork.layer_creation_scope().

  • param_device (str|None) – e.g. “CPU”, etc. any valid name for tf.device. see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/util/device_name_utils.h

  • L2 (float|None) – for constraints

  • darc1 (float|None) – for constraints. see Generalization in Deep Learning, https://arxiv.org/abs/1710.05468

  • spatial_smoothing (float|None) – see returnn.tf.util.basic.spatial_smoothing_energy()

  • param_variational_noise (float|None) – adds variational noise to the params during training

  • param_dropout (float|None) – dropout on params (weight dropout) during training

  • param_dropout_min_ndim (int|None) – if param dropout is enabled, only use it for params whose ndim >= this. E.g. it might make sense to disable it for bias params or scalars, so set param_dropout_min_ndim=2.

  • updater_opts (dict[str]|None) – accepts similar opts as TFUpdater, e.g. “optimizer”, “learning_rate”, …

  • is_output_layer (bool|None) – triggers the construction of this layer in the root net. Inside a RecLayer, it triggers the explicit accumulation of all frames. Also see the need_last option.

  • only_on_eval (bool) – if True, this layer will only be calculated in eval

  • only_on_search (bool) – if True, this layer will only be calculated when search is done

  • copy_output_loss_from_source_idx (int|None) – if set, will copy output_loss from this source

  • batch_norm (bool|dict) – see self.batch_norm()

  • initial_output (str|float) – used for recurrent layer, see self.get_rec_initial_output()

  • state – explicitly defines the rec state. initial_state would define the initial state (in the first frame)

  • need_last (bool) – Inside RecLayer, make sure that we can access the last frame. Similar to is_output_layer, but this is specifically about the last frame, i.e. it does not trigger accumulation.

  • rec_previous_layer (LayerBase|None) – via the recurrent layer, layer (template) which represents the past of us. You would not explicitly set this in a config. This is set automatically and internally via RecLayer.

  • encapsulate (bool) –

    mostly relevant for SubnetworkLayer and similar: If True, all sub layers will be created,

    and covered in functions like get_rec_initial_extra_outputs(), and the logic in cls_get_sub_network() will not be used.

    If False, the logic in cls_get_sub_network() will be used.

  • collocate_with (list[str]|None) – in the rec layer, collocate with the specified other layers

  • trainable (bool) – whether the parameters of this layer will be trained. Default is True. However, if this is inside a subnetwork, all the parent layers must be set to trainable, otherwise the parameters will not be trainable.

  • custom_param_importer (str|callable|None) – used by set_param_values_by_dict()

  • register_as_extern_data (str|None) – registers output in network.extern_data

  • control_dependencies_on_output (None|((LayerBase)->list[tf.Operation])) – This is mostly to perform some checks after the layer output has been computed, before the layer output is used anywhere else. There is also the IdentityLayer with the option control_dependencies.

  • debug_print_layer_output (None|bool|dict[str]) – same as global config option but per layer

  • _name (str) – just for internal construction, should be the same as name

  • _network (returnn.tf.network.TFNetwork) – just for internal construction, should be the same as network

  • _src_common_search_choices (None|SearchChoices) – set via SearchChoices.translate_to_common_search_beam()

layer_class: Optional[str] = 'global_train_step'[source]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
classmethod get_out_data_from_opts(name, **kwargs)[source]
Parameters:

name (str)

Return type:

Data

class returnn.tf.layers.basic.AccumulateMeanLayer(exp_average, axes='bt', initial_value=None, is_prob_distribution=None, **kwargs)[source]

Accumulates the mean of the input during training (over batch-dim and time-dim by default). It’s similar to ReduceLayer.

Parameters:
  • exp_average (float) – momentum in exponential average calculation

  • axes (int|list[str]|str) – the axes to reduce. must contain batch and time.

  • initial_value (float) – how to initialize the variable which accumulates the mean

  • is_prob_distribution (bool) – if provided, better default for initial_value
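
A rough sketch of an exponential running mean over successive batches. It assumes that exp_average is the weight given to the mean of the current batch; the documentation above does not state the exact update rule, so the convention used by the layer may differ. The helper accumulate_mean is hypothetical.

```python
def accumulate_mean(batches, exp_average, initial_value=0.0):
    """Hypothetical helper: exponential running mean over per-batch means."""
    mean = initial_value
    for batch in batches:
        batch_mean = sum(batch) / len(batch)  # reduce over the batch/time entries
        # assumed update rule: move the accumulator towards the current batch mean
        mean += exp_average * (batch_mean - mean)
    return mean


running = accumulate_mean([[1.0, 3.0], [4.0, 4.0]], exp_average=0.5)
```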

layer_class: Optional[str] = 'accumulate_mean'[source]
classmethod get_out_data_from_opts(axes='bt', **kwargs)[source]
Parameters:

axes (str)

Return type:

Data

class returnn.tf.layers.basic.LossLayer(loss_, target_=None, use_error=False, **kwargs)[source]

This layer wraps a Loss calculation as a layer. I.e. the loss will be calculated and returned by the layer. But this loss will not be used as a loss by the updater. If you want to use it as a loss, you can use the AsIsLoss, i.e. write "loss": "as_is".

Note that the loss options for the wrapped loss need to be provided via loss_opts_, and it does not apply any reduce function.

Note

The LossLayer might be deprecated in the future in favor of implementing the losses as actual layers.

If you want to define a loss inside the network, it is recommended to define it explicitly. An example could be:

"se_loss": {"class": "eval", "eval": "(source(0) - source(1)) ** 2", "from": ["output", "data:classes"]}

Followed by an e.g. mean reduce if needed:

"mse_loss": {"class": "reduce", "mode": "mean", "axis": "F", "from": "se_loss"}

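The arithmetic of these two layers can be checked on plain Python lists. This sketch treats both sources as dense values of shape [B,F], which is an assumption made for illustration only.

```python
output = [[0.5, 1.0], [2.0, 0.0]]   # stand-in for source(0), shape [B, F]
target = [[1.0, 1.0], [0.0, 0.0]]   # stand-in for source(1), shape [B, F]

# "se_loss": element-wise squared difference, (source(0) - source(1)) ** 2
se_loss = [
    [(o - t) ** 2 for o, t in zip(o_row, t_row)]
    for o_row, t_row in zip(output, target)
]

# "mse_loss": mean reduce over the feature axis "F"
mse_loss = [sum(row) / len(row) for row in se_loss]
```
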
loss_ and related params have the postfix _ to distinguish them from the loss options, which are used by the network and updater for training. Some of these (e.g. loss_opts_) are handled in transform_config_dict().

Parameters:
  • loss_ (Loss)

  • target_ (LayerBase|None)

  • use_error (bool) – whether to output the loss error instead of the loss value

layer_class: Optional[str] = 'loss'[source]
recurrent = True[source]
get_sub_layer(layer_name)[source]
Parameters:

layer_name (str) – sub layer name

Return type:

LayerBase|None

classmethod get_available_sub_layer_names(parent_layer_kwargs)[source]
Parameters:

parent_layer_kwargs (dict[str])

Return type:

list[str]

classmethod get_sub_layer_out_data_from_opts(layer_name, parent_layer_kwargs)[source]
Parameters:
  • layer_name (str) – sub layer name

  • parent_layer_kwargs (dict[str])

Returns:

Data template, class type of sub-layer, layer opts (transformed)

Return type:

(Data, type, dict[str])|None

get_dep_layers()[source]
Return type:

list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
classmethod get_out_data_from_opts(name, sources, target_=None, **kwargs)[source]
Parameters:
Return type:

Data

class returnn.tf.layers.basic.ForcedAlignmentLayer(align_target, topology, input_type, blank_idx=-1, blank_included=False, **kwargs)[source]

Calculates a forced alignment, via Viterbi algorithm.

Parameters:
  • align_target (LayerBase)

  • topology (str) – e.g. “ctc” or “rna” (RNA is CTC without label loop)

  • input_type (str) – “log_prob” or “prob”

  • blank_idx (int) – vocab index of the blank symbol

  • blank_included (bool) – whether blank token of the align target is included in the vocabulary

layer_class: Optional[str] = 'forced_align'[source]
classmethod get_sub_layer_out_data_from_opts(layer_name, parent_layer_kwargs)[source]
Parameters:
  • layer_name (str) – sub layer name

  • parent_layer_kwargs (dict[str])

Returns:

Data template, class type of sub-layer, layer opts (transformed)

Return type:

(Data, type, dict[str])|None

get_sub_layer(layer_name)[source]
Parameters:

layer_name (str)

Return type:

LayerBase|None

classmethod get_available_sub_layer_names(parent_layer_kwargs)[source]
Parameters:

parent_layer_kwargs (dict[str])

Return type:

list[str]

get_dep_layers()[source]
Return type:

list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
classmethod get_out_data_from_opts(name, sources, **kwargs)[source]
Parameters:
Return type:

Data

class returnn.tf.layers.basic.SparseSoftmaxCrossEntropyWithLogitsLayer(logits, targets, axis=None, **kwargs)[source]

This is a simple wrapper for tf.nn.sparse_softmax_cross_entropy_with_logits.
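
For orientation, the quantity computed per frame by sparse softmax cross entropy is -log(softmax(logits)[target]). The following is a minimal pure-Python sketch of that formula, not the layer's implementation; the function name is made up for illustration.

```python
import math


def sparse_softmax_cross_entropy(logits, target):
    """Return -log(softmax(logits)[target]) for one frame (illustrative helper)."""
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_norm - logits[target]


loss = sparse_softmax_cross_entropy([2.0, 1.0, 0.1], target=0)
```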

Parameters:
layer_class: Optional[str] = 'sparse_softmax_cross_entropy_with_logits'[source]
get_dep_layers()[source]
Return type:

list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
classmethod get_out_data_from_opts(name, logits, axis=None, **kwargs)[source]
Parameters:
  • name (str)

  • logits (LayerBase)

  • axis (Dim|str|None) – feature dim by default

class returnn.tf.layers.basic.CtcLossLayer(logits, targets, blank_index=-1, max_approx=False, **kwargs)[source]

Calculates the CTC loss.

Internally, this uses returnn.tf.native_op.ctc_loss() which is equivalent to tf.nn.ctc_loss but more efficient.

Output is of shape [B].

Parameters:
  • logits (LayerBase) – (before softmax). shape [B,T,D]

  • targets (LayerBase) – sparse. shape [B,T]

  • blank_index (int) – vocab index of the blank symbol

  • max_approx (bool) – if True, use max instead of sum over alignments (max approx, Viterbi)
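
To illustrate the difference between the full sum and the max approximation, here is a small pure-Python sketch of the CTC forward recursion for a single sequence. It works in probability space on already-normalized inputs and is not the native op used by the layer; the function name and its argument layout are assumptions made for this example.

```python
import math


def ctc_neg_log_likelihood(probs, targets, blank, max_approx=False):
    """CTC loss for one sequence (illustrative sketch).

    probs: list over time frames of per-label probabilities.
    targets: list of label indices, without blanks.
    """
    # Sum over alignments by default, or keep only the best one (Viterbi).
    combine = max if max_approx else sum
    # Interleave blanks: [blank, l1, blank, l2, ..., blank].
    ext = [blank]
    for label in targets:
        ext += [label, blank]
    num_states = len(ext)
    # Initialization: a path starts in the first blank or in the first label.
    alpha = [0.0] * num_states
    alpha[0] = probs[0][ext[0]]
    if num_states > 1:
        alpha[1] = probs[0][ext[1]]
    for frame in probs[1:]:
        new_alpha = [0.0] * num_states
        for s in range(num_states):
            candidates = [alpha[s]]
            if s >= 1:
                candidates.append(alpha[s - 1])
            # Skipping a blank is only allowed between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                candidates.append(alpha[s - 2])
            new_alpha[s] = combine(candidates) * frame[ext[s]]
        alpha = new_alpha
    # A valid path ends in the last blank or in the last label.
    finals = [alpha[-1]] + ([alpha[-2]] if num_states > 1 else [])
    return -math.log(combine(finals))


uniform = [[0.5, 0.5], [0.5, 0.5]]  # two frames, labels {0, blank=1}
full_sum = ctc_neg_log_likelihood(uniform, [0], blank=1)
best_path = ctc_neg_log_likelihood(uniform, [0], blank=1, max_approx=True)
```

With two uniform frames and the single target label 0, three alignments are valid (0 0, 0 blank, blank 0), each with probability 0.25. The full sum therefore gives -log(0.75), while the max approximation gives -log(0.25).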

layer_class: Optional[str] = 'ctc_loss'[source]
recurrent = True[source]
get_dep_layers()[source]
Return type:

list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
classmethod get_out_data_from_opts(name, **kwargs)[source]
Parameters:

name (str)

class returnn.tf.layers.basic.FastBaumWelchLayer(align_target, align_target_key=None, ctc_opts=None, sprint_opts=None, input_type='log_prob', tdp_scale=1.0, am_scale=1.0, min_prob=0.0, staircase_seq_len_source=None, **kwargs)[source]

Calls fast_baum_welch() or fast_baum_welch_by_sprint_automata(). We expect the input to be +log scores, e.g. from log-softmax.

Parameters:
  • align_target (str) – e.g. “sprint”, “ctc” or “staircase”

  • align_target_key (str|None) – e.g. “classes”, used for e.g. align_target “ctc”

  • ctc_opts (dict[str]) – used for align_target “ctc”

  • sprint_opts (dict[str]) – used for Sprint (RASR) for align_target “sprint”

  • input_type (str) – “log_prob” or “prob”

  • tdp_scale (float)

  • am_scale (float)

  • min_prob (float) – clips the minimum prob (value in [0,1])

  • staircase_seq_len_source (LayerBase|None)

layer_class: Optional[str] = 'fast_bw'[source]
recurrent = True[source]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
classmethod get_out_data_from_opts(name, sources, **kwargs)[source]
Parameters:
Return type:

Data

class returnn.tf.layers.basic.GradientLayer(y: LayerBase, x: LayerBase, **kwargs)[source]

Calculates the gradient of y w.r.t. x.

Parameters:
  • y

  • x

layer_class: Optional[str] = 'gradient'[source]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
classmethod get_out_data_from_opts(y: LayerBase, x: LayerBase, name: str, **kwargs)[source]
Parameters:
Return type:

Data

class returnn.tf.layers.basic.SyntheticGradientLayer(gradient, meta_loss_scale=1.0, **kwargs)[source]

This is a generalized way to replace the true gradient with any kind of predicted gradient. This makes it possible to implement the idea from:

Decoupled Neural Interfaces using Synthetic Gradients, https://arxiv.org/abs/1608.05343

Parameters:
  • gradient (LayerBase)

  • meta_loss_scale (float)

layer_class: Optional[str] = 'synthetic_gradient'[source]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
classmethod get_out_data_from_opts(sources, name, **kwargs)[source]
Parameters:
Return type:

Data

class returnn.tf.layers.basic.TikhonovRegularizationLayer(meta_loss_scale=1.0, **kwargs)[source]

Adds the Tikhonov regularization as a meta-loss (see returnn.tf.util.basic.MetaLosses).

Parameters:

meta_loss_scale (float)

layer_class: Optional[str] = 'tikhonov_regularization'[source]
class returnn.tf.layers.basic.FramewiseStatisticsLayer(sil_label_idx, histogram_num_bins=20, **kwargs)[source]

Collects various statistics (such as FER, etc) on the sources. The tensors will get stored in self.stats which will be collected by TFEngine.

Usually the arguments, when specified in the network dict, go through transform_config_dict() before they are passed here. See TFNetwork.construct_from_dict().

Parameters:
  • name (str)

  • network (returnn.tf.network.TFNetwork)

  • output (Data) – Set a specific output instead of using get_out_data_from_opts()

  • n_out (NotSpecified|None|int) – output dim

  • out_dim (returnn.tensor.Dim|None) – output feature dim tag

  • out_type (dict[str]) – kwargs for Data class. more explicit than n_out.

  • out_shape (set[returnn.tensor.Dim|returnn.tf.util.data._MarkedDim]|tuple|list|None) – verifies the output shape (dim tags). See Data.verify_out_shape().

  • sources (list[LayerBase]) – via self.transform_config_dict()

  • in_dim (returnn.tensor.Dim|None) – input feature dim tag

  • target (str|list[str]|None) – if some loss is set, this is the target data-key, i.e. network.extern_data.get_data(target). alternatively, this also can be a layer name.

  • _target_layers (dict[str,LayerBase]|None) – if target.startswith(“layer:”), then this is target -> layer

  • size_target (str|None) – like target but this is only used to set our output size in case of training

  • loss (Loss|None) – via transform_config_dict(). Every layer can have one loss (of type Loss), or no loss. In the net dict, it is specified as a string. In TFNetwork, all losses from all layers will be collected. That is what TFUpdater.Updater will use for training.

  • reuse_params (ReuseParams|None) – if given, will optionally reuse the params. See self.var_creation_scope(). See also the name_scope option as an alternative.

  • name_scope (str|None) – If set, uses this custom (relative) name scope. If it starts with a “/”, it will be the absolute name scope. It should not end with a “/”. It can be empty, in which case it will not consume a new name scope. This can also be used for parameter sharing. The default is the layer name in most cases, but this logic is in get_absolute_name_scope_prefix() and TFNetwork.layer_creation_scope().

  • param_device (str|None) – e.g. “CPU”, etc. any valid name for tf.device. see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/util/device_name_utils.h

  • L2 (float|None) – for constraints

  • darc1 (float|None) – for constraints. see Generalization in Deep Learning, https://arxiv.org/abs/1710.05468

  • spatial_smoothing (float|None) – see returnn.tf.util.basic.spatial_smoothing_energy()

  • param_variational_noise (float|None) – adds variational noise to the params during training

  • param_dropout (float|None) – dropout on params (weight dropout) during training

  • param_dropout_min_ndim (int|None) – if param dropout is enabled, only use it for params whose ndim >= this. E.g. it might make sense to disable it for bias params or scalars, so set param_dropout_min_ndim=2.

  • updater_opts (dict[str]|None) – accepts similar opts as TFUpdater, e.g. “optimizer”, “learning_rate”, …

  • is_output_layer (bool|None) – triggers the construction of this layer in the root net. Inside a RecLayer, it triggers the explicit accumulation of all frames. Also see the need_last option.

  • only_on_eval (bool) – if True, this layer will only be calculated in eval

  • only_on_search (bool) – if True, this layer will only be calculated when search is done

  • copy_output_loss_from_source_idx (int|None) – if set, will copy output_loss from this source

  • batch_norm (bool|dict) – see self.batch_norm()

  • initial_output (str|float) – used for recurrent layer, see self.get_rec_initial_output()

  • state – explicitly defines the rec state. initial_state would define the initial state (in the first frame)

  • need_last (bool) – Inside RecLayer, make sure that we can access the last frame. Similar to is_output_layer, but this is specifically about the last frame, i.e. it does not trigger accumulation.

  • rec_previous_layer (LayerBase|None) – via the recurrent layer, layer (template) which represents the past of us. You would not explicitly set this in a config. It is set automatically and internally via RecLayer.

  • encapsulate (bool) – mostly relevant for SubnetworkLayer and similar: If True, all sub layers will be created, and covered in functions like get_rec_initial_extra_outputs(), and the logic in cls_get_sub_network() will not be used. If False, the logic in cls_get_sub_network() will be used.

  • collocate_with (list[str]|None) – in the rec layer, collocate with the specified other layers

  • trainable (bool) – whether the parameters of this layer will be trained. Default is True. However, if this is inside a subnetwork, all the parent layers must be set to trainable, otherwise the parameters will not be trainable.

  • custom_param_importer (str|callable|None) – used by set_param_values_by_dict()

  • register_as_extern_data (str|None) – registers output in network.extern_data

  • control_dependencies_on_output (None|((LayerBase)->list[tf.Operation])) – This is mostly to perform some checks after the layer output has been computed, before the layer output is used anywhere else. There is also the IdentityLayer with the option control_dependencies.

  • debug_print_layer_output (None|bool|dict[str]) – same as global config option but per layer

  • _name (str) – just for internal construction, should be the same as name

  • _network (returnn.tf.network.TFNetwork) – just for internal construction, should be the same as network

  • _src_common_search_choices (None|SearchChoices) – set via SearchChoices.translate_to_common_search_beam()

layer_class: Optional[str] = 'framewise_statistics'[source]
classmethod get_out_data_from_opts(**kwargs)[source]
Return type:

Data

class returnn.tf.layers.basic.PrintLayer(summarize=99, extra_print_args=(), **kwargs)[source]

Prints the sources to console/log, via returnn.tf.util.basic.py_print().

Parameters:
  • summarize (int|None) – passed to py_print()

  • extra_print_args (list|tuple)

layer_class: Optional[str] = 'print'[source]
classmethod get_out_data_from_opts(name, sources, **kwargs)[source]
Parameters:
Return type:

Data

class returnn.tf.layers.basic.HDFDumpLayer(filename, extra=None, dump_whole_batches=False, labels=None, extend_existing_file=False, dump_per_run=False, **kwargs)[source]

Dumps into an HDF file, compatible with HDFDataset.

If there was no error, the HDF file is written to disk under the specified filename. By default this happens at graph reset, via TFNetwork.register_graph_reset_callback(). With dump_per_run, it happens after the dataset iteration run loop, via TFNetwork.register_run_finished_callback().

Common usage would be to add this to your network with “is_output_layer”: True, such that you don’t need to make other layers depend on it.
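
A minimal sketch of that usage, assuming a hypothetical layer named "output" whose values should be dumped and a made-up file name:

```python
# Hypothetical network-dict fragment; "output" and the file name are made up.
network = {
    "dump_output": {
        "class": "hdf_dump",            # layer_class of HDFDumpLayer
        "from": "output",               # assumed layer to dump
        "filename": "output_dump.hdf",  # assumed target path
        "is_output_layer": True,        # so no other layer needs to depend on it
    },
}
```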

It currently uses SimpleHDFWriter internally.

Parameters:
  • filename (str|(()->str))

  • extra (None|dict[str,LayerBase])

  • dump_whole_batches (bool) – dumps the whole batch as a single sequence into the HDF

  • labels (list[str]|None)

  • extend_existing_file (bool) – True also means we expect that it exists

  • dump_per_run (bool) – write via TFNetwork.register_run_finished_callback()

layer_class: Optional[str] = 'hdf_dump'[source]
classmethod get_out_data_from_opts(name, sources, **kwargs)[source]
Parameters:
Return type:

Data

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace

  • network (returnn.tf.network.TFNetwork)

  • get_layer (((str) -> LayerBase)) – function to get or construct another layer

class returnn.tf.layers.basic.ImageSummaryLayer(max_outputs=3, **kwargs)[source]

Creates image summaries which can be viewed in TensorBoard. This layer expects the source to be in (T-decoder, T-encoder, B, 1).

Parameters:

max_outputs – number of images to generate per step

layer_class: Optional[str] = 'image_summary'[source]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace, the loss_opts

  • network (returnn.tf.network.TFNetwork)

  • get_layer (((str) -> LayerBase)) – function to get or construct another layer

classmethod get_out_data_from_opts(**kwargs)[source]
Return type:

Data

class returnn.tf.layers.basic.CrossEntropyLoss(input_type='prob', focal_loss_factor=0.0, label_smoothing=0.0, label_smoothing_gaussian=False, debug_dump=False, safe_log_opts=None, use_fused=True, fake_upper_bound=None, **kwargs)[source]

Cross-Entropy loss. Basically -sum(target * log(output)).
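
As a worked example of the formula, the following pure-Python sketch computes the cross entropy of one frame with both target and output in probability space, using the conventional negative sign. The function name is made up for illustration, and options such as label smoothing or focal loss are not modeled.

```python
import math


def cross_entropy(target, output):
    """Return -sum_i target[i] * log(output[i]) for one frame (illustrative helper)."""
    # Terms with zero target probability contribute nothing and are skipped.
    return -sum(t * math.log(o) for t, o in zip(target, output) if t > 0.0)


one_hot = [0.0, 1.0, 0.0]
probs = [0.2, 0.7, 0.1]
ce = cross_entropy(one_hot, probs)  # equals -log(0.7) for a one-hot target
```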

Parameters:
class_name: str = 'ce'[source]
need_target = True[source]
get_output_target_scores()[source]
Returns:

shape (time_flat,), type float32, std-prob space

Return type:

tf.Tensor

get_value()[source]
Return type:

tf.Tensor

class returnn.tf.layers.basic.BinaryCrossEntropyLoss(pos_weight=None, **kwargs)[source]

Binary cross entropy. We expect the output as logits, not in probability space! Per frame: -mean(target * log(sigmoid(output)) + (1 - target) * log(1 - sigmoid(output)))
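
The per-frame formula can be checked numerically with a small sketch. The first function below evaluates the negated log-likelihood directly; the second uses the rearrangement max(x, 0) - x * z + log(1 + exp(-|x|)), which is the numerically stable form documented for tf.nn.sigmoid_cross_entropy_with_logits. Both helper names are made up, and pos_weight is not modeled here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def binary_ce_naive(logit, target):
    """Direct evaluation of the negated log-likelihood for one unit."""
    p = sigmoid(logit)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def binary_ce_stable(logit, target):
    """Same value, rearranged to avoid overflow for large |logit|."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


logits = [-2.0, 0.5, 3.0]
targets = [0.0, 1.0, 1.0]
# Mean over the units of one frame.
frame_loss = sum(binary_ce_stable(x, z) for x, z in zip(logits, targets)) / len(logits)
```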

Parameters:

pos_weight (float|None) – weight of positive labels, see tf.nn.weighted_cross_entropy_with_logits.

class_name: str = 'bin_ce'[source]
get_value()[source]
Return type:

tf.Tensor

get_error()[source]
Returns:

frame error rate as a scalar value with the default self.reduce_func (see also self.get_value)

Return type:

tf.Tensor

class returnn.tf.layers.basic.GenericCELoss(**kwargs)[source]

Some generalization of cross entropy.

Parameters:
  • base_network (returnn.tf.network.TFNetwork)

  • use_flatten_frames (bool) – will use returnn.tf.util.basic.flatten_with_seq_len_mask()

  • use_normalized_loss (bool) – the loss used in optimization will be normalized

  • custom_norm_factor (float|function|None) – The standard norm factor is 1/sum(target_seq_len) if the target has a time-axis, or 1/sum(output_seq_len) if there is no target and the output has a time-axis, or 1 otherwise. (See Loss.init() for details.) This is used for proper normalization of accumulated loss/error per epoch and also proper normalization per batch for reporting, no matter if use_normalized_loss is True or False. If you want to change this norm factor, you can set this. As a function, it takes (self=self, output=output, layer=layer) and returns a float scalar.

  • custom_inv_norm_factor (LayerBase|None) – inverse of custom_norm_factor. A layer can be passed here. Any shape is allowed; it will automatically be reduced via sum. So you could simply pass target_seq_len directly here. Basically, for all reporting, it uses sum(loss) / sum(custom_inv_norm_factor).

  • scale (float) – additional scale factor for the loss

  • _check_output_before_softmax (bool|None)
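
The default normalization described for custom_norm_factor can be illustrated with a few numbers. This sketch assumes the target has a time axis and reads custom_inv_norm_factor as the sum of the sequence lengths (the inverse of the norm factor); the variable names are made up.

```python
# Per-sequence, per-frame loss values of one batch (made-up numbers).
frame_losses = [[0.5, 0.25, 0.25], [1.0]]
target_seq_len = [len(seq) for seq in frame_losses]  # [3, 1]

total_loss = sum(sum(seq) for seq in frame_losses)   # 2.0

# Standard norm factor: 1 / sum(target_seq_len).
norm_factor = 1.0 / sum(target_seq_len)
reported = total_loss * norm_factor

# Same result when expressed via the inverse norm factor.
inv_norm_factor = sum(target_seq_len)
reported_via_inverse = total_loss / inv_norm_factor
```

Here the reported value is 2.0 / 4 = 0.5, i.e. the average loss per frame.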

class_name: str = 'generic_ce'[source]
get_value()[source]
Return type:

tf.Tensor

class returnn.tf.layers.basic.CtcLoss(target_collapse_repeated=False, auto_clip_target_len=False, output_in_log_space=False, beam_width=100, ctc_opts=None, use_native=False, use_viterbi=False, **kwargs)[source]

Connectionist Temporal Classification (CTC) loss. Basically a wrapper around tf.nn.ctc_loss.

Parameters:
  • target_collapse_repeated (bool) – like preprocess_collapse_repeated option for CTC. used for sparse_labels().

  • auto_clip_target_len (bool) – see self._get_target_sparse_labels().

  • output_in_log_space (bool) – False -> output expected in prob space. see self.get_output_logits

  • beam_width (int) – used in eval

  • ctc_opts (dict[str]|None) – other kwargs used for tf.nn.ctc_loss

  • use_native (bool) – use our native implementation (TFNativeOp.ctc_loss())

  • use_viterbi (bool) – instead of full-sum, use only best path (via ctc_loss_viterbi())
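
Since a loss is specified by its class_name string in the net dict, a layer using this loss might look roughly like the following sketch. The layer names "output" and "encoder" and the target key "classes" are hypothetical; the entries in loss_opts are option names from the list above with example values.

```python
# Hypothetical network-dict fragment; layer names and target key are made up.
network = {
    "output": {
        "class": "softmax",
        "from": "encoder",      # assumed source layer
        "target": "classes",    # assumed target data key
        "loss": "ctc",          # class_name of CtcLoss
        "loss_opts": {"beam_width": 1, "use_native": True},
    },
}
```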

class_name: str = 'ctc'[source]
recurrent = True[source]
init(**kwargs)[source]

See super.

get_output_logits()[source]
Returns:

outputs in log-space / logits

Return type:

tf.Tensor

get_soft_alignment()[source]

Also called the Baum-Welch-alignment. This is basically p_t(s|x_1^T,w_1^N), where s are the output labels (including blank), and w are the real target labels.

Returns:

shape (time, batch, dim)

Return type:

tf.Tensor

get_value()[source]
Return type:

tf.Tensor

get_error()[source]
Return type:

tf.Tensor

classmethod get_auto_output_layer_dim(target_dim)[source]
Parameters:

target_dim (returnn.tensor.Dim)

Return type:

returnn.tensor.Dim

class returnn.tf.layers.basic.EditDistanceLoss(debug_print=False, label_map=None, ctc_decode=False, output_in_log_space=False, **kwargs)[source]

Note that this loss is not differentiable, thus it’s only for keeping statistics.

Parameters:
  • debug_print (bool) – will tf.Print the sequence

  • label_map (dict[int,int]|None) – before calculating the edit-distance, will apply this map

  • ctc_decode (bool) – True -> expects dense output and does CTC decode, False -> expects sparse labels in output

  • output_in_log_space (bool) – False -> dense output expected in prob space. see self.get_output_logits
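
For reference, the edit distance between two label sequences is the minimum number of insertions, deletions and substitutions needed to turn one into the other. The sketch below is a plain dynamic-programming version, not the TensorFlow op used by the loss. How label_map treats labels missing from the map is an assumption here (they are kept unchanged).

```python
def edit_distance(hyp, ref, label_map=None):
    """Levenshtein distance between two label sequences (illustrative helper)."""
    if label_map is not None:
        # Assumption: labels not present in the map stay as they are.
        hyp = [label_map.get(x, x) for x in hyp]
        ref = [label_map.get(x, x) for x in ref]
    # prev[j] is the distance between the processed prefix of hyp and ref[:j].
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        cur = [i]
        for j, r in enumerate(ref, start=1):
            cur.append(min(
                prev[j] + 1,             # delete h
                cur[j - 1] + 1,          # insert r
                prev[j - 1] + (h != r),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


dist = edit_distance([1, 2, 3], [1, 3])                  # one deletion
dist_mapped = edit_distance([1, 2], [1, 3], {2: 3})      # identical after mapping
```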

class_name: str = 'edit_distance'[source]
recurrent = True[source]
init(output, output_with_activation=None, target=None, **kwargs)[source]
Parameters:
  • output (Data) – generated output

  • output_with_activation (OutputWithActivation|None)

  • target (Data) – reference target from dataset

get_output_logits()[source]
Returns:

outputs in log-space / logits

Return type:

tf.Tensor

get_error()[source]
Return type:

tf.Tensor

get_value()[source]
Return type:

None

class returnn.tf.layers.basic.BleuLoss(**kwargs)[source]

Note that this loss is not differentiable, thus it’s only for keeping statistics. Also, BLEU is a score, i.e. the higher, the better. Thus, to interpret it as a loss or error, we take the negative value.

Parameters:
  • base_network (returnn.tf.network.TFNetwork)

  • use_flatten_frames (bool) – will use returnn.tf.util.basic.flatten_with_seq_len_mask()

  • use_normalized_loss (bool) – the loss used in optimization will be normalized

  • custom_norm_factor (float|function|None) – The standard norm factor is 1/sum(target_seq_len) if the target has a time-axis, or 1/sum(output_seq_len) if there is no target and the output has a time-axis, or 1 otherwise. (See Loss.init() for details.) This is used for proper normalization of accumulated loss/error per epoch and also proper normalization per batch for reporting, no matter if use_normalized_loss is True or False. If you want to change this norm factor, you can set this. As a function, it takes (self=self, output=output, layer=layer) and returns a float scalar.

  • custom_inv_norm_factor (LayerBase|None) – inverse of custom_norm_factor. A layer can be passed here. Any shape is allowed; it will automatically be reduced via sum. So you could simply pass target_seq_len directly here. Basically, for all reporting, it uses sum(loss) / sum(custom_inv_norm_factor).

  • scale (float) – additional scale factor for the loss

  • _check_output_before_softmax (bool|None)

class_name: str = 'bleu'[source]
recurrent = True[source]
init(output, output_with_activation=None, target=None, **kwargs)[source]
Parameters:
  • output (Data) – generated output

  • output_with_activation (OutputWithActivation|None)

  • target (Data) – reference target from dataset

get_error()[source]
Return type:

tf.Tensor

get_value()[source]
Return type:

None

class returnn.tf.layers.basic.ExpectedLoss(loss, loss_kind, norm_scores=True, norm_scores_stop_gradient=True, divide_beam_size=True, subtract_average_loss=True, loss_correction_grad_only=False, **kwargs)[source]

This loss takes the error or value of another loss and, given the search beam scores, calculates the expected loss. Sometimes also called minimum Bayes risk.

Parameters:
  • loss (Loss)

  • loss_kind (str) – “error” or “value”. whether to use loss.get_error() or loss.get_value()

  • norm_scores (bool)

  • norm_scores_stop_gradient (bool)

  • divide_beam_size (bool)

  • subtract_average_loss (bool)

  • loss_correction_grad_only (bool)
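
The core idea can be sketched in a few lines: normalize the beam scores into a distribution over hypotheses and take the expectation of the per-hypothesis loss. This sketch assumes the beam scores are log-scores and only models the normalization; divide_beam_size, subtract_average_loss and the gradient-related options are not modeled, and the function name is made up.

```python
import math


def expected_loss(beam_scores, beam_losses):
    """Expectation of the loss under softmax-normalized beam scores (illustrative)."""
    # Softmax over the beam, with the maximum subtracted for stability.
    m = max(beam_scores)
    weights = [math.exp(s - m) for s in beam_scores]
    norm = sum(weights)
    probs = [w / norm for w in weights]
    return sum(p * loss for p, loss in zip(probs, beam_losses))


# Two hypotheses with equal scores: the expected loss is the plain average.
avg = expected_loss([0.0, 0.0], [1.0, 3.0])
# The first hypothesis is three times as likely as the second.
weighted = expected_loss([math.log(3.0), 0.0], [0.0, 4.0])
```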

class_name: str = 'expected_loss'[source]
recurrent = True[source]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
init(**kwargs)[source]

Overwrites super. Get search choices.

get_value()[source]
Return type:

tf.Tensor

get_error()[source]
Return type:

None

class returnn.tf.layers.basic.DeepClusteringLoss(embedding_dimension, nr_of_sources, **kwargs)[source]

Cost function used for deep clustering as described in [Hershey & Chen+, 2016]: “Deep clustering: Discriminative embeddings for segmentation and separation”

Parameters:
  • embedding_dimension (int)

  • nr_of_sources (int)

class_name: str = 'deep_clustering'[source]
get_error()[source]
Returns:

frame error rate as a scalar value

Return type:

tf.Tensor | None

get_value()[source]
Return type:

tf.Tensor

class returnn.tf.layers.basic.L1Loss(base_network, use_flatten_frames=True, use_normalized_loss=False, custom_norm_factor=None, custom_inv_norm_factor=None, scale=1.0, _check_output_before_softmax=None)[source]

L1-distance loss. sum(abs(target - output)).
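
A short numeric sketch of the L1 distance, with the absolute value taken per element before summing. The helper name is made up for illustration.

```python
def l1_loss(target, output):
    """Sum of absolute differences (illustrative helper)."""
    return sum(abs(t - o) for t, o in zip(target, output))


target = [1.0, 2.0, 3.0]
output = [1.5, 2.0, 1.0]
l1 = l1_loss(target, output)  # 0.5 + 0.0 + 2.0

# Without the absolute value, positive and negative differences would cancel.
signed_sum = sum(t - o for t, o in zip(target, output))  # -0.5 + 0.0 + 2.0
```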

Parameters:
  • base_network (returnn.tf.network.TFNetwork)

  • use_flatten_frames (bool) – will use returnn.tf.util.basic.flatten_with_seq_len_mask()

  • use_normalized_loss (bool) – the loss used in optimization will be normalized

  • custom_norm_factor (float|function|None) – The standard norm factor is 1/sum(target_seq_len) if the target has a time-axis, or 1/sum(output_seq_len) if there is no target and the output has a time-axis, or 1 otherwise. (See Loss.init() for details.) This is used for proper normalization of accumulated loss/error per epoch and also proper normalization per batch for reporting, no matter if use_normalized_loss is True or False. If you want to change this norm factor, you can set this. As a function, it takes (self=self, output=output, layer=layer) and returns a float scalar.

  • custom_inv_norm_factor (LayerBase|None) – inverse of custom_norm_factor. A layer can be passed here. Any shape is allowed; it will automatically be reduced via sum. So you could simply pass target_seq_len directly here. Basically, for all reporting, it uses sum(loss) / sum(custom_inv_norm_factor).

  • scale (float) – additional scale factor for the loss

  • _check_output_before_softmax (bool|None)

class_name: str = 'l1'[source]
get_value()[source]
Return type:

tf.Tensor

class returnn.tf.layers.basic.MeanSquaredError(base_network, use_flatten_frames=True, use_normalized_loss=False, custom_norm_factor=None, custom_inv_norm_factor=None, scale=1.0, _check_output_before_softmax=None)[source]

The generic mean squared error loss function

Parameters:
  • base_network (returnn.tf.network.TFNetwork)

  • use_flatten_frames (bool) – will use returnn.tf.util.basic.flatten_with_seq_len_mask()

  • use_normalized_loss (bool) – the loss used in optimization will be normalized

  • custom_norm_factor (float|function|None) – The standard norm factor is 1/sum(target_seq_len) if the target has a time-axis, or 1/sum(output_seq_len) if there is no target and the output has a time-axis, or 1 otherwise. (See Loss.init() for details.) This is used for proper normalization of accumulated loss/error per epoch and also proper normalization per batch for reporting, no matter if use_normalized_loss is True or False. If you want to change this norm factor, you can set this. As a function, it takes (self=self, output=output, layer=layer) and returns a float scalar.

  • custom_inv_norm_factor (LayerBase|None) – inverse of custom_norm_factor. A layer can be passed here. Any shape is allowed; it will automatically be reduced via sum. So you could simply pass target_seq_len directly here. Basically, for all reporting, it uses sum(loss) / sum(custom_inv_norm_factor).

  • scale (float) – additional scale factor for the loss

  • _check_output_before_softmax (bool|None)

class_name: str = 'mse'[source]
get_value()[source]
Return type:

tf.Tensor

class returnn.tf.layers.basic.MeanL1Loss(base_network, use_flatten_frames=True, use_normalized_loss=False, custom_norm_factor=None, custom_inv_norm_factor=None, scale=1.0, _check_output_before_softmax=None)[source]

Like MSE loss, but with absolute difference
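
To make the comparison concrete, the sketch below computes both quantities for one frame. It assumes the mean is taken over the feature dimension of that frame; the helper names are made up.

```python
def mean_squared_error(target, output):
    """Mean of squared differences over the features of one frame."""
    return sum((t - o) ** 2 for t, o in zip(target, output)) / len(target)


def mean_l1(target, output):
    """Mean of absolute differences over the features of one frame."""
    return sum(abs(t - o) for t, o in zip(target, output)) / len(target)


target = [1.0, 2.0, 3.0]
output = [1.5, 2.0, 1.0]
mse = mean_squared_error(target, output)  # (0.25 + 0.0 + 4.0) / 3
ml1 = mean_l1(target, output)             # (0.5 + 0.0 + 2.0) / 3
```

The squared error weights the large deviation in the last feature more heavily than the absolute difference does.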

Parameters:
  • base_network (returnn.tf.network.TFNetwork)

  • use_flatten_frames (bool) – will use returnn.tf.util.basic.flatten_with_seq_len_mask()

  • use_normalized_loss (bool) – the loss used in optimization will be normalized

  • custom_norm_factor (float|function|None) – The standard norm factor is 1/sum(target_seq_len) if the target has a time-axis, or 1/sum(output_seq_len) if there is no target and the output has a time-axis, or 1 otherwise. (See Loss.init() for details.) This is used for proper normalization of accumulated loss/error per epoch and also proper normalization per batch for reporting, no matter if use_normalized_loss is True or False. If you want to change this norm factor, you can set this. As a function, it takes (self=self, output=output, layer=layer) and returns a float scalar.

  • custom_inv_norm_factor (LayerBase|None) – inverse of custom_norm_factor. A layer can be passed here. Any shape is allowed; it will automatically be reduced via sum. So you could simply pass target_seq_len directly here. Basically, for all reporting, it uses sum(loss) / sum(custom_inv_norm_factor).

  • scale (float) – additional scale factor for the loss

  • _check_output_before_softmax (bool|None)

class_name: str = 'mean_l1'[source]
get_value()[source]
Return type:

tf.Tensor

class returnn.tf.layers.basic.ExternSprintLoss(sprint_opts, **kwargs)[source]

The loss is calculated by an extern Sprint instance.

Parameters:

sprint_opts (dict[str])

class_name: str = 'sprint'[source]
recurrent = True[source]
need_target = False[source]
get_value()[source]
Return type:

tf.Tensor

get_error()[source]
Return type:

tf.Tensor|None

class returnn.tf.layers.basic.FastBaumWelchLoss(sprint_opts, tdp_scale=1.0, **kwargs)[source]

The loss is calculated via fast_baum_welch(). The automata are created by an extern Sprint instance.

Parameters:

sprint_opts (dict[str])

class_name: str = 'fast_bw'[source]
recurrent = True[source]
need_target = False[source]
get_value()[source]
Return type:

tf.Tensor

get_error()[source]
Return type:

tf.Tensor|None

class returnn.tf.layers.basic.ViaLayerLoss(error_signal_layer=None, align_layer=None, loss_wrt_to_act_in=False, **kwargs)[source]

The loss error signal and loss value are defined as the output of another layer. That way, you can define any custom loss. This could e.g. be used together with the fast_bw layer.

This is a more custom variant of AsIsLoss, which simply takes the output of a layer as loss without redefining the error signal (gradient).

Parameters:
  • error_signal_layer (LayerBase)

  • align_layer (LayerBase)

  • loss_wrt_to_act_in (bool|str) – if True, we expect that the given output_with_activation is set, and the given error signal is w.r.t. the input of the specific activation function. A common example is the input to the softmax function, where the gradient is much more stable to define, e.g. y - z instead of y/z for cross entropy. If you specify a str, e.g. “softmax” or “log_softmax”, there is an additional check that the used activation function is really that one.
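
A rough sketch of how this loss might be combined with the fast_bw layer in a network dict. The layer names "output" and "encoder" and the key "classes" are hypothetical, and a real setup would most likely also need ctc_opts or sprint_opts, whose contents are not shown here.

```python
# Hypothetical network-dict fragment; layer names and data key are made up.
network = {
    "output": {
        "class": "softmax",
        "from": "encoder",  # assumed source layer
        "loss": "via_layer",  # class_name of ViaLayerLoss
        "loss_opts": {
            "align_layer": "fast_bw",
            # The error signal is taken w.r.t. the softmax input,
            # with a check that the activation really is softmax.
            "loss_wrt_to_act_in": "softmax",
        },
    },
    "fast_bw": {
        "class": "fast_bw",  # layer_class of FastBaumWelchLayer
        "from": "output",
        "input_type": "prob",  # "output" is in probability space here
        "align_target": "ctc",
        "align_target_key": "classes",  # assumed target data key
    },
}
```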

class_name: str = 'via_layer'[source]
recurrent = True[source]
need_target = False[source]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace, the loss_opts

  • network (returnn.tf.network.TFNetwork)

  • get_layer (((str) -> LayerBase)) – function to get or construct another layer

get_value()[source]
Return type:

tf.Tensor

get_error()[source]
Return type:

tf.Tensor|None

class returnn.tf.layers.basic.AsIsLoss(as_error=False, **kwargs)[source]

Use the output as-is as the loss.

Also see ViaLayerLoss, which also allows defining a custom error signal (gradient).

Parameters:

as_error (bool) – if True, use the output as error, otherwise (default) use the output as loss value. Error is purely for reporting, loss value is used for the optimizer as well (when scale != 0).
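
A minimal sketch of both variants in a network dict. The layer names and the source layers "my_loss_values" and "my_error_values" are hypothetical; they stand for any layers whose output should be treated as a loss or as an error.

```python
# Hypothetical network-dict fragment; all layer names are made up.
network = {
    # Output is used as the loss value, which also goes to the optimizer.
    "custom_loss": {
        "class": "copy",
        "from": "my_loss_values",
        "loss": "as_is",  # class_name of AsIsLoss
    },
    # Output is only reported as an error.
    "custom_error": {
        "class": "copy",
        "from": "my_error_values",
        "loss": "as_is",
        "loss_opts": {"as_error": True},
    },
}
```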

class_name: str = 'as_is'[source]
need_target = False[source]
get_value()[source]
Return type:

tf.Tensor|None

get_error()[source]
Return type:

tf.Tensor|None

class returnn.tf.layers.basic.SearchScoreLoss(base_network, use_flatten_frames=True, use_normalized_loss=False, custom_norm_factor=None, custom_inv_norm_factor=None, scale=1.0, _check_output_before_softmax=None)[source]

Use the scores from SearchChoices.

Parameters:
  • base_network (returnn.tf.network.TFNetwork)

  • use_flatten_frames (bool) – will use returnn.tf.util.basic.flatten_with_seq_len_mask()

  • use_normalized_loss (bool) – the loss used in optimization will be normalized

  • custom_norm_factor (float|function|None) – The standard norm factor is 1/sum(target_seq_len) if the target has a time-axis, or 1/sum(output_seq_len) if there is no target and the output has a time-axis, or 1 otherwise. (See Loss.init() for details.) This is used for proper normalization of accumulated loss/error per epoch and also proper normalization per batch for reporting, no matter if use_normalized_loss is True or False. If you want to change this norm factor, you can set this. As a function, it takes (self=self, output=output, layer=layer) and returns a float scalar.

  • custom_inv_norm_factor (LayerBase|None) – inverse of custom_norm_factor. A layer can be passed here. Any shape is allowed; it will automatically be reduced via sum. So you could simply pass target_seq_len directly here. Basically, for all reporting, it uses sum(loss) / sum(custom_inv_norm_factor).

  • scale (float) – additional scale factor for the loss

  • _check_output_before_softmax (bool|None)

class_name: str = 'search_score'[source]
need_target = False[source]
reduce_to_batch(loss, normalize)[source]
Parameters:
  • loss (tf.Tensor) – (batch,)

  • normalize (bool) – reduce mean instead of reduce sum

Returns:

(batch,)

Return type:

tf.Tensor

get_value()[source]
Return type:

tf.Tensor

get_error()[source]
Return type:

None

class returnn.tf.layers.basic.SamplingBasedLoss(num_sampled=128, num_splits=1, sampler='log_uniform', nce_loss=False, use_full_softmax=False, remove_accidental_hits=None, sampler_args=None, nce_log_norm_term=0.0, **kwargs)[source]

Implement two sampling based losses, sampled softmax (default) and noise contrastive esti