Basic Layers

Accumulate Mean Layer

class returnn.tf.layers.basic.AccumulateMeanLayer(exp_average, axes='bt', initial_value=None, is_prob_distribution=None, **kwargs)[source]

Accumulates the mean of the input, by default over the batch-dim and time-dim (only in training). It is similar to ReduceLayer.

Parameters:
  • exp_average (float) – momentum in exponential average calculation
  • axes (int|list[str]|str) – the axes to reduce; must contain batch and time.
  • initial_value (float) – how to initialize the variable which accumulates the mean
  • is_prob_distribution (bool) – if provided, better default for initial_value
layer_class = 'accumulate_mean'[source]
classmethod get_out_data_from_opts(axes='bt', **kwargs)[source]
Parameters:axes (str) –
Return type:Data
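As an illustration, the exponential-average update can be sketched in plain Python/numpy (the exact momentum convention here is an assumption, not taken from RETURNN's implementation):

```python
import numpy as np

def accumulate_mean(batches, exp_average, initial_value=0.0):
    """Running exponential average of per-batch means (sketch of the update rule)."""
    acc = initial_value
    for x in batches:
        batch_mean = np.mean(x)  # reduce over batch-dim and time-dim
        acc = exp_average * batch_mean + (1.0 - exp_average) * acc
    return acc

# With a constant input stream, the accumulated value converges to the data mean:
val = accumulate_mean([np.full((2, 3), 5.0)] * 100, exp_average=0.1)
```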

Activation Layer

class returnn.tf.layers.basic.ActivationLayer(activation, **kwargs)[source]

This layer just applies an activation function. See TFUtil.get_activation_function() about supported functions. Also see EvalLayer and CombineLayer for similar layers.

Parameters:activation (str) – e.g. “relu”, “tanh”, etc.
layer_class = 'activation'[source]
classmethod get_out_data_from_opts(activation, **kwargs)[source]
Parameters:activation (str) –
Return type:Data
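A minimal network fragment using this layer (the source layer name “hidden” is illustrative):

```python
# Hypothetical fragment of a RETURNN network config dict.
network = {
    "act": {"class": "activation", "activation": "relu", "from": "hidden"},
}
```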

Combine Layer

class returnn.tf.layers.basic.CombineLayer(kind, sources, activation=None, with_bias=False, eval=None, eval_locals=None, eval_for_output_loss=False, **kwargs)[source]

Applies a binary operation, such as addition, to all sources while accumulating the partial results. In the first step, the binary operation is performed on the first two sources. After the first step, the previous result is always the left-hand operand.

Its basic working is similar to the reduce function used in functional programming. Also see ActivationLayer, or CompareLayer.

Parameters:
  • kind (str) – currently accepted values are average, add, sub, mul, or eval
  • sources (list[LayerBase]) –
  • activation (str|None) – if provided, activation function to apply, e.g. “tanh” or “relu”
  • with_bias (bool) – if True, will add a trainable bias tensor
  • eval (str|callable) – for kind=”eval”, will eval this string or call this function; see _op_kind_eval()
  • eval_locals (dict[str]|None) – locals for eval
  • eval_for_output_loss (bool) – will do the same eval on layer.output_loss
layer_class = 'combine'[source]
classmethod get_out_data_from_opts(n_out=NotSpecified, out_type=None, sources=(), **kwargs)[source]
Parameters:
  • n_out (int|None|NotSpecified) –
  • out_type (dict[str]|None) –
  • sources (list[LayerBase]) –
Return type: Data
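The left-fold over sources can be sketched with functools.reduce (an analogy in plain Python, not RETURNN internals):

```python
from functools import reduce
import operator

sources = [1.0, 2.0, 3.0, 4.0]
total = reduce(operator.add, sources)  # kind="add": ((s1 + s2) + s3) + s4
diff = reduce(operator.sub, sources)   # kind="sub": ((s1 - s2) - s3) - s4
```

In a config, this would correspond to e.g. {"class": "combine", "kind": "add", "from": ["s1", "s2", "s3", "s4"]} (layer names illustrative).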

Compare Layer

class returnn.tf.layers.basic.CompareLayer(kind='equal', value=None, **kwargs)[source]

Compares the tokens of all input sequences element-wise among themselves and/or with a given value. The comparisons are performed in a chain according to the order in which they are listed.

Example:

{"class": "compare", "from": ["i1", "i2"], "value": val, "kind": "less"}

computes i1 < i2 < val, which is true only if the whole chain of comparisons holds. The final result is the logical “and” of all comparisons. Note that value is always the last element to be compared to.

A common example usage is the end layer in a rec subnetwork to specify the stopping criterion, e.g. the last generated token is equal to the end-of-sentence token:

"output": {"class": "rec", "from": [], "unit": {
    ...
    "end": {"class": "compare", "from": "output", "value": end_of_sentence_id}
}, "target": "classes0"}
Parameters:
  • kind (str) – which comparison operation to use, e.g. “equal”, “greater”, “less” or other supported TF comparison ops
  • value (float|int|None) – if specified, will also compare to this
layer_class = 'compare'[source]
classmethod get_out_data_from_opts(n_out=NotSpecified, out_type=None, sources=(), **kwargs)[source]
Parameters:
  • n_out (int|None|NotSpecified) –
  • out_type (dict[str]|None) –
  • sources (list[LayerBase]) –
Return type: Data

Constant Layer

class returnn.tf.layers.basic.ConstantLayer(sources, value=0.0, dtype=None, with_batch_dim=False, **kwargs)[source]

Output is a constant value.

Parameters:
  • sources (list[LayerBase]) –
  • value (int|float|bool) –
  • dtype (str|None) –
  • with_batch_dim (bool) –
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace
  • network (returnn.tf.network.TFNetwork) –
  • get_layer (((str) -> LayerBase)) – function to get or construct another layer
classmethod get_out_data_from_opts(name, value=0.0, dtype=None, with_batch_dim=False, **kwargs)[source]
Parameters:
  • name (str) –
  • value (int|float|bool) –
  • dtype (str|None) –
  • with_batch_dim (bool) –
Return type: Data

Convolution Layer

class returnn.tf.layers.basic.ConvLayer(n_out, filter_size, padding, strides=1, dilation_rate=1, input_expand_dims=0, input_add_feature_dim=False, input_split_feature_dim=None, auto_use_channel_first=False, with_bias=False, activation=None, forward_weights_init='glorot_uniform', bias_init=0.0, **kwargs)[source]

A generic convolution layer which supports 1D, 2D and 3D convolution. Pooling can be done in the separate “pool” layer.

Parameters:
  • n_out (int) – number of outgoing features
  • filter_size (tuple[int]) – (width,), (height,width) or (depth,height,width) for 1D/2D/3D conv. the input data ndim must match, or you can add dimensions via input_expand_dims or input_add_feature_dim. it will automatically swap the batch-dim to the first axis of the input data.
  • padding (str) – “same” or “valid”
  • strides (int|tuple[int]) – strides for the spatial dims, i.e. length of this tuple should be the same as filter_size, or a single int.
  • dilation_rate (int|tuple[int]) – dilation for the spatial dims
  • input_expand_dims (int) – number of dynamic dims to add to the input
  • input_add_feature_dim (bool) – will add a dim at the end and use input-feature-dim == 1, and use the original input feature-dim as a spatial dim.
  • auto_use_channel_first (bool) – convert the input to NCHW or not
  • input_split_feature_dim (None|int) – if set, like input_add_feature_dim it will add a new feature dim which is of value input_split_feature_dim, and the original input feature dim will be divided by input_split_feature_dim, thus it must be a multiple of that value.
  • with_bias (bool) – if True, will add a bias to the output features
  • activation (None|str) – if set, will apply this function at the end
layer_class = 'conv'[source]
recurrent = True[source]
classmethod calc_out_dim(in_dim, filter_size, stride, padding, dilation_rate=1)[source]
Parameters:
  • in_dim (int|tf.Tensor|T) – dimension in some axis
  • filter_size (int) – e.g. 2, for the corresponding axis
  • stride (int) – e.g. 1, for the corresponding axis
  • dilation_rate (int) – e.g. 1
  • padding (str) – “valid” or “same”
Returns: the output dimension
Return type: T

classmethod get_out_data_from_opts(**kwargs)[source]

Via _get_out_type_from_opts().

Return type:Data
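The output size from calc_out_dim() follows standard convolution arithmetic; a plain-Python sketch (assuming the usual TF “valid”/“same” conventions):

```python
import math

def calc_out_dim(in_dim, filter_size, stride, padding, dilation_rate=1):
    """Standard conv output-size arithmetic (sketch, TF conventions assumed)."""
    effective_filter = (filter_size - 1) * dilation_rate + 1
    if padding.lower() == "valid":
        return math.ceil((in_dim - effective_filter + 1) / stride)
    if padding.lower() == "same":
        return math.ceil(in_dim / stride)
    raise ValueError("padding must be 'valid' or 'same'")

out_valid = calc_out_dim(10, filter_size=3, stride=1, padding="valid")  # 8
out_same = calc_out_dim(10, filter_size=3, stride=2, padding="same")    # 5
```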

Copy Layer

class returnn.tf.layers.basic.CopyLayer(extra_deps=(), **kwargs)[source]

This layer does nothing except copy its input. If multiple sources are provided, they are concatenated in the feature-dim.

Parameters:extra_deps (list[LayerBase]) – Just add as an additional dependency, without really using it. This can have an effect though on the search beam, via SelectSearchSourcesLayer. We only have this here for the CopyLayer because the get_out_data_from_opts() must know about it and define the right beam. Also see the option collocate_with, which is different in that it does not add a dependency.
layer_class = 'copy'[source]
get_dep_layers()[source]
Return type:list[LayerBase]
classmethod get_out_data_from_opts(name, sources=(), extra_deps=(), out_type=None, n_out=NotSpecified, **kwargs)[source]
Parameters:
  • name (str) –
  • sources (list[LayerBase]) –
  • extra_deps (list[LayerBase]) –
  • out_type (dict[str]|None) –
  • n_out (int|None|NotSpecified) –
Return type: Data

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace
  • network (returnn.tf.network.TFNetwork) –
  • get_layer (((str) -> LayerBase)) – function to get or construct another layer

Cumulative Sum Layer

class returnn.tf.layers.basic.CumsumLayer(axis='T', additional_left_summand_per_element=None, reverse=False, **kwargs)[source]

Basically wraps tf.cumsum. Also supported inside the RecLayer.

Parameters:
  • axis (str) – see Data.get_axis_from_description()
  • additional_left_summand_per_element (str|int|float|None) – the order matters for tf.string
  • reverse (bool) –
layer_class = 'cumsum'[source]
recurrent = True[source]
classmethod get_out_data_from_opts(name, sources, axis='T', **kwargs)[source]
Parameters:
  • name (str) –
  • sources (list[LayerBase]) –
  • axis (str) –
Return type: Data

classmethod get_rec_initial_extra_outputs(batch_dim, rec_layer, axis='T', sources=(), **kwargs)[source]
Parameters:
  • batch_dim (tf.Tensor) –
  • rec_layer (TFNetworkRecLayer.RecLayer|LayerBase) –
  • axis (str) –
  • sources (list[LayerBase]) –
Return type: dict[str,tf.Tensor]
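In plain numpy terms (the exact semantics of additional_left_summand_per_element are an assumption here):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
out = np.cumsum(x)                  # [1., 3., 6.]
# reverse=True accumulates from the right instead:
out_rev = np.cumsum(x[::-1])[::-1]  # [6., 5., 3.]
# additional_left_summand_per_element=c adds c to each element before summing (assumption):
out_shift = np.cumsum(x + 1.0)      # [2., 5., 9.]
```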

Dot Layer

class returnn.tf.layers.basic.DotLayer(red1=-1, red2=-2, var1=-2, var2=-1, add_var2_if_empty=True, debug=False, **kwargs)[source]

This performs a dot-product of two sources. The underlying matmul expects shapes (shared…, I, J) * (shared…, J, K) -> (shared…, I, K). We say that J is the axis to be reduced, I is the var-dim of source 1, and K is the var-dim of source 2. I, J, K can also be multiple axes from the sources. The var-dims don’t need to exist. All other axes (shared…) are expected to match.

Parameters:
  • red1 (str|int|tuple[str|int]|list[str|int]) – reduce axes of first source
  • red2 (str|int|tuple[str|int]|list[str|int]) – reduce axes of second source
  • var1 (str|int|tuple[str|int]|list[str|int]|None) – var axes of first source
  • var2 (str|int|tuple[str|int]|list[str|int]|None) – var axes of second source
  • add_var2_if_empty (bool) – if var2=None, add dim=1 at the end
  • debug (bool) – will print debug shapes, etc.
layer_class = 'dot'[source]
classmethod get_out_data_from_opts(name, sources, red1=-1, red2=-2, var1=-2, var2=-1, add_var2_if_empty=True, **kwargs)[source]
Parameters:
  • name (str) –
  • sources (list[LayerBase]) –
  • red1 (str|int|tuple[str|int]|list[str|int]) – reduce axes of first source
  • red2 (str|int|tuple[str|int]|list[str|int]) – reduce axes of second source
  • var1 (str|int|tuple[str|int]|list[str|int]|None) – var axes of first source
  • var2 (str|int|tuple[str|int]|list[str|int]|None) – var axes of second source
  • add_var2_if_empty (bool) –
Return type: Data
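The shape contract can be illustrated with plain numpy matmul (this shows only the (shared…, I, J) x (shared…, J, K) contraction, not the layer's axis selection):

```python
import numpy as np

batch, I, J, K = 2, 3, 4, 5
a = np.random.rand(batch, I, J)  # source 1: var-dim I, reduce-dim J
b = np.random.rand(batch, J, K)  # source 2: reduce-dim J, var-dim K
out = np.matmul(a, b)            # shared batch dim matches; J is contracted
```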

Elementwise Product Layer

class returnn.tf.layers.basic.ElemwiseProdLayer(axes, size=None, **kwargs)[source]

Element-wise product in some axes. Microsoft calls this “static attention”, in Deep Conv. NN with Layer-wise Context Expansion and Attention (LACE). The matrix/tensor to be used for the product is given as a trainable parameter. See also LinearLayer.

Parameters:
  • axes (str|list[str]) – e.g. “spatial”, but all those axes must be of fixed dimension
  • size (tuple[int]) – for double-checking, you can explicitly provide the size
layer_class = 'elemwise_prod'[source]
classmethod get_out_data_from_opts(name, sources, **kwargs)[source]
Parameters:
  • name (str) –
  • sources (list[LayerBase]) –
Return type: Data

Gating Layer

class returnn.tf.layers.basic.GatingLayer(activation, gate_activation='sigmoid', **kwargs)[source]

Splits the output into two equal parts, applies the gate_activation (sigmoid by default) on the one part, some other activation (e.g. tanh) on the other part and then element-wise multiplies them. Thus, the output dimension is input-dimension / 2.

layer_class = 'gating'[source]
classmethod get_out_data_from_opts(name, sources, n_out=NotSpecified, **kwargs)[source]
Parameters:
  • name (str) –
  • sources (list[LayerBase]) –
  • n_out (int|None|NotSpecified) –
Return type: Data
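In plain numpy, with activation=”tanh”, the computation can be sketched as follows (which half receives the gate activation is an assumption here):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gating(x, activation=np.tanh):
    # split the feature dim (last axis) into two equal halves
    a, b = np.split(x, 2, axis=-1)
    return activation(a) * sigmoid(b)

out = gating(np.zeros((2, 4, 8)))  # feature dim 8 -> output feature dim 4
```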

Linear Layer

class returnn.tf.layers.basic.LinearLayer(activation, with_bias=True, grad_filter=None, forward_weights_init='glorot_uniform', bias_init=0.0, use_transposed_weights=False, **kwargs)[source]

Linear/forward/fully-connected/1x1-conv layer. Does a linear transformation on the feature-dimension of the input with an optional bias term and an optional activation function. See also DotLayer, ElemwiseProdLayer, WeightedSumLayer.

Parameters:
  • activation (str|None) – e.g. “relu”, or None
  • with_bias (bool) –
  • grad_filter (float|None) – if grad norm is higher than this threshold (before activation), the grad is removed
  • forward_weights_init (str) – see TFUtil.get_initializer()
  • recurrent_weights_init (str) – see TFUtil.get_initializer()
  • bias_init (str|float) – see TFUtil.get_initializer()
  • use_transposed_weights (bool) – If True, define the weight matrix with transposed dimensions (n_out, n_in).
layer_class = 'linear'[source]
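A typical usage in a network config (layer name and dimensions are illustrative):

```python
# Hypothetical fragment of a RETURNN network config dict.
network = {
    "ff": {"class": "linear", "activation": "relu", "with_bias": True,
           "n_out": 512, "from": "data"},
}
```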

Pooling Layer

class returnn.tf.layers.basic.PoolLayer(mode, pool_size, padding='VALID', dilation_rate=1, strides=None, use_channel_first=False, **kwargs)[source]

A generic N-D pooling layer. This would usually be done after a convolution for down-sampling.

Parameters:
  • mode (str) – “max” or “avg”
  • pool_size (tuple[int]) – shape of the window of each reduce
  • padding (str) – “valid” or “same”
  • dilation_rate (tuple[int]|int) –
  • strides (tuple[int]|int|None) – in contrast to tf.nn.pool, the default (if it is None) will be set to pool_size
  • use_channel_first (bool) – if set, will transform input to NCHW format
layer_class = 'pool'[source]
recurrent = True[source]
classmethod get_out_data_from_opts(name, pool_size, strides=None, dilation_rate=1, sources=(), padding='VALID', use_channel_first=False, **kwargs)[source]
Parameters:
  • name (str) –
  • pool_size (tuple[int]|list[int]) –
  • strides (tuple[int]|list[int]|int) –
  • dilation_rate (int|tuple[int]|list[int]) –
  • sources (list[LayerBase]) –
  • padding (str) –
  • use_channel_first (bool) –
Return type: Data

Reduce Layer

class returnn.tf.layers.basic.ReduceLayer(mode, axes=None, axis=None, keep_dims=False, enforce_batch_dim_axis=None, use_time_mask=None, **kwargs)[source]

This reduces some axis by using “sum” or “max”. It’s basically a wrapper around tf.reduce_sum or tf.reduce_max.

Parameters:
  • mode (str) – “sum”, “max”, “argmin”, “min”, “argmax”, “mean” or “logsumexp”
  • axes (int|list[int]|str) – one axis or multiple axes to reduce. It accepts the special tokens “B”|”batch”, “spatial”, “spatial_except_time”, or “F”|”feature”, and it is strongly recommended to use some of these symbolic names. See Data.get_axes_from_description().
  • axis (int|list[int]|str) – for compatibility, can be used instead of axes
  • keep_dims (bool) – if dimensions should be kept (will be 1)
  • enforce_batch_dim_axis (int) – will swap the batch-dim-axis of the input with the given axis. e.g. 0: will convert the input into batch-major format if not already like that. Note that this is still not enough in some cases, e.g. when the other axes are also not as expected. The strong recommendation is to use a symbolic axis description.
  • use_time_mask (bool) – if we reduce over the time-dim axis, use the seq len info. By default, in that case, it will be True.
layer_class = 'reduce'[source]
classmethod need_enforce_batch_dim_axis(axes)[source]
Parameters:axes (int|list[int]|str) –
Returns:whether any integer is in axes, in which case we need a fixed dimension layout
Return type:bool
classmethod get_axes(axis, input_data)[source]
Parameters:
  • axis – see self.__init__()
  • input_data (Data) –
Returns: list of axes
Return type: list[int]

classmethod get_out_data_from_opts(name, sources, mode='', axes=None, axis=None, keep_dims=False, enforce_batch_dim_axis=None, **kwargs)[source]
Parameters:
  • name (str) –
  • sources (list[LayerBase]) –
  • mode (str) – (default here “” because other code uses this function)
  • axes (str|list[str]|None) –
  • axis (str|None) –
  • keep_dims (bool) –
  • enforce_batch_dim_axis (int|None) –
Return type: Data
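The effect of use_time_mask for mode=”mean” can be sketched in numpy (hypothetical shapes; in RETURNN the seq-len info comes from Data):

```python
import numpy as np

x = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 0.0]])  # (batch=2, time=3); second seq has one padded frame
seq_lens = np.array([3, 2])
mask = np.arange(x.shape[1])[None, :] < seq_lens[:, None]  # (2, 3) bool
# masked mean over the time axis, ignoring padded frames:
mean = (x * mask).sum(axis=1) / seq_lens
```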

Reduce-Out Layer

class returnn.tf.layers.basic.ReduceOutLayer(mode, num_pieces, **kwargs)[source]

Combination of SplitDimsLayer applied to the feature dim and ReduceLayer applied to the resulting feature dim. This can e.g. be used to do maxout.

Parameters:
  • mode (str) – “sum” or “max” or “mean”
  • num_pieces (int) – how many elements to reduce. The output dimension will be input.dim // num_pieces.
layer_class = 'reduce_out'[source]
classmethod get_out_data_from_opts(num_pieces, sources, name, **kwargs)[source]
Parameters:
  • num_pieces (int) –
  • sources (list[LayerBase]) –
  • name (str) –
Return type: Data
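With mode=”max” this is maxout; a numpy sketch (assuming the pieces are contiguous groups within the feature dim):

```python
import numpy as np

def reduce_out_max(x, num_pieces):
    """Maxout: group the feature dim into chunks of num_pieces, take the max per chunk."""
    assert x.shape[-1] % num_pieces == 0
    grouped = x.reshape(x.shape[:-1] + (x.shape[-1] // num_pieces, num_pieces))
    return grouped.max(axis=-1)

out = reduce_out_max(np.arange(12.0).reshape(2, 6), num_pieces=2)  # (2, 6) -> (2, 3)
```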

Switch Layer

class returnn.tf.layers.basic.SwitchLayer(condition, true_from, false_from, **kwargs)[source]

Wrapper around tf.where() (or more generically TFUtil.where_bc()), or statically choose a single source if the condition is a callable (…)->bool. (tf.cond is not useful here, as the sources would have been already constructed and computed.) See also CondLayer.

Parameters:
  • condition (LayerBase|bool) – if callable, expected to be (…)->bool, and called in transform_config_dict
  • true_from (LayerBase|None) –
  • false_from (LayerBase|None) –
layer_class = 'switch'[source]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace
  • network (returnn.tf.network.TFNetwork) –
  • get_layer (((str) -> LayerBase)) – function to get or construct another layer
classmethod get_out_data_from_opts(name, condition, true_from, false_from, **kwargs)[source]
Parameters:
  • name (str) –
  • condition (LayerBase|bool) –
  • true_from (LayerBase|None) –
  • false_from (LayerBase|None) –
Return type: Data

get_dep_layers()[source]
Return type:list[LayerBase]
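The broadcasting where-behavior can be illustrated with numpy (np.where broadcasts similarly to TFUtil.where_bc(); shapes are illustrative):

```python
import numpy as np

cond = np.array([[True], [False]])  # (2, 1), broadcasts against (2, 3)
a = np.ones((2, 3))
b = np.zeros((2, 3))
out = np.where(cond, a, b)          # row 0 taken from a, row 1 from b
```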

Variable Layer

class returnn.tf.layers.basic.VariableLayer(shape, dtype='float32', add_batch_axis=True, add_time_axis=False, trainable=True, init=0, **kwargs)[source]

Represents a variable, which can optionally be trainable. Can add a batch and/or time dimension if wanted; see the parameter defaults.

Parameters:
  • shape (tuple[int]|list[int]) –
  • dtype (str) –
  • add_batch_axis (bool) –
  • add_time_axis (bool) –
  • trainable (bool) –
  • init (str|float|int) – see TFUtil.get_initializer()
layer_class = 'variable'[source]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace
  • network (returnn.tf.network.TFNetwork) –
  • get_layer (((str) -> LayerBase)) – function to get or construct another layer
classmethod get_out_data_from_opts(name, shape, dtype='float32', add_batch_axis=True, add_time_axis=False, **kwargs)[source]
Parameters:
  • name (str) –
  • shape (tuple[int]|list[int]) –
  • dtype (str) –
  • add_batch_axis (bool) –
  • add_time_axis (bool) –
Return type: Data

Weighted Sum Layer

class returnn.tf.layers.basic.WeightedSumLayer(axes, padding=None, size=None, keep_dims=None, **kwargs)[source]

Calculates a weighted sum, either over a complete axis of fixed dimension, or over some window. Can also do that for multiple axes. The weights are a trainable parameter matrix. Similar would be to use ElemwiseProdLayer and ReduceLayer, or just a DotLayer with a VariableLayer. See also LinearLayer.

Parameters:
  • axes (str|list[str]) – the axes to do the weighted-sum over
  • padding (str) – “valid” or “same”, in case of keep_dims=True
  • size (None|tuple[int]) – the kernel-size. if left away, the axes must be of fixed dimension, and we will use keep_dims=False, padding=”valid” by default. Otherwise, if given, you must also provide padding and keep_dims=True by default.
  • keep_dims (bool) – if False, the axes will be squeezed away. see also size.
layer_class = 'weighted_sum'[source]
classmethod get_out_data_from_opts(name, sources, axes, padding=None, size=None, keep_dims=None, **kwargs)[source]
Parameters:
  • name (str) –
  • sources (list[LayerBase]) –
  • axes (str|list[str]) –
  • padding (str|None) –
  • size (None|tuple[int]) –
  • keep_dims (bool|None) –
Return type: Data

Window Layer

class returnn.tf.layers.basic.WindowLayer(window_size, window_left=None, window_right=None, axis='T', padding='same', **kwargs)[source]

Adds a window dimension. By default, uses the time axis and goes over it with a sliding window. The new axis for the window is created right after the time axis. The output is always in batch-major format. E.g. if the input is (batch, time, dim), the output is (batch, time, window_size, dim). If you want to merge (window_size, dim) together into (window_size * dim,), you can use the MergeDimsLayer, e.g. {“class”: “merge_dims”, “axes”: “except_time”}.

This is not meant to take out a single window from the time dimension; for that, see SliceLayer or SliceNdLayer.

Parameters:
  • window_size (int) –
  • window_left (int|None) –
  • window_right (int|None) –
  • axis (str|int) – see Data.get_axis_from_description()
  • padding (str) – “same” or “valid”
  • kwargs
layer_class = 'window'[source]
recurrent = True[source]
classmethod get_out_data_from_opts(name, window_size, axis='T', sources=(), **kwargs)[source]
Parameters:
  • name (str) –
  • sources (list[LayerBase]) –
  • window_size (int) –
  • axis (str) –
Return type: Data

classmethod get_rec_initial_extra_outputs(batch_dim, rec_layer, window_size, axis='T', sources=(), **kwargs)[source]
Parameters:
  • batch_dim (tf.Tensor) –
  • rec_layer (TFNetworkRecLayer.RecLayer|LayerBase) –
  • window_size (int) –
  • axis (str) –
  • sources (list[LayerBase]) –
Return type: dict[str,tf.Tensor]
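The windowing over the time axis can be sketched in numpy (zero padding and a centered window are assumptions here, roughly matching padding=”same”):

```python
import numpy as np

def window(x, window_size):
    """(batch, time, dim) -> (batch, time, window_size, dim) via a sliding window."""
    left = window_size // 2
    right = window_size - left - 1
    pad = np.pad(x, ((0, 0), (left, right), (0, 0)))  # zero-pad the time axis
    return np.stack([pad[:, t:t + window_size] for t in range(x.shape[1])], axis=1)

out = window(np.ones((2, 5, 3)), window_size=3)
```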