Basic Layers

Accumulate Mean Layer

class returnn.tf.layers.basic.AccumulateMeanLayer(exp_average, axes='bt', initial_value=None, is_prob_distribution=None, **kwargs)[source]

Accumulates the mean of the input, by default over the batch-dim and time-dim (only in training). It is similar to ReduceLayer.

Parameters:
  • exp_average (float) – momentum in exponential average calculation
  • axes (int|list[str]|str) – the axes to reduce; must contain batch and time.
  • initial_value (float) – how to initialize the variable which accumulates the mean
  • is_prob_distribution (bool) – if provided, better default for initial_value
layer_class = 'accumulate_mean'[source]
classmethod get_out_data_from_opts(axes='bt', **kwargs)[source]
Parameters:axes (str) –
Return type:Data
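As an illustration, the exponential-average update can be sketched in plain Python/numpy (the exact momentum convention here is an assumption, not taken from RETURNN's implementation):

```python
import numpy as np

def accumulate_mean(batches, exp_average, initial_value=0.0):
    """Running exponential average of per-batch means (sketch of the update rule)."""
    acc = initial_value
    for x in batches:
        batch_mean = np.mean(x)  # reduce over batch-dim and time-dim
        acc = exp_average * batch_mean + (1.0 - exp_average) * acc
    return acc

# With a constant input stream, the accumulated value converges to the data mean:
val = accumulate_mean([np.full((2, 3), 5.0)] * 100, exp_average=0.1)
```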

Activation Layer

class returnn.tf.layers.basic.ActivationLayer(activation, **kwargs)[source]

This layer just applies an activation function. See TFUtil.get_activation_function() about supported functions. Also see EvalLayer and CombineLayer for similar layers.

Parameters:activation (str) – e.g. “relu”, “tanh”, etc.
layer_class = 'activation'[source]
classmethod get_out_data_from_opts(activation, **kwargs)[source]
Parameters:activation (str) –
Return type:Data
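A minimal network fragment using this layer (the source layer name “hidden” is illustrative):

```python
# Hypothetical fragment of a RETURNN network config dict.
network = {
    "act": {"class": "activation", "activation": "relu", "from": "hidden"},
}
```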

Combine Layer

class returnn.tf.layers.basic.CombineLayer(kind, sources, activation=None, with_bias=False, eval=None, eval_locals=None, eval_for_output_loss=False, **kwargs)[source]

Applies a binary operation, such as addition, to all sources while accumulating the partial results. In the first step, the binary operation is performed on the first two sources. After the first step, the previous result is always the left-hand operand.

Its basic working is similar to the reduce function used in functional programming. Also see ActivationLayer, or CompareLayer.

Parameters:
  • kind (str) – currently accepted values are average, add, sub, mul, or eval
  • sources (list[LayerBase]) –
  • activation (str|None) – if provided, activation function to apply, e.g. “tanh” or “relu”
  • with_bias (bool) – if True, will add a trainable bias tensor
  • eval (str|callable) – for kind=”eval”, will eval this string or call this function; see _op_kind_eval()
  • eval_locals (dict[str]|None) – locals for eval
  • eval_for_output_loss (bool) – will do the same eval on layer.output_loss
layer_class = 'combine'[source]
classmethod get_out_data_from_opts(n_out=NotSpecified, out_type=None, sources=(), **kwargs)[source]
Parameters:
  • n_out (int|None|NotSpecified) –
  • out_type (dict[str]|None) –
  • sources (list[LayerBase]) –
Return type: Data
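The left-fold over sources can be sketched with functools.reduce (an analogy in plain Python, not RETURNN internals):

```python
from functools import reduce
import operator

sources = [1.0, 2.0, 3.0, 4.0]
total = reduce(operator.add, sources)  # kind="add": ((s1 + s2) + s3) + s4
diff = reduce(operator.sub, sources)   # kind="sub": ((s1 - s2) - s3) - s4
```

In a config, this would correspond to e.g. {"class": "combine", "kind": "add", "from": ["s1", "s2", "s3", "s4"]} (layer names illustrative).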

Compare Layer

class returnn.tf.layers.basic.CompareLayer(kind='equal', value=None, **kwargs)[source]

Compares the tokens of all input sequences element-wise among themselves and/or with a given value. The comparisons are performed in a chain according to the order in which they are listed.

Example:

{"class": "compare", "from": ["i1", "i2"], "value": val, "kind": "less"}

computes i1 < i2 < val, which is true only if the whole chain of comparisons holds. The final result is the logical “and” of all comparisons. Note that value is always the last element to be compared to.

A common example usage is the end layer in a rec subnetwork to specify the stopping criterion, e.g. the last generated token is equal to the end-of-sentence token:

"output": {"class": "rec", "from": [], "unit": {
    ...
    "end": {"class": "compare", "from": "output", "value": end_of_sentence_id}
}, "target": "classes0"}
Parameters:
  • kind (str) – which comparison operation to use, e.g. “equal”, “greater”, “less” or other supported TF comparison ops
  • value (float|int|None) – if specified, will also compare to this
layer_class = 'compare'[source]
classmethod get_out_data_from_opts(n_out=NotSpecified, out_type=None, sources=(), **kwargs)[source]
Parameters:
  • n_out (int|None|NotSpecified) –
  • out_type (dict[str]|None) –
  • sources (list[LayerBase]) –
Return type: Data

Constant Layer

class returnn.tf.layers.basic.ConstantLayer(sources, value=0.0, dtype=None, with_batch_dim=False, **kwargs)[source]

Output is a constant value.

Parameters:
  • sources (list[LayerBase]) –
  • value (int|float|bool) –
  • dtype (str|None) –
  • with_batch_dim (bool) –
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace
  • network (returnn.tf.network.TFNetwork) –
  • get_layer (((str) -> LayerBase)) – function to get or construct another layer
classmethod get_out_data_from_opts(name, value=0.0, dtype=None, with_batch_dim=False, **kwargs)[source]
Parameters:
  • name (str) –
  • value (int|float|bool) –
  • dtype (str|None) –
  • with_batch_dim (bool) –
Return type: Data

Convolution Layer

class returnn.tf.layers.basic.ConvLayer(n_out, filter_size, padding, strides=1, dilation_rate=1, input_expand_dims=0, input_add_feature_dim=False, input_split_feature_dim=None, auto_use_channel_first=False, with_bias=False, activation=None, forward_weights_init='glorot_uniform', bias_init=0.0, **kwargs)[source]

A generic convolution layer which supports 1D, 2D and 3D convolution. Pooling can be done in the separate “pool” layer.

Parameters:
  • n_out (int) – number of outgoing features
  • filter_size (tuple[int]) – (width,), (height,width) or (depth,height,width) for 1D/2D/3D conv. the input data ndim must match, or you can add dimensions via input_expand_dims or input_add_feature_dim. it will automatically swap the batch-dim to the first axis of the input data.
  • padding (str) – “same” or “valid”
  • strides (int|tuple[int]) – strides for the spatial dims, i.e. length of this tuple should be the same as filter_size, or a single int.
  • dilation_rate (int|tuple[int]) – dilation for the spatial dims
  • input_expand_dims (int) – number of dynamic dims to add to the input
  • input_add_feature_dim (bool) – will add a dim at the end and use input-feature-dim == 1, and use the original input feature-dim as a spatial dim.
  • auto_use_channel_first (bool) – convert the input to NCHW or not
  • input_split_feature_dim (None|int) – if set, like input_add_feature_dim it will add a new feature dim which is of value input_split_feature_dim, and the original input feature dim will be divided by input_split_feature_dim, thus it must be a multiple of that value.
  • with_bias (bool) – if True, will add a bias to the output features
  • activation (None|str) – if set, will apply this function at the end
layer_class = 'conv'[source]
recurrent = True[source]
classmethod calc_out_dim(in_dim, filter_size, stride, padding, dilation_rate=1)[source]
Parameters:
  • in_dim (int|tf.Tensor|T) – dimension in some axis
  • filter_size (int) – e.g. 2, for the corresponding axis
  • stride (int) – e.g. 1, for the corresponding axis
  • dilation_rate (int) – e.g. 1
  • padding (str) – “valid” or “same”
Returns: the output dimension
Return type: T

classmethod get_out_data_from_opts(**kwargs)[source]

Via _get_out_type_from_opts().

Return type:Data
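The output size from calc_out_dim() follows standard convolution arithmetic; a plain-Python sketch (assuming the usual TF “valid”/“same” conventions):

```python
import math

def calc_out_dim(in_dim, filter_size, stride, padding, dilation_rate=1):
    """Standard conv output-size arithmetic (sketch, TF conventions assumed)."""
    effective_filter = (filter_size - 1) * dilation_rate + 1
    if padding.lower() == "valid":
        return math.ceil((in_dim - effective_filter + 1) / stride)
    if padding.lower() == "same":
        return math.ceil(in_dim / stride)
    raise ValueError("padding must be 'valid' or 'same'")

out_valid = calc_out_dim(10, filter_size=3, stride=1, padding="valid")  # 8
out_same = calc_out_dim(10, filter_size=3, stride=2, padding="same")    # 5
```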

Copy Layer

class returnn.tf.layers.basic.CopyLayer(extra_deps=(), **kwargs)[source]

This layer does nothing except copy its input. If multiple sources are provided, they are concatenated in the feature-dim.

Parameters:extra_deps (list[LayerBase]) – Just add as an additional dependency, without really using it. This can have an effect though on the search beam, via SelectSearchSourcesLayer. We only have this here for the CopyLayer because the get_out_data_from_opts() must know about it and define the right beam. Also see the option collocate_with, which is different in that it does not add a dependency.
layer_class = 'copy'[source]
get_dep_layers()[source]
Return type:list[LayerBase]
classmethod get_out_data_from_opts(name, sources=(), extra_deps=(), out_type=None, n_out=NotSpecified, **kwargs)[source]
Parameters:
  • name (str) –
  • sources (list[LayerBase]) –
  • extra_deps (list[LayerBase]) –
  • out_type (dict[str]|None) –
  • n_out (int|None|NotSpecified) –
Return type: Data

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace
  • network (returnn.tf.network.TFNetwork) –
  • get_layer (((str) -> LayerBase)) – function to get or construct another layer

Cumulative Sum Layer

class returnn.tf.layers.basic.CumsumLayer(axis='T', additional_left_summand_per_element=None, reverse=False, **kwargs)[source]

Basically wraps tf.cumsum. Also supported inside the RecLayer.

Parameters:
  • axis (str) – see Data.get_axis_from_description()
  • additional_left_summand_per_element (str|int|float|None) – the order matters for tf.string
  • reverse (bool) –
layer_class = 'cumsum'[source]
recurrent = True[source]
classmethod get_out_data_from_opts(name, sources, axis='T', **kwargs)[source]
Parameters:
  • name (str) –
  • sources (list[LayerBase]) –
  • axis (str) –
Return type: Data

classmethod get_rec_initial_extra_outputs(batch_dim, rec_layer, axis='T', sources=(), **kwargs)[source]
Parameters:
  • batch_dim (tf.Tensor) –
  • rec_layer (TFNetworkRecLayer.RecLayer|LayerBase) –
  • axis (str) –
  • sources (list[LayerBase]) –
Return type: dict[str,tf.Tensor]
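In plain numpy terms (the exact semantics of additional_left_summand_per_element are an assumption here):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
out = np.cumsum(x)                  # [1., 3., 6.]
# reverse=True accumulates from the right instead:
out_rev = np.cumsum(x[::-1])[::-1]  # [6., 5., 3.]
# additional_left_summand_per_element=c adds c to each element before summing (assumption):
out_shift = np.cumsum(x + 1.0)      # [2., 5., 9.]
```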

Dot Layer

class returnn.tf.layers.basic.DotLayer(red1=-1, red2=-2, var1=-2, var2=-1, add_var2_if_empty=True, debug=False, **kwargs)[source]

This performs a dot-product of two sources. The underlying matmul expects shapes (shared…, I, J) * (shared…, J, K) -> (shared…, I, K). We say that J is the axis to be reduced, I is the var-dim of source 1, and K is the var-dim of source 2. I, J, K can also be multiple axes from the sources. The var-dims don’t need to exist. All other axes (shared…) are expected to match.

Parameters:
  • red1 (str|int|tuple[str|int]|list[str|int]) – reduce axes of first source
  • red2 (str|int|tuple[str|int]|list[str|int]) – reduce axes of second source
  • var1 (str|int|tuple[str|int]|list[str|int]|None) – var axes of first source
  • var2 (str|int|tuple[str|int]|list[str|int]|None) – var axes of second source
  • add_var2_if_empty (bool) – if var2=None, add dim=1 at the end
  • debug (bool) – will print debug shapes, etc.
layer_class = 'dot'[source]
classmethod get_out_data_from_opts(name, sources, red1=-1, red2=-2, var1=-2, var2=-1, add_var2_if_empty=True, **kwargs)[source]
Parameters:
  • name (str) –
  • sources (list[LayerBase]) –
  • red1 (str|int|tuple[str|int]|list[str|int]) – reduce axes of first source
  • red2 (str|int|tuple[str|int]|list[str|int]) – reduce axes of second source
  • var1 (str|int|tuple[str|int]|list[str|int]|None) – var axes of first source
  • var2 (str|int|tuple[str|int]|list[str|int]|None) – var axes of second source
  • add_var2_if_empty (bool) –
Return type: Data
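The shape contract can be illustrated with plain numpy matmul (this shows only the (shared…, I, J) x (shared…, J, K) contraction, not the layer's axis selection):

```python
import numpy as np

batch, I, J, K = 2, 3, 4, 5
a = np.random.rand(batch, I, J)  # source 1: var-dim I, reduce-dim J
b = np.random.rand(batch, J, K)  # source 2: reduce-dim J, var-dim K
out = np.matmul(a, b)            # shared batch dim matches; J is contracted
```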

Elementwise Product Layer

class returnn.tf.layers.basic.ElemwiseProdLayer(axes, size=None, **kwargs)[source]

Element-wise product in some axes. Microsoft calls this “static attention”, in Deep Conv. NN with Layer-wise Context Expansion and Attention (LACE). The matrix/tensor to be used for the product is given as a trainable parameter. See also LinearLayer.

Parameters:
  • axes (str|list[str]) – e.g. “spatial”, but all those axes must be of fixed dimension
  • size (tuple[int]) – for double-checking, you can explicitly provide the size
layer_class = 'elemwise_prod'[source]
classmethod get_out_data_from_opts(name, sources, **kwargs)[source]
Parameters:
  • name (str) –
  • sources (list[LayerBase]) –
Return type: Data

Gating Layer

class returnn.tf.layers.basic.GatingLayer(activation, gate_activation='sigmoid', **kwargs)[source]

Splits the output into two equal parts, applies the gate_activation (sigmoid by default) on the one part, some other activation (e.g. tanh) on the other part and then element-wise multiplies them. Thus, the output dimension is input-dimension / 2.

layer_class = 'gating'[source]
classmethod get_out_data_from_opts(name, sources, n_out=NotSpecified, **kwargs)[source]
Parameters:
  • name (str) –
  • sources (list[LayerBase]) –
  • n_out (int|None|NotSpecified) –
Return type: Data
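In plain numpy, with activation=”tanh”, the computation can be sketched as follows (which half receives the gate activation is an assumption here):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gating(x, activation=np.tanh):
    # split the feature dim (last axis) into two equal halves
    a, b = np.split(x, 2, axis=-1)
    return activation(a) * sigmoid(b)

out = gating(np.zeros((2, 4, 8)))  # feature dim 8 -> output feature dim 4
```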

Linear Layer

class returnn.tf.layers.basic.LinearLayer(activation, with_bias=True, grad_filter=None, forward_weights_init='glorot_uniform', bias_init=0.0, use_transposed_weights=False, **kwargs)[source]

Linear/forward/fully-connected/1x1-conv layer. Does a linear transformation on the feature-dimension of the input with an optional bias term and an optional activation function. See also DotLayer, ElemwiseProdLayer, WeightedSumLayer.

Parameters:
  • activation (str|None) – e.g. “relu”, or None
  • with_bias (bool) –
  • grad_filter (float|None) – if grad norm is higher than this threshold (before activation), the grad is removed
  • forward_weights_init (str) – see TFUtil.get_initializer()
  • recurrent_weights_init (str) – see TFUtil.get_initializer()
  • bias_init (str|float) – see TFUtil.get_initializer()
  • use_transposed_weights (bool) – If True, define the weight matrix with transposed dimensions (n_out, n_in).
layer_class = 'linear'[source]
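A typical usage in a network config (layer name and dimensions are illustrative):

```python
# Hypothetical fragment of a RETURNN network config dict.
network = {
    "ff": {"class": "linear", "activation": "relu", "with_bias": True,
           "n_out": 512, "from": "data"},
}
```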

Pooling Layer

class returnn.tf.layers.basic.PoolLayer(mode, pool_size, padding='VALID', dilation_rate=1, strides=None, use_channel_first=False, **kwargs)[source]

A generic N-D pooling layer. This would usually be done after a convolution for down-sampling.

Parameters:
  • mode (str) – “max” or “avg”
  • pool_size (tuple[int]) – shape of the window of each reduce
  • padding (str) – “valid” or “same”
  • dilation_rate (tuple[int]|int) –
  • strides (tuple[int]|int|None) – in contrast to tf.nn.pool, the default (if it is None) will be set to pool_size
  • use_channel_first (bool) – if set, will transform input to NCHW format
layer_class = 'pool'[source]
recurrent = True[source]
classmethod get_out_data_from_opts(name, pool_size, strides=None, dilation_rate=1, sources=(), padding='VALID', use_channel_first=False, **kwargs)[source]
Parameters:
  • name (str) –
  • pool_size (tuple[int]|list[int]) –
  • strides (tuple[int]|list[int]|int) –
  • dilation_rate (int|tuple[int]|list[int]) –
  • sources (list[LayerBase]) –
  • padding (str) –
  • use_channel_first (bool) –
Return type: Data

Reduce Layer

class returnn.tf.layers.basic.ReduceLayer(mode, axes=None, axis=None, keep_dims=False, enforce_batch_dim_axis=None, use_time_mask=None, **kwargs)[source]

This reduces some axis by using “sum” or “max”. It’s basically a wrapper around tf.reduce_sum or tf.reduce_max.

Parameters:
  • mode (str) – “sum”, “max”, “argmin”, “min”, “argmax”, “mean” or “logsumexp”
  • axes (int|list[int]|str) – one axis or multiple axes to reduce. It accepts the special tokens “B”|”batch”, “spatial”, “spatial_except_time”, or “F”|”feature”, and it is strongly recommended to use some of these symbolic names. See Data.get_axes_from_description().
  • axis (int|list[int]|str) – for compatibility, can be used instead of axes
  • keep_dims (bool) – if dimensions should be kept (will be 1)
  • enforce_batch_dim_axis (int) – will swap the batch-dim-axis of the input with the given axis. e.g. 0: will convert the input into batch-major format if not already like that. Note that this is still not enough in some cases, e.g. when the other axes are also not as expected. The strong recommendation is to use a symbolic axis description.
  • use_time_mask (bool) – if we reduce over the time-dim axis, use the seq len info. By default, in that case, it will be True.
layer_class = 'reduce'[source]
classmethod need_enforce_batch_dim_axis(axes)[source]
Parameters:axes (int|list[int]|str) –
Returns:whether any integer is in axes, in which case we need a fixed dimension layout
Return type:bool
classmethod get_axes(axis, input_data)[source]
Parameters:
  • axis – see self.__init__()
  • input_data (Data) –
Returns: list of axes
Return type: list[int]

classmethod get_out_data_from_opts(name, sources, mode='', axes=None, axis=None, keep_dims=False, enforce_batch_dim_axis=None, **kwargs)[source]
Parameters:
  • name (str) –
  • sources (list[LayerBase]) –
  • mode (str) – (default here “” because other code uses this function)
  • axes (str|list[str]|None) –
  • axis (str|None) –
  • keep_dims (bool) –
  • enforce_batch_dim_axis (int|None) –
Return type: Data
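The effect of use_time_mask for mode=”mean” can be sketched in numpy (hypothetical shapes; in RETURNN the seq-len info comes from Data):

```python
import numpy as np

x = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 0.0]])  # (batch=2, time=3); second seq has one padded frame
seq_lens = np.array([3, 2])
mask = np.arange(x.shape[1])[None, :] < seq_lens[:, None]  # (2, 3) bool
# masked mean over the time axis, ignoring padded frames:
mean = (x * mask).sum(axis=1) / seq_lens
```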

Reduce-Out Layer

class returnn.tf.layers.basic.ReduceOutLayer(mode, num_pieces, **kwargs)[source]

Combination of SplitDimsLayer applied to the feature dim and ReduceLayer applied to the resulting feature dim. This can e.g. be used to do maxout.

Parameters:
  • mode (str) – “sum” or “max” or “mean”
  • num_pieces (int) – how many elements to reduce. The output dimension will be input.dim // num_pieces.
layer_class = 'reduce_out'[source]
classmethod get_out_data_from_opts(num_pieces, sources, name, **kwargs)[source]
Parameters:
  • num_pieces (int) –
  • sources (list[LayerBase]) –
  • name (str) –
Return type: Data
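With mode=”max” this is maxout; a numpy sketch (assuming the pieces are contiguous groups within the feature dim):

```python
import numpy as np

def reduce_out_max(x, num_pieces):
    """Maxout: group the feature dim into chunks of num_pieces, take the max per chunk."""
    assert x.shape[-1] % num_pieces == 0
    grouped = x.reshape(x.shape[:-1] + (x.shape[-1] // num_pieces, num_pieces))
    return grouped.max(axis=-1)

out = reduce_out_max(np.arange(12.0).reshape(2, 6), num_pieces=2)  # (2, 6) -> (2, 3)
```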

Switch Layer

class returnn.tf.layers.basic.SwitchLayer(condition, true_from, false_from, **kwargs)[source]

Wrapper around tf.where() (or more generically TFUtil.where_bc()), or statically choose a single source if the condition is a callable (…)->bool. (tf.cond is not useful here, as the sources would have been already constructed and computed.) See also CondLayer.

Parameters:
  • condition (LayerBase|bool) – if callable, expected to be (…)->bool, and called in transform_config_dict
  • true_from (LayerBase|None) –
  • false_from (LayerBase|None) –
layer_class = 'switch'[source]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace
  • network (returnn.tf.network.TFNetwork) –
  • get_layer (((str) -> LayerBase)) – function to get or construct another layer
classmethod get_out_data_from_opts(name, condition, true_from, false_from, **kwargs)[source]
Parameters:
  • name (str) –
  • condition (LayerBase|bool) –
  • true_from (LayerBase|None) –
  • false_from (LayerBase|None) –
Return type: Data

get_dep_layers()[source]
Return type:list[LayerBase]
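The broadcasting where-behavior can be illustrated with numpy (np.where broadcasts similarly to TFUtil.where_bc(); shapes are illustrative):

```python
import numpy as np

cond = np.array([[True], [False]])  # (2, 1), broadcasts against (2, 3)
a = np.ones((2, 3))
b = np.zeros((2, 3))
out = np.where(cond, a, b)          # row 0 taken from a, row 1 from b
```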

Variable Layer

class returnn.tf.layers.basic.VariableLayer(shape, dtype='float32', add_batch_axis=True, add_time_axis=False, trainable=True, init=0, **kwargs)[source]

Represents a variable, which can optionally be trainable. Can add a batch and/or time dimension if wanted; see the parameter defaults.

Parameters:
  • shape (tuple[int]|list[int]) –
  • dtype (str) –
  • add_batch_axis (bool) –
  • add_time_axis (bool) –
  • trainable (bool) –
  • init (str|float|int) – see TFUtil.get_initializer()
layer_class = 'variable'[source]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace
  • network (returnn.tf.network.TFNetwork) –
  • get_layer (((str) -> LayerBase)) – function to get or construct another layer
classmethod get_out_data_from_opts(name, shape, dtype='float32', add_batch_axis=True, add_time_axis=False, **kwargs)[source]
Parameters:
  • name (str) –
  • shape (tuple[int]|list[int]) –
  • dtype (str) –
  • add_batch_axis (bool) –
  • add_time_axis (bool) –
Return type: Data

Weighted Sum Layer

class returnn.tf.layers.basic.WeightedSumLayer(axes, padding=None, size=None, keep_dims=None, **kwargs)[source]

Calculates a weighted sum, either over a complete axis of fixed dimension, or over some window. Can also do that for multiple axes. The weights are a trainable parameter matrix. Similar would be to use ElemwiseProdLayer and ReduceLayer, or just a DotLayer with a VariableLayer. See also LinearLayer.

Parameters:
  • axes (str|list[str]) – the axes to do the weighted-sum over
  • padding (str) – “valid” or “same”, in case of keep_dims=True
  • size (None|tuple[int]) – the kernel-size. if left away, the axes must be of fixed dimension, and we will use keep_dims=False, padding=”valid” by default. Otherwise, if given, you must also provide padding and keep_dims=True by default.
  • keep_dims (bool) – if False, the axes will be squeezed away. see also size.
layer_class = 'weighted_sum'[source]
classmethod get_out_data_from_opts(name, sources, axes, padding=None, size=None, keep_dims=None, **kwargs)[source]
Parameters:
  • name (str) –
  • sources (list[LayerBase]) –
  • axes (str|list[str]) –
  • padding (str|None) –
  • size (None|tuple[int]) –
  • keep_dims (bool|None) –
Return type: Data

Window Layer

class returnn.tf.layers.basic.WindowLayer(window_size, window_left=None, window_right=None, axis='T', padding='same', **kwargs)[source]

Adds a window dimension. By default, uses the time axis and goes over it with a sliding window. The new axis for the window is created right after the time axis. The output is always in batch-major format. E.g. if the input is (batch, time, dim), the output is (batch, time, window_size, dim). If you want to merge (window_size, dim) together into (window_size * dim,), you can use the MergeDimsLayer, e.g. {“class”: “merge_dims”, “axes”: “except_time”}.

This is not meant to take out a single window from the time dimension; for that, see SliceLayer or SliceNdLayer.

Parameters:
  • window_size (int) –
  • window_left (int|None) –
  • window_right (int|None) –
  • axis (str|int) – see Data.get_axis_from_description()
  • padding (str) – “same” or “valid”
  • kwargs
layer_class = 'window'[source]
recurrent = True[source]
classmethod get_out_data_from_opts(name, window_size, axis='T', sources=(), **kwargs)[source]
Parameters:
  • name (str) –
  • sources (list[LayerBase]) –
  • window_size (int) –
  • axis (str) –
Return type: Data

classmethod get_rec_initial_extra_outputs(batch_dim, rec_layer, window_size, axis='T', sources=(), **kwargs)[source]
Parameters:
  • batch_dim (tf.Tensor) –
  • rec_layer (TFNetworkRecLayer.RecLayer|LayerBase) –
  • window_size (int) –
  • axis (str) –
  • sources (list[LayerBase]) –
Return type: dict[str,tf.Tensor]
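The windowing over the time axis can be sketched in numpy (zero padding and a centered window are assumptions here, roughly matching padding=”same”):

```python
import numpy as np

def window(x, window_size):
    """(batch, time, dim) -> (batch, time, window_size, dim) via a sliding window."""
    left = window_size // 2
    right = window_size - left - 1
    pad = np.pad(x, ((0, 0), (left, right), (0, 0)))  # zero-pad the time axis
    return np.stack([pad[:, t:t + window_size] for t in range(x.shape[1])], axis=1)

out = window(np.ones((2, 5, 3)), window_size=3)
```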