returnn.tf.layers.basic¶
Many canonical basic layers.
- class returnn.tf.layers.basic.SourceLayer(network, data_key=None, sources=(), **kwargs)[source]¶
This gives access to some entry from network.extern_data (ExternData).
- Parameters:
network (returnn.tf.network.TFNetwork)
data_key (str|None)
sources (tuple)
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str]) – will modify inplace
network (returnn.tf.network.TFNetwork)
get_layer (((str) -> LayerBase)) – function to get or construct another layer
- classmethod get_out_data_from_opts(network, data_key=None, **kwargs)[source]¶
- Parameters:
network (returnn.tf.network.TFNetwork)
data_key (str|None)
- Return type:
Data
- returnn.tf.layers.basic.concat_sources(src_layers, out_dim=None, allow_broadcast_all_sources=<class 'returnn.util.basic.NotSpecified'>)[source]¶
- Parameters:
src_layers (list[LayerBase])
out_dim (Dim|None)
allow_broadcast_all_sources (bool|NotSpecified)
- Returns:
data with placeholders set
- Return type:
Data
- returnn.tf.layers.basic.get_concat_sources_data_template(src_layers, out_dim=None, allow_broadcast_all_sources=<class 'returnn.util.basic.NotSpecified'>, name=None)[source]¶
This just creates a template Data instance, without creating any real TF tensors. concat_sources() (and related) are the equivalent functions which would create a Data together with the tensor.
- Parameters:
src_layers (Sequence[LayerBase])
out_dim (Dim|None)
allow_broadcast_all_sources (bool|NotSpecified)
name (str|None) – name of the Data
- Returns:
data with no placeholders set. it is always a copy or new instance, so safe to manipulate
- Return type:
Data
- returnn.tf.layers.basic.concat_sources_with_opt_dropout(src_layers, out_dim=None, dropout=0, dropout_axis=None, dropout_noise_shape=None, dropout_on_forward=False, allow_broadcast_all_sources=<class 'returnn.util.basic.NotSpecified'>)[source]¶
Concatenates in the feature dim (see concat_sources()), and then optionally applies dropout.
- Parameters:
src_layers (list[LayerBase])
out_dim (Dim|None)
dropout (float) – dropout rate that will be applied if train_flag is set or dropout_on_forward is enabled
dropout_noise_shape (tuple|list|dict[Dim|str|list[Dim|str]|tuple[Dim|str],int|str|None]|None) – provide 1 for broadcasting or None otherwise for each axis. The default “None” will broadcast across all dynamic axes including the batch axis. Use {“*”: None} to disable broadcasting for all axes.
dropout_on_forward (bool) – apply dropout also during inference
allow_broadcast_all_sources (bool|NotSpecified)
- Returns:
data with placeholders set
- Return type:
Data
- class returnn.tf.layers.basic.CopyLayer(in_dim=None, out_dim=None, extra_deps=(), **kwargs)[source]¶
This layer does nothing, it copies its input. This is not even a tf.identity. It refers to the same TF tensor. If multiple sources are provided, they are concatenated in the feature-dim.
- Parameters:
in_dim (Dim|None) – just for checking. but also, if this is provided, it will set the feature_dim to this.
out_dim (Dim|None) – alternative to in_dim. see in_dim doc.
extra_deps (list[LayerBase]) – Just add as an additional dependency, without really using it. This can have an effect though on the search beam, via SelectSearchSourcesLayer. We only have this here for the CopyLayer because the get_out_data_from_opts() must know about it and define the right beam. Also see the option collocate_with, which is different in that it does not add a dependency. Note that this will not be real TF control dependencies, but it simply sets the dependency on the layer. If you want to have a real TF control dependency, use IdentityLayer.
- classmethod get_out_data_from_opts(name, sources=(), extra_deps=(), out_type=None, in_dim=None, out_dim=None, n_out=<class 'returnn.util.basic.NotSpecified'>, out_shape=None, **kwargs)[source]¶
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str]) – will modify inplace
network (returnn.tf.network.TFNetwork)
get_layer (((str) -> LayerBase)) – function to get or construct another layer
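A usage sketch for CopyLayer in a network dict (layer and source names here are made up for illustration); giving multiple sources concatenates them along the feature dim:
"encoder": {"class": "copy", "from": ["lstm_fwd", "lstm_bwd"]}  # feature dims are concatenated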
- class returnn.tf.layers.basic.IdentityLayer(sources: List[LayerBase], control_dependencies: Sequence[LayerBase] | None = None, **kwargs)[source]¶
Wraps tf.identity with potential control dependencies.
The difference to CopyLayer is that this creates a new TF op (tf.identity), which allows for potential control dependencies. This is the whole purpose of this layer.
Usually the arguments, when specified in the network dict, go through transform_config_dict() before they are passed here. See TFNetwork.construct_from_dict().
- Parameters:
name (str)
network (returnn.tf.network.TFNetwork)
output (Data) – Set a specific output instead of using get_out_data_from_opts()
n_out (NotSpecified|None|int) – output dim
out_dim (returnn.tensor.Dim|None) – output feature dim tag
out_type (dict[str]) – kwargs for Data class. more explicit than n_out.
out_shape (set[returnn.tensor.Dim|returnn.tf.util.data._MarkedDim]|tuple|list|None) – verifies the output shape (dim tags). See Data.verify_out_shape().
sources (list[LayerBase]) – via self.transform_config_dict()
in_dim (returnn.tensor.Dim|None) – input feature dim tag
target (str|list[str]|None) – if some loss is set, this is the target data-key, i.e. network.extern_data.get_data(target). alternatively, this also can be a layer name.
_target_layers (dict[str,LayerBase]|None) – if target.startswith(“layer:”), then this is target -> layer
size_target (str|None) – like target but this is only used to set our output size in case of training
loss (Loss|None) – via transform_config_dict(). Every layer can have one loss (of type Loss), or no loss. In the net dict, it is specified as a string. In TFNetwork, all losses from all layers will be collected. That is what TFUpdater.Updater will use for training.
reuse_params (ReuseParams|None) – if given, will opt reuse the params. See self.var_creation_scope(). See also the name_scope option as an alternative.
name_scope (str|None) – If set, uses this custom (relative) name scope. If it starts with a "/", it will be the absolute name scope. It should not end with a "/". It can be empty, in which case it will not consume a new name scope. This can also be used for parameter sharing. The default is the layer name in most cases, but this logic is in get_absolute_name_scope_prefix() and TFNetwork.layer_creation_scope().
param_device (str|None) – e.g. "CPU", etc. any valid name for tf.device. see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/util/device_name_utils.h
L2 (float|None) – for constraints
darc1 (float|None) – for constraints. see Generalization in Deep Learning, https://arxiv.org/abs/1710.05468
spatial_smoothing (float|None) – see returnn.tf.util.basic.spatial_smoothing_energy()
param_variational_noise (float|None) – adds variational noise to the params during training
param_dropout (float|None) – dropout on params (weight dropout) during training
param_dropout_min_ndim (int|None) – if param dropout is enabled, only use if for params whose ndim >= this. E.g. it might make sense to disable it for bias params or scalars, so set param_dropout_min_ndim=2.
updater_opts (dict[str]|None) – accepts similar opts as TFUpdater, e.g. “optimizer”, “learning_rate”, …
is_output_layer (bool|None) – triggers the construction of this layer in the root net. Inside a RecLayer, it triggers the explicit accumulation of all frames. Also see the need_last option.
only_on_eval (bool) – if True, this layer will only be calculated in eval
only_on_search (bool) – if True, this layer will only be calculated when search is done
copy_output_loss_from_source_idx (int|None) – if set, will copy output_loss from this source
batch_norm (bool|dict) – see self.batch_norm()
initial_output (str|float) – used for recurrent layer, see self.get_rec_initial_output()
state – explicitly defines the rec state. initial_state would define the initial state (in the first frame)
need_last (bool) – Inside RecLayer, make sure that we can access the last frame. Similar to is_output_layer, but this is specifically about the last frame, i.e. it does not trigger accumulation.
rec_previous_layer (LayerBase|None) – via the recurrent layer, layer (template) which represents the past of us. You would not explicitly set this in a config. This is set automatically, internally, via RecLayer.
encapsulate (bool) – mostly relevant for SubnetworkLayer and similar: If True, all sub layers will be created and covered in functions like get_rec_initial_extra_outputs(), and the logic in cls_get_sub_network() will not be used. If False, the logic in cls_get_sub_network() will be used.
collocate_with (list[str]|None) – in the rec layer, collocate with the specified other layers
trainable (bool) – whether the parameters of this layer will be trained. Default is True. However, if this is inside a subnetwork, all the parent layers must be set to trainable, otherwise the parameters will not be trainable.
custom_param_importer (str|callable|None) – used by set_param_values_by_dict()
register_as_extern_data (str|None) – registers output in network.extern_data
control_dependencies_on_output (None|((LayerBase)->list[tf.Operation])) – This is mostly to perform some checks after the layer output has been computed, before the layer output is used anywhere else. There is also the IdentityLayer with the option control_dependencies.
debug_print_layer_output (None|bool|dict[str]) – same as global config option but per layer
_name (str) – just for internal construction, should be the same as name
_network (returnn.tf.network.TFNetwork) – just for internal construction, should be the same as network
_src_common_search_choices (None|SearchChoices) – set via SearchChoices.translate_to_common_search_beam()
- class returnn.tf.layers.basic.ConcatLayer(sources, allow_broadcast=False, out_dim=None, **kwargs)[source]¶
Concatenates the inputs in specified axes. This generalizes CopyLayer, which concatenates in the feature dim.
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str]) – will modify inplace
network (returnn.tf.network.TFNetwork)
get_layer (((str) -> LayerBase)) – function to get or construct another layer
- class returnn.tf.layers.basic.DropoutLayer(in_dim=None, out_dim=None, extra_deps=(), **kwargs)[source]¶
Just the same as CopyLayer, because that one already supports dropout.
- Parameters:
in_dim (Dim|None) – just for checking. but also, if this is provided, it will set the feature_dim to this.
out_dim (Dim|None) – alternative to in_dim. see in_dim doc.
extra_deps (list[LayerBase]) – Just add as an additional dependency, without really using it. This can have an effect though on the search beam, via SelectSearchSourcesLayer. We only have this here for the CopyLayer because the get_out_data_from_opts() must know about it and define the right beam. Also see the option collocate_with, which is different in that it does not add a dependency. Note that this will not be real TF control dependencies, but it simply sets the dependency on the layer. If you want to have a real TF control dependency, use IdentityLayer.
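A minimal DropoutLayer sketch (hypothetical layer names); the dropout is only active when the train flag is set, unless dropout_on_forward is enabled:
"enc_drop": {"class": "dropout", "dropout": 0.1, "from": "encoder"}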
- class returnn.tf.layers.basic.ScaledGradientLayer(scale, shift=None, scale_shift_by_sum_over_axis=None, clip_max_axis=None, **kwargs)[source]¶
Just tf.identity() in the forward pass. Scales the gradient by some factor in backprop. Can be used as gradient reversal layer (with negative factor). Uses returnn.tf.util.basic.scaled_gradient(), or tf.stop_gradient().
- Parameters:
scale (float|LayerBase) – if 0. and no shift, will use tf.stop_gradient
shift (float|LayerBase|None)
scale_shift_by_sum_over_axis (Dim|str|None) – if given, calculates the sum over this axis (absolute values) and multiplies the shift value by this sum.
clip_max_axis (Dim|str|None) – if given, clips the gradient to the max value in this axis before the transformation, for all values in the axis
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str]) – will modify inplace
network (returnn.tf.network.TFNetwork)
get_layer (((str) -> LayerBase)) – function to get or construct another layer
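A hedged sketch of gradient reversal with ScaledGradientLayer, assuming the net-dict class name "scaled_grad" (hypothetical layer names):
"grad_reverse": {"class": "scaled_grad", "scale": -1.0, "from": "features"}  # identity in forward, negated gradient in backprop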
- class returnn.tf.layers.basic.SelectSearchSourcesLayer(search_choices_layer, sources, **kwargs)[source]¶
Selects the corresponding search beams from the source, given current search choices (determined by a layer). Like InternalLayer, only for internal purpose at the moment.
- classmethod select_if_needed(layer, search_choices)[source]¶
- Parameters:
layer (LayerBase)
search_choices (SearchChoices|None)
- Return type:
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer
- class returnn.tf.layers.basic.ActivationLayer(activation, opts=None, **kwargs)[source]¶
This layer just applies an activation function. See returnn.tf.util.basic.get_activation_function() about supported functions. Also see EvalLayer and CombineLayer for similar layers.
- Parameters:
activation (str) – e.g. “relu”, “tanh”, etc
opts (dict[str]|None) – for activation function, e.g. eps for safe_log
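A minimal ActivationLayer sketch (hypothetical layer names):
"act": {"class": "activation", "activation": "tanh", "from": "ff_in"}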
- class returnn.tf.layers.basic.BatchNormLayer(in_dim=None, use_shift=<class 'returnn.util.basic.NotSpecified'>, use_std=<class 'returnn.util.basic.NotSpecified'>, use_sample=<class 'returnn.util.basic.NotSpecified'>, force_sample=<class 'returnn.util.basic.NotSpecified'>, momentum=<class 'returnn.util.basic.NotSpecified'>, epsilon=<class 'returnn.util.basic.NotSpecified'>, update_sample_only_in_training=<class 'returnn.util.basic.NotSpecified'>, delay_sample_update=<class 'returnn.util.basic.NotSpecified'>, param_version=<class 'returnn.util.basic.NotSpecified'>, gamma_init=<class 'returnn.util.basic.NotSpecified'>, beta_init=<class 'returnn.util.basic.NotSpecified'>, masked_time=<class 'returnn.util.basic.NotSpecified'>, **kwargs)[source]¶
Implements batch-normalization (https://arxiv.org/abs/1502.03167) as a separate layer.
Also see NormLayer.
- Parameters:
in_dim (returnn.tensor.Dim|None)
use_shift (bool)
use_std (bool)
use_sample (float) – defaults to 0.0 which is used in training
force_sample (bool) – even in eval, use the use_sample factor
momentum (float) – for the running average of sample_mean and sample_std
update_sample_only_in_training (bool)
delay_sample_update (bool)
param_version (int) – 0 or 1 or 2
epsilon (float)
gamma_init (str|float) – see returnn.tf.util.basic.get_initializer(), for the scale
beta_init (str|float) – see returnn.tf.util.basic.get_initializer(), for the mean
masked_time (bool) – flatten and mask input tensor
The default settings for these variables are set in the function batch_norm() of LayerBase. If you do not want to change them, you can leave them undefined here. With our default settings:
In training: use_sample=0, i.e. not using running average, using current batch mean/var.
Not in training (e.g. eval): use_sample=1, i.e. using running average, not using current batch mean/var.
The running average includes the statistics of the current batch.
The running average is also updated when not training.
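A minimal BatchNormLayer sketch (hypothetical layer names), relying on the defaults from LayerBase.batch_norm():
"conv1_bn": {"class": "batch_norm", "from": "conv1", "masked_time": True}  # mask padded frames when computing statistics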
- class returnn.tf.layers.basic.LayerNormLayer(in_dim=None, out_dim=None, epsilon=1e-06, **kwargs)[source]¶
Applies layer-normalization.
Note that we just normalize over the feature-dim axis here. This is consistent with the default behavior of tf.keras.layers.LayerNormalization and also how it is commonly used in many models, including Transformer.
However, there are cases where it would be common to normalize over all axes except batch-dim, or all axes except batch and time. For a more generic variant, see NormLayer.
- Parameters:
- class returnn.tf.layers.basic.NormLayer(axis=<class 'returnn.util.basic.NotSpecified'>, axes=<class 'returnn.util.basic.NotSpecified'>, param_shape=<class 'returnn.util.basic.NotSpecified'>, scale=True, bias=True, epsilon=1e-06, **kwargs)[source]¶
Normalize over specified axes, e.g. time and/or feature axis.
Note: For calculating a norm, see MathNormLayer instead.
In case of just feature (axes="F"), this corresponds to layer normalization (see LayerNormLayer). In case of time and feature (axes="TF") for a 3D input, or more generally all except batch (axes="except_batch"), this corresponds to group normalization with G=1, or non-standard layer normalization. (The definition of layer-normalization is not clear on what axes should be normalized over. In many other frameworks, the default axis is just the last axis, which is usually the feature axis. However, in certain implementations and models, it is also common to normalize over all axes except batch.)
The statistics are calculated just on the input. There are no running statistics (in contrast to batch normalization, see BatchNormLayer).
For some discussion on the definition of layer-norm vs group-norm, also see here and here.
- Parameters:
axis (Dim|str|list[Dim|str]) – axis or axes over which the mean and variance are computed, e.g. “F” or “TF”
axes (Dim|str|list[Dim|str]) – axis or axes over which the mean and variance are computed, e.g. “F” or “TF”
param_shape (Dim|str|list[Dim|str]|tuple[Dim|str]) – shape of the scale and bias parameters. You can also refer to (static) axes of the input, such as the feature-dim. This is also the default, i.e. a param-shape of [F], independent of the axes to normalize over.
scale (bool) – add trainable scale parameters
bias (bool) – add trainable bias parameters
epsilon (float) – epsilon for numerical stability
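For illustration (hypothetical layer names), normalizing only over the feature axis gives layer norm, while "TF" normalizes over time and feature:
"x_ln": {"class": "norm", "axes": "F", "from": "x"}   # layer normalization
"x_gn": {"class": "norm", "axes": "TF", "from": "x"}  # group norm with G=1 for a 3D input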
- class returnn.tf.layers.basic.MathNormLayer(p, axis=<class 'returnn.util.basic.NotSpecified'>, axes=<class 'returnn.util.basic.NotSpecified'>, keep_dims=False, **kwargs)[source]¶
Calculates sum(abs(x) ** p) ** (1./p).
- Parameters:
- class returnn.tf.layers.basic.SliceLayer(axis, slice_start=None, slice_end=None, slice_step=None, out_dim=None, **kwargs)[source]¶
Slicing on the input, i.e. x[start:end:step] in some axis. See also SliceNdLayer, for variable start. See also GatherLayer, for one single position.
Note that __getitem__ on a TF tensor (or also Numpy ND array) is more generic, and supports slices in multiple axes, as well as adding new dimensions, etc. It even allows to get boolean values, and then applies a boolean mask. See TF _slice_helper (== tf.Tensor.__getitem__) for a generic implementation, which calls tf.strided_slice. If we ever need such more generic support, we might consider adding a new layer, like GenericSliceLayer, which gets a splice_spec, just like _slice_helper (argument to __getitem__). But any such slice can already be constructed with multiple individual layers, which perform individual slices (per axis).
We just support slicing in a single axis here, with optional striding (slice_step).
- Parameters:
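A SliceLayer usage sketch (hypothetical layer names), dropping the first time frame:
"shifted": {"class": "slice", "axis": "T", "slice_start": 1, "from": "x"}  # like x[:, 1:]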
- class returnn.tf.layers.basic.SliceNdLayer(size, start=None, min_size=None, axis='T', out_spatial_dim=None, **kwargs)[source]¶
This takes out a slice-range from the time axis, e.g. x[start:start + size]. If the input is of shape (B,T,F) and start is of shape (B,), then the output will be of shape (B,size,F). If the input is of shape (B,T,F) and start is of shape (B,T), then the output will be of shape (B,T,size,F). This layer allows a different start slice point for each batch, in contrast to SliceLayer, and the start is variable. See also GatherNdLayer. PrefixInTimeLayer can recover the original shape (by zero-padding).
- Parameters:
start (int|LayerBase|None) – (B,…)
size (int|LayerBase|Dim|None) – We assume that this is >=0. If this might not be the case, use min_size=0. If None, it uses the max possible size, and it becomes a dynamic axis.
min_size (int|None) – if size is None, but we want to have a min-size
axis (Dim|str)
out_spatial_dim (Dim|None)
- classmethod get_out_data_from_opts(name, sources=(), start=None, size=None, axis='T', out_spatial_dim=None, **kwargs)[source]¶
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer
- class returnn.tf.layers.basic.GatherLayer(position: LayerBase | int, axis: Dim | str, clip_to_valid: bool = False, **kwargs)[source]¶
Gathers slices on a specified axis from the input layer using indices from a position layer. If the input is a layer of shape [B,D,F1], and position of shape [B,F2], this will yield output of shape [B,F2,F1] where output[b,f2,f1] = input[b,position[b,f2],f1] (if D is the axis to gather from). In general, all shared axes of the input and the positions will be considered as batch-axes.
The position argument can also be an int. In this case, this simply gives input[position] on the specified axis.
It's basically a wrapper around tf.gather. It provides the same functionality as the deprecated GatherNdLayer, but is more generic. See also GatherNdLayer.
- Parameters:
position – indices used to select the slices of the input from. If another layer, must be of type int32 or int64. Can also specify a constant int.
axis – the axis which we gather the indices into
clip_to_valid – if True, the indices will be clipped to the valid range of the input, also taking seq lengths into account
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer
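A GatherLayer usage sketch (hypothetical layer names), where "positions" is an int32 layer with indices into the time axis of "encoder":
"gathered": {"class": "gather", "from": "encoder", "position": "positions", "axis": "T"}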
- class returnn.tf.layers.basic.GatherNdLayer(position, **kwargs)[source]¶
Warning: This layer is deprecated, use the more general GatherLayer instead. GatherLayer should be equivalent, but is more general (supports multiple batch dimensions, can specify gather axis) and its name is less misleading.
This takes out a position from some axis, e.g. x[pos]. This layer allows a different position for each batch. It's basically a wrapper around tf.gather (the name of this layer is misleading). See also GatherLayer instead, which will replace this layer in the future. See also SliceNdLayer. See also ScatterNdLayer, which is the inverse operation.
- Parameters:
position (LayerBase) – indices into first axis (excluding batch) of the input
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer
- class returnn.tf.layers.basic.ScatterNdLayer(position, position_axis, output_dim_via_time_from=None, out_spatial_dim=None, filter_invalid_indices=False, **kwargs)[source]¶
The inverse of GatherNdLayer. Mostly a wrapper for tf.scatter_nd.
Note that "nd" is maybe a bit misleading. While we operate on N-D tensors, the indices (position) are into a single new dimension.
The input to the layer are the updates, the indices are via the position argument. The indices are into the newly constructed output dimension. The output shape is constructed via the common shape of the input, the position, and the unique common axis (if not unique, we would need to introduce an option to specify it) is replaced by the given output dimension (currently via output_dim_via_time_from).
Examples:
position (indices): (B,eTs)
input (updates): (eTs,D) or (B,eTs,D) -> expanded to (B,eTs,D)
output shape: (B,eT,D)
position (indices): (B,dT,eTs)
input (updates): (eTs,D) -> expanded to (B,dT,eTs,D)
output shape: (B,dT,eT,D)
position (indices): (dT,eTs)
input (updates): (eTs,D) -> expanded to (dT,eTs,D)
output shape: (dT,eTs,D)
position (indices): (dT,eTs)
input (updates): (B,eTs,D) -> expanded to (dT,eTs,B,D)
output shape: (dT,eT,B,D)
In all these examples, output_dim_via_time_from is (B,eT,F), and eTs gets replaced by eT.
- Parameters:
position (LayerBase) – indices into first axis (excluding batch) of the output
position_axis (Dim|str) – axis in position to replace by the output-dim
output_dim_via_time_from (LayerBase|None) – use the time-dim from this layer as the output-dim
out_spatial_dim (Dim|None)
filter_invalid_indices (bool) – allow for indices <0 or >= output_dim, which will be discarded in the output
- classmethod get_out_data_from_opts(name, sources, position, position_axis, output_dim_via_time_from=None, out_spatial_dim=None, **kwargs)[source]¶
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer ((str)->LayerBase)
- class returnn.tf.layers.basic.LinearLayer(activation=None, with_bias=True, grad_filter=None, forward_weights_init='glorot_uniform', bias_init=0.0, use_transposed_weights=False, **kwargs)[source]¶
Linear/forward/fully-connected/1x1-conv layer. Does a linear transformation on the feature-dimension of the input with an optional bias term and an optional activation function. See also DotLayer, ElemwiseProdLayer, WeightedSumLayer.
- Parameters:
activation (str|None) – e.g. “relu”, or None
with_bias (bool)
grad_filter (float|None) – if grad norm is higher than this threshold (before activation), the grad is removed
forward_weights_init (str) – see returnn.tf.util.basic.get_initializer()
recurrent_weights_init (str) – see returnn.tf.util.basic.get_initializer()
bias_init (str|float) – see returnn.tf.util.basic.get_initializer()
use_transposed_weights (bool) – If True, define the weight matrix with transposed dimensions (n_out, n_in).
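The classic LinearLayer usage sketch (hypothetical layer names):
"ff1": {"class": "linear", "activation": "relu", "n_out": 512, "from": "data"}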
- class returnn.tf.layers.basic.SoftmaxLayer(**kwargs)[source]¶
Just a LinearLayer with activation=”softmax” by default.
- Parameters:
activation (str|None) – e.g. “relu”, or None
with_bias (bool)
grad_filter (float|None) – if grad norm is higher than this threshold (before activation), the grad is removed
forward_weights_init (str) – see returnn.tf.util.basic.get_initializer()
recurrent_weights_init (str) – see returnn.tf.util.basic.get_initializer()
bias_init (str|float) – see returnn.tf.util.basic.get_initializer()
use_transposed_weights (bool) – If True, define the weight matrix with transposed dimensions (n_out, n_in).
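SoftmaxLayer is typically used as an output layer with a cross-entropy loss; a minimal sketch, assuming a target data key "classes" in extern_data (the output dim is then inferred from the target):
"output": {"class": "softmax", "from": "encoder", "loss": "ce", "target": "classes"}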
- class returnn.tf.layers.basic.LengthLayer(axis='T', add_time_axis=False, dtype='int32', sparse=False, **kwargs)[source]¶
Returns the length of sources as (B,), via input size_placeholder.
- Parameters:
axis (str|Dim)
add_time_axis (bool) – should not be used
dtype (str)
sparse (bool)
- class returnn.tf.layers.basic.SoftmaxOverSpatialLayer(axis=None, energy_factor=None, start=None, window_start=None, window_size=None, use_time_mask=None, log_space=False, **kwargs)[source]¶
This applies a softmax over spatial axis/axes (currently only time axis supported). E.g. when the input is of shape (B,T,dim), the output will be (B,T,dim). It automatically masks the frames outside the seq defined by the seq-len. In contrast to SoftmaxLayer, this will not do a linear transformation. See SeqLenMaskLayer if you just want to apply a masking.
- Parameters:
axis (Dim|str|None) – which axis to do the softmax over. “T” by default
energy_factor (float|None) – the energy will be scaled by this factor. This is like a temperature for the softmax. In Attention-is-all-you-need, this is set to 1/sqrt(base_ctx.dim).
start (LayerBase|None) – Tensor of shape (B,) indicating the start frame
window_start (LayerBase|int|None) – Layer with output of shape (B,) or (constant) int value indicating the window start.
window_size (LayerBase|int|None) – Layer with output of shape (B,) or (constant) int value indicating the window size.
use_time_mask (bool) – if True, assumes dyn seq len, and use it for masking. By default, if dyn seq len exists, it uses it.
log_space (bool) – if True, returns in log space (i.e. uses log_softmax)
- classmethod get_out_data_from_opts(name, sources, axis=None, start=None, window_start=None, window_size=None, **kwargs)[source]¶
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer
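A common use of SoftmaxOverSpatialLayer is turning attention energies into attention weights; a sketch with hypothetical layer names:
"att_weights": {"class": "softmax_over_spatial", "from": "energy"}  # softmax over "T", masked by seq lens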
- class returnn.tf.layers.basic.SeqLenMaskLayer(mask_value, axis='T', seq_len_source=None, start=None, window_start=None, window_size=None, **kwargs)[source]¶
Masks some values away given the seq_len_source with mask_value. Also see SoftmaxOverSpatialLayer. Also see SwitchLayer, which can be used to apply a generic mask.
- Parameters:
- classmethod build_mask(x, axis='T', axis_allow_int=<class 'returnn.util.basic.NotSpecified'>, seq_len_source=None, start=None, window_start=None, window_size=None)[source]¶
- Parameters:
x (Data)
axis (Dim|str|int)
axis_allow_int (bool|NotSpecified) – Some callers of this function would pass in an int for axis directly. In that case, explicitly set this to True.
seq_len_source (Data|None)
start (Data|None)
window_start (Data|None)
window_size (Data|int|None)
- Returns:
mask which is broadcastable to energy_data, thus you can e.g. use returnn.tf.util.basic.where_bc()
- Return type:
tf.Tensor
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer
- class returnn.tf.layers.basic.BooleanMaskLayer(*, mask: LayerBase, dims: Sequence[Dim], out_dim: Dim | None = None, **kwargs)[source]¶
Wrapper around tf.boolean_mask.
- Parameters:
mask
dims
out_dim
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer
- class returnn.tf.layers.basic.RandomStateInitLayer(algorithm=None, seed=None, out_dim=None, **kwargs)[source]¶
This calculates the initial state value for the state var of RandomLayer. This depends on the algorithm and seed.
- Parameters:
algorithm (str|tf.random.Algorithm|None) – "philox", "three-fry", "auto-select". by default "philox". See tf.random.stateless_uniform() for some documentation. "auto-select" will automatically select the optimal algorithm based on the device, so it might select a different algorithm depending on the device. Note that the state shape is dependent on the device, so if you want that checkpoints are compatible across devices, do not use "auto-select". We take the default from tf.random.Generator.
seed (int|Sequence[int]|numpy.ndarray|None) – if given, the state will deterministically depend on this (and the algorithm) and nothing else. If you have multiple random generators (state vars), make sure that you have different seeds for each! If None (default), the seed will be deterministically taken from the network random generator at construction time, which is usually a good idea. You still can change the global network seed.
out_dim (Dim|None) – new dim tag for random state dim
- classmethod select_algorithm(algorithm)[source]¶
- Parameters:
algorithm (str|int|tf.random.Algorithm|None)
- Return type:
int
- classmethod get_out_data_from_opts(name, algorithm=None, out_dim=None, **kwargs)[source]¶
- Parameters:
name (str)
algorithm (str|None)
out_dim (Dim|None)
- Return type:
Data
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer
- class returnn.tf.layers.basic.RandomLayer(shape, distribution, mean=None, stddev=None, bound=None, minval=None, maxval=None, dtype='float32', sparse_dim=None, feature_dim=None, seed=None, algorithm=None, explicit_state=None, auto_update_state=None, static=None, shape_deps=(), stop_grad: bool = False, **kwargs)[source]¶
Generates random numbers from uniform or normal or truncated normal distribution.
This uses the TensorFlow stateless random ops internally, i.e. all the state handling is explicit. The state var can be explicitly provided and initialized via RandomStateInitLayer, or when not provided, it will be automatically created.
There are two possible distinct use cases:
For any randomness in the model, e.g. dropout. So each session.run step will produce a new random number and advance the random state.
To initialize parameters via the config, using VariableLayer with the init_by_layer option. This will only be called once when initializing the parameters. For this use case, we do not want to keep a random state var. You can just pass static=False. Alternatively you could also pass the output of a RandomStateInitLayer as state.
- Parameters:
shape (Sequence[Dim|int])
distribution (str) – “uniform”, “normal” or “truncated_normal”
mean (int|float|LayerBase|None)
stddev (int|float|LayerBase|None)
bound (int|float|LayerBase|None) – for uniform, defining the range [-bound, bound)
minval (int|float|LayerBase|None) – for uniform
maxval (int|float|LayerBase|None) – for uniform
dtype (str)
sparse_dim (Dim|None)
feature_dim (Dim|None)
seed (int|list[int]|numpy.ndarray|None) – If not given, uses self.network.random.randint, i.e. then it is controlled by the global seed setting, and every layer would get its own seed. If you specify it explicitly, make sure every RandomLayer uses a different seed, otherwise you would get the same random numbers everywhere.
algorithm (str|tf.random.Algorithm|None) – see RandomStateInitLayer
explicit_state (LayerBase|None) – You can pass the state explicitly here. If not given, it will be created automatically, and updated automatically. You could pass a VariableLayer with initial value via RandomStateInitLayer, or directly a RandomStateInitLayer. If auto_update_state is True, it must be a variable, and every time a new random number is created, this variable is updated. Otherwise (default) it will not be updated automatically.
static (bool|None) – if no state at all should be used. it just relies on the seed then.
shape_deps (list[LayerBase]) – for dyn dim tags in shape
stop_grad (bool) – if True, will stop the gradient to mean,stddev,bound,minval,maxval
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer
- class returnn.tf.layers.basic.RandIntLayer(shape, maxval, minval=0, dtype='int32', sparse_dim=None, seed=None, **kwargs)[source]¶
Generates random integer numbers using tf.random.uniform. It is recommended to use RandomLayer instead.
- Parameters:
shape (tuple[Dim|int]|list[Dim|int]) – desired shape of output tensor
maxval (int|LayerBase) – upper bound (exclusive) on range of random values
minval (int|LayerBase) – lower bound (inclusive) on range of random values
dtype (str) – type of the output. For random ints, int32 and int64 make sense, but could also be floats
sparse_dim (Dim|None)
seed (int|None) – random seed
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer ((str)->LayerBase)
- classmethod get_out_data_from_opts(name, network, shape, maxval, minval=0, dtype='int32', sparse_dim=None, **kwargs)[source]¶
- Parameters:
name (str)
network (returnn.tf.network.TFNetwork)
shape (tuple[Dim|int]|list[Dim|int]) – desired shape of output tensor
maxval (int|LayerBase) – upper bound (exclusive) on range of random values
minval (int|LayerBase) – lower bound (inclusive) on range of random values
dtype (str) – type of the output. For random ints, int32 and int64 make sense, but could also be floats
sparse_dim (Dim|None)
- Return type:
Data
- class returnn.tf.layers.basic.RangeLayer(limit, start=0, delta=1, dtype=None, sparse=False, out_spatial_dim=None, **kwargs)[source]¶
Generic wrapper around tf.range. See also RangeInAxisLayer.
- Parameters:
limit (int|float)
start (int|float)
delta (int|float)
dtype (str|None)
sparse (bool)
out_spatial_dim (Dim|None)
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer ((str)->LayerBase)
- class returnn.tf.layers.basic.RangeInAxisLayer(axis, dtype='int32', unbroadcast=False, keepdims=False, sparse=False, **kwargs)[source]¶
Assume the input is e.g. (B,T,D) and you specify axis="T"; then you will get (T,), where the specified axis is filled with tf.range. See also RangeLayer.
- Parameters:
axis (str|Dim)
dtype (str)
unbroadcast (bool) – DEPRECATED, unsupported, and not needed
keepdims (bool) – DEPRECATED, unsupported, and not needed
sparse (bool)
- class returnn.tf.layers.basic.RangeFromLengthLayer(dtype='int32', sparse=False, out_spatial_dim=None, **kwargs)[source]¶
Given some dynamic sequence lengths as input, this creates a tf.range over the implied dimension. As a side effect, this can create a new dyn dim tag for the given sequence lengths. This side effect can be the main functionality in certain use cases. See also RangeInAxisLayer.
Consider the example:
y: {class: range_in_axis, from: x, axis: T}
This is basically equivalent to:
x_len: {class: length, from: x}
y: {class: range_from_length, from: x_len}
- Parameters:
axis (str)
dtype (str)
sparse (bool)
out_spatial_dim (Dim|None)
- class returnn.tf.layers.basic.BatchSoftmaxLayer(**kwargs)[source]¶
Softmax over spatial and feature axis
- Parameters:
in_dim (Dim|None)
out_shape (set[Dim|returnn.tf.util.data._MarkedDim]|tuple|list|None)
dropout (float) – 0.0 means to apply no dropout. dropout will only be applied during training
dropout_noise_shape (dict[Dim|str|list[Dim|str]|tuple[Dim|str],int|str|None]|None) – see Data.get_bc_shape()
dropout_on_forward (bool) – apply dropout during inference
mask (str|None) – “dropout” or “unity” or None. this is obsolete and only here for historical reasons
- class returnn.tf.layers.basic.ConstantLayer(sources, value=0.0, shape=None, dtype=None, with_batch_dim=False, sparse_dim=None, feature_dim=None, shape_deps=(), **kwargs)[source]¶
Output is a constant value.
- Parameters:
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str]) – will modify inplace
network (returnn.tf.network.TFNetwork)
get_layer (((str) -> LayerBase)) – function to get or construct another layer
- class returnn.tf.layers.basic.GatingLayer(activation, gate_activation='sigmoid', out_dim=None, **kwargs)[source]¶
Splits the output into two equal parts, applies the gate_activation (sigmoid by default) on the one part, some other activation (e.g. tanh) on the other part and then element-wise multiplies them. Thus, the output dimension is input-dimension / 2.
- Parameters:
activation (str)
gate_activation (str)
out_dim (Dim|None)
- classmethod get_out_data_from_opts(name, sources, n_out=<class 'returnn.util.basic.NotSpecified'>, out_dim=None, **kwargs)[source]¶
- Parameters:
name (str)
sources (list[LayerBase])
n_out (int|None|NotSpecified)
out_dim (Dim|None)
- Return type:
Data
- class returnn.tf.layers.basic.WindowLayer(window_size=None, window_dim=None, window_left=None, window_right=None, axis='T', out_spatial_dim=None, padding='same', stride=1, _use_opt_dim_order=None, **kwargs)[source]¶
Adds a window dimension. By default, uses the time axis and goes over it with a sliding window. The new axis for the window is created right after the time axis. In PyTorch, this is called unfold. We sometimes call this "chunking". There is also the similar TimeChunkingLayer.
E.g. if the input is (batch, time, dim), the output is (batch, time, window_size, dim). If you want to merge the (window_size, dim) together to (window_size * dim,), you can use the MergeDimsLayer, e.g. {"class": "merge_dims", "axes": "except_time"}.
Use stride==window_size and window_right=window_size - 1 in combination with a MergeDimsLayer to achieve feature stacking with right-hand zero padding.
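A sketch of that feature-stacking pattern (hypothetical layer names, window size 3):
"stack_win": {"class": "window", "window_size": 3, "stride": 3, "window_right": 2, "from": "data"}
"stacked": {"class": "merge_dims", "axes": "except_time", "from": "stack_win"}  # (B,T/3,3*D)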
This is not to take out a single window from the time-dimension. See SliceLayer or SliceNdLayer.
The inverse layer is FoldLayer.
- Parameters:
- classmethod get_out_data_from_opts(name, network, sources, window_size=None, window_dim=None, axis='T', out_spatial_dim=None, padding='same', stride=1, _use_opt_dim_order=None, **kwargs)[source]¶
- Parameters:
name (str)
network (returnn.tf.network.TFNetwork)
sources (list[LayerBase])
window_size (int|None)
window_dim (Dim|None)
axis (Dim|str)
out_spatial_dim (Dim|None)
padding (str)
stride (int)
_use_opt_dim_order (bool|None)
- Return type:
Data
- classmethod get_rec_initial_extra_outputs(network, batch_dim, rec_layer, window_size=None, window_dim=None, axis='T', sources=(), **kwargs)[source]¶
- Parameters:
network (returnn.tf.network.TFNetwork)
batch_dim (tf.Tensor)
rec_layer (returnn.tf.layers.rec.RecLayer|LayerBase)
window_size (int|None)
window_dim (Dim|None)
axis (Dim|str)
sources (list[LayerBase])
- Return type:
dict[str,tf.Tensor]
- class returnn.tf.layers.basic.FoldLayer(mode: str, in_spatial_dim: Dim | str, window_dim: Dim | str, out_spatial_dim: Dim | None = None, padding: str = 'same', window_left: int | None = None, window_right: int | None = None, stride: int = 1, **kwargs)[source]¶
The inverse of WindowLayer. We sometimes call this "unchunking". The TimeUnChunkingLayer is similar.
Input (in_spatial_dim, window_dim, other_dims…) -> output (out_spatial_dim, other_dims…).
The window_dim is folded into the out_spatial_dim. This is also similar to the PyTorch fold operation (with mode="sum").
- Parameters:
mode – “sum” or “mean” (average), for overlapping frames
in_spatial_dim
window_dim
out_spatial_dim
padding
window_left
window_right
stride
- class returnn.tf.layers.basic.CumsumLayer(axis='T', additional_left_summand_per_element=None, reverse=False, **kwargs)[source]¶
Basically wraps tf.cumsum. Also supports that in the RecLayer.
- Parameters:
axis (str) – see Data.get_axis_from_description()
additional_left_summand_per_element (str|int|float|None) – the order matters for tf.string
reverse (bool)
- classmethod get_out_data_from_opts(name, sources, axis='T', **kwargs)[source]¶
- Parameters:
name (str)
sources (list[LayerBase])
axis (str)
- Return type:
Data
- classmethod get_rec_initial_extra_outputs(network, batch_dim, rec_layer, axis='T', sources=(), **kwargs)[source]¶
- Parameters:
network (returnn.tf.network.TFNetwork)
batch_dim (tf.Tensor)
rec_layer (returnn.tf.layers.rec.RecLayer|LayerBase)
axis (str)
sources (list[LayerBase])
- Return type:
dict[str,tf.Tensor]
- class returnn.tf.layers.basic.PadLayer(*, axes: Dim | str | Sequence[Dim | str], padding: int | Dim | Tuple[int | Dim, int | Dim] | Sequence[Tuple[int | Dim, int | Dim]], out_dims: Dim | Sequence[Dim] | None = None, handle_dynamic_dims: bool | None = None, value: int | float = 0, mode: str = 'constant', **kwargs)[source]¶
Adds (e.g. zero) padding in some axis or axes. Also see PrefixInTimeLayer for dynamic dims.
- Parameters:
axes – e.g. "F" etc. See Data.get_axes_from_description().
padding – how much to pad left/right in each axis
out_dims
handle_dynamic_dims – True: when doing right padding on a dynamic dim, value will be added after the seq end, not at the end of the dimension. False: value will be added at the end of the dimension. By default, in behavior version >=21, this is True, in older versions, this is False.
value – what constant value to pad, with mode==”constant”
mode – “constant”, “reflect”, “symmetric” and “replication”
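A PadLayer sketch (hypothetical layer names), prepending two zero frames on the time axis:
"padded": {"class": "pad", "axes": "T", "padding": (2, 0), "value": 0, "mode": "constant", "from": "x"}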
- class returnn.tf.layers.basic.MergeDimsLayer(axes, keep_order=<class 'returnn.util.basic.NotSpecified'>, n_out=None, out_dim=None, **kwargs)[source]¶
Merges a list of axes into a single one. (Flatten the dims.) E.g. input is (batch, width, height, dim) and axes=(1,2), then we get (batch, width*height, dim). Or input is (batch, time, height, dim) and axes="except_time", then we get (batch, time, height*dim). See also CombineDimsLayer. When batch and time got merged, SplitBatchTimeLayer can undo this. When you want to merge batch and time, but remove the padding efficiently, i.e. flatten it, see FlattenBatchLayer.
- Parameters:
axes (Sequence[Dim|str]) – see Data.get_axis_from_description()
keep_order (bool|NotSpecified) – The old default was: the axes are sorted, and then merged. Thus, the order of incoming axes will influence the result. E.g. inputs [B,S,F] and [B,F,S], with axes=["S","F"], will get different results, although the output shape is [B,S*F] in both cases. This is bad: In general, other layers in RETURNN might reorder the axes for various reasons, and all layers should behave in the same way, no matter the order. It is recommended to set keep_order=True, such that the order defined in axes defines the behavior, and not the incoming axis order. Since behavior version 6, this is already the case.
n_out (int|None)
out_dim (Dim|None)
- classmethod get_out_data_from_opts(name, axes, keep_order=<class 'returnn.util.basic.NotSpecified'>, sources=(), n_out=<class 'returnn.util.basic.NotSpecified'>, out_type=None, out_dim=None, **kwargs)[source]¶
- Parameters:
name (str)
axes (Sequence[Dim|str])
keep_order (bool|NotSpecified)
sources (list[LayerBase])
n_out (int|None|NotSpecified)
out_type (None|dict[str])
out_dim (Dim|None)
- Return type:
Data
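A MergeDimsLayer sketch (hypothetical layer names) that merges batch and time and undoes it again via SplitBatchTimeLayer, as mentioned above:
"flat": {"class": "merge_dims", "axes": ["B", "T"], "from": "x"}
"unflat": {"class": "split_batch_time", "base": "x", "from": "flat"}  # recovers seq lens from "x"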
- class returnn.tf.layers.basic.SplitLayer(axis=None, num_splits=None, size_splits=None, out_dims=None, **kwargs)[source]¶
Splits one axis into multiple parts, via tf.split. self.output is simply the input copied. Each part can be accessed via the sublayers “/%i”.
- Parameters:
axis (str|None) – feature axis by default
num_splits (int|None)
size_splits (list[int]|None)
out_dims (list[Dim]|None)
- classmethod get_available_sub_layer_names(parent_layer_kwargs)[source]¶
- Parameters:
parent_layer_kwargs (dict[str])
- Return type:
list[str]
- classmethod get_out_data_from_opts(sources, **kwargs)[source]¶
- Parameters:
sources (list[LayerBase])
- Return type:
Data
- classmethod get_sub_layer_out_data_from_opts(layer_name, parent_layer_kwargs)[source]¶
- Parameters:
layer_name (str) – name of the sub_layer (right part of ‘/’ separated path)
parent_layer_kwargs (dict[str]) – kwargs for the parent layer (as kwargs in cls.get_out_data_from_opts())
- Returns:
Data template, class type of sub-layer, layer opts (transformed)
- Return type:
(Data, type, dict[str])|None
- class returnn.tf.layers.basic.SplitDimsLayer(axis, dims, pad_to_multiples=None, pad_value=0, **kwargs)[source]¶
Splits one axis into multiple axes. E.g. if you know that your feature-dim is composed by a window, i.e. the input is (batch, time, window * feature), you can set axis=”F”, dims=(window, -1), and you will get the output (batch, time, window, feature).
If the split axis has a dynamic length, exactly one of the axes that we split into needs to also have a dynamic length. You can e.g. use this to split the input dimension into smaller "chunks" of a fixed window size. E.g. you could have input (batch, time, feature) and set axis="T", dims=(-1, window), to get output (batch, split_time, window, feature). In this case, the exact sequence lengths are lost and everything is padded to multiples of the window size using the given padding value. Use ReinterpretDataLayer to receive back the original sequence lengths after merging.
Also see SplitBatchTimeLayer. Also see MergeDimsLayer which can undo this operation.
- Parameters:
axis (Dim|str) – e.g. “F”
dims (tuple[Dim|int]|list[Dim|int]) – what the axis should be split into. e.g. (window, -1)
pad_to_multiples (bool|None) – If true, input will be padded to the next multiple of the product of the static dims, such that splitting is actually possible. By default this is done iff the axis has a dynamic size
pad_value (int|float) – What pad value to use for pad_to_multiples
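A SplitDimsLayer sketch (hypothetical layer names and window size), splitting a feature dim that is composed of window * feature:
"split_feat": {"class": "split_dims", "axis": "F", "dims": (8, -1), "from": "x"}  # (B,T,8*rest) -> (B,T,8,rest)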
- class returnn.tf.layers.basic.SplitBatchTimeLayer(base, **kwargs)[source]¶
A very specific layer which expects to get input of shape (batch * time, …) and converts it into (batch, time, …), where it recovers the seq-lens from some other layer. See SplitDimsLayer for a more generic layer.
- Parameters:
base (LayerBase) – used to recover the seq-lens
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer
- class returnn.tf.layers.basic.ReshapeLayer(in_dims, out_dims, extra_deps=(), **kwargs)[source]¶
Allows to reshape (…, in_dims, …) to (…, out_dims, …) as long as prod(in_dims) == prod(out_dims).
in_dims don’t need to be directly behind each other or in that order – internally it will permute it such that it is in the right order. out_dims should be defined.
This can be used for clever indexing, slicing, padding tricks. It can also be used as an alternative to SplitDimsLayer or MergeDimsLayer.
- Parameters:
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str]) – will modify inplace
network (returnn.tf.network.TFNetwork)
get_layer (((str) -> LayerBase)) – function to get or construct another layer
- class returnn.tf.layers.basic.FlattenBatchLayer(axis='T', batch_major=True, **kwargs)[source]¶
Merges one axis into the batch axis. If the axis has dynamic lengths, this would use flattening, i.e. recalculate the padding, i.e. the size changes. This basically wraps flatten_with_seq_len_mask() or flatten_with_seq_len_mask_time_major(). See also MergeDimsLayer, which does not do flattening, i.e. the size stays the same.
- Parameters:
axis (str)
batch_major (bool) – if False, will flatten in time-major manner
- class returnn.tf.layers.basic.UnflattenBatchLayer(**kwargs)[source]¶
Inverse of FlattenBatchLayer, so recovers an axis previously merged into the batch axis.
This basically wraps unflatten_with_seq_len_mask().
- Parameters:
in_dim (Dim|None)
out_shape (set[Dim|returnn.tf.util.data._MarkedDim]|tuple|list|None)
dropout (float) – 0.0 means to apply no dropout. dropout will only be applied during training
dropout_noise_shape (dict[Dim|str|list[Dim|str]|tuple[Dim|str],int|str|None]|None) – see Data.get_bc_shape()
dropout_on_forward (bool) – apply dropout during inference
mask (str|None) – “dropout” or “unity” or None. this is obsolete and only here for historical reasons
- class returnn.tf.layers.basic.UnflattenNdLayer(sizes, num_axes, in_dim='T', out_dims=None, declare_same_sizes_as=None, **kwargs)[source]¶
This keeps the batch axis as-is, i.e. the flattening/unflattening did not happen on the batch axis.
Example:
Assumes that the input is of shape (B,T,<Ds>) which represents flattened images, where each image is of size width * height. We additionally provide these image sizes (shape (B,2)), i.e. (width,height) tuples. We return the unflattened images of shape (B,W,H,<Ds>), where W/H are the max width/height.
This basically wraps returnn.tf.util.basic.unflatten_nd().
- Parameters:
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer
- class returnn.tf.layers.basic.ExpandDimsLayer(axis, dim=1, **kwargs)[source]¶
Adds some axis.
- Parameters:
axis (str|int) – axis to add, e.g. “F”|”feature” or “spatial”|”time”|”T”. if this is an integer, the input data is first converted into batch-major mode, and then this is counted with batch-dim.
dim (int|Dim) – dimension of new axis (1 by default)
- class returnn.tf.layers.basic.RepeatLayer(repetitions, axis='T', out_dim=None, **kwargs)[source]¶
A wrapper around tf.repeat, but supports an additional batch axis for the durations. The sum of the repetitions has to be non-zero for each sequence in the batch.
This layer can only be used with Tensorflow 1.15.0 or newer.
- Parameters:
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer
- class returnn.tf.layers.basic.TileLayer(multiples, out_dims=None, **kwargs)[source]¶
A wrapper around tf.tile
- Parameters:
- class returnn.tf.layers.basic.CastLayer(dtype, output, **kwargs)[source]¶
Cast to some other dtype.
- Parameters:
dtype (str)
output (Data)
- class returnn.tf.layers.basic.SwapAxesLayer(axis1, axis2, **kwargs)[source]¶
Swaps two axes. Basically a wrapper around returnn.tf.util.basic.swapaxes(). Note that usually, this should not be needed, and it is recommended not to be used, as this will be unnecessarily inefficient. Normally, all RETURNN layers will automatically transpose the input data into whatever format they need.
All axes always have a special meaning (e.g. feature dim or time dim) or dimension tag (e.g. for time axes, including dyn seq lengths). If you need to change the meaning (and not actually transpose / swap axes), you need to use ReinterpretDataLayer.
See also TransposeLayer for a more generic variant.
See also ReinterpretDataLayer, which does not swap/transpose axes, but allows to reinterpret their meaning / dim tags.
- Parameters:
axis1 (int|str)
axis2 (int|str)
- class returnn.tf.layers.basic.TransposeLayer(perm: Dict[Dim | str | int, Dim | str] | Sequence[Dim], **kwargs)[source]¶
Basically a wrapper around tf.transpose().
Note that usually, this should not be needed, and it is recommended not to be used, as this will be unnecessarily inefficient. Normally, all RETURNN layers will automatically transpose the input data into whatever format they need.
All axes always have a special meaning (e.g. feature dim or time dim) or dimension tag (e.g. for time axes, including dyn seq lengths). If you need to change the meaning (and not actually transpose / swap axes), you need to use ReinterpretDataLayer.
See also ReinterpretDataLayer, which does not transpose axes, but allows to reinterpret their meaning / dim tags.
One valid use case is to use this for the final output layer, to make sure the output is in the correct format.
- Parameters:
perm – target axis -> source axis
- classmethod transpose(input_data: Tensor, perm: Dict[Dim | str | int, Dim | str] | Sequence[Dim], name: str | None = None) Tensor [source]¶
- Parameters:
input_data
perm
name
- Returns:
transposed data
- class returnn.tf.layers.basic.ReinterpretDataLayer(switch_axes=None, size_base=None, batch_dim_base=None, set_axes=None, set_dim_tags=None, enforce_batch_major=False, enforce_time_major=False, set_sparse=None, set_sparse_dim=<class 'returnn.util.basic.NotSpecified'>, increase_sparse_dim=None, **kwargs)[source]¶
Acts like the CopyLayer but reinterprets the role of some axes or data.
- Parameters:
switch_axes (str|list[str]) – e.g. “bt” to switch batch and time axes
size_base (LayerBase|None) – copy the size_placeholder from the given layer
batch_dim_base (LayerBase|None) – copy the batch dim from this layer
set_axes (dict[str,Dim|str|None]) – This can be used to overwrite the special axes like time_dim_axis or feature_dim_axis. For that, use keys "B","T" or "F", and a value via Data.get_axis_from_description().
set_dim_tags (dict[str|Dim,Dim]|Sequence[Tuple[Dim,Dim]]|None) – axis -> new dim tag. assigns new dim tags. If the passed dim tag is yet undefined, this will not use same_dim_tags_as (declare_same_as) but create a new dim tag. This option is useful for generalized self attention (https://github.com/rwth-i6/returnn/issues/391).
enforce_batch_major (bool)
enforce_time_major (bool)
set_sparse (bool|None) – if bool, set sparse value to this
set_sparse_dim (Dim|int|None|NotSpecified) – set sparse dim to this. assumes that it is sparse
increase_sparse_dim (int|None) – add this to the dim. assumes that it is sparse
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer
- classmethod get_out_data_from_opts(name, sources, switch_axes=None, size_base=None, batch_dim_base=None, set_axes=None, set_dim_tags=None, enforce_batch_major=False, enforce_time_major=False, set_sparse=None, set_sparse_dim=<class 'returnn.util.basic.NotSpecified'>, increase_sparse_dim=None, **kwargs)[source]¶
- Parameters:
name (str)
sources (list[LayerBase])
switch_axes (str|list[str]) – e.g. “bt” to switch batch and time axes
size_base (LayerBase|None) – similar as size_target
batch_dim_base (LayerBase|None)
set_axes (dict[str,Dim|str|None])
set_dim_tags (dict[str|Dim,Dim]|Sequence[Tuple[Dim,Dim]]|None)
enforce_batch_major (bool)
enforce_time_major (bool)
set_sparse (bool|None) – if bool, set sparse value to this
set_sparse_dim (Dim|int|None|NotSpecified) – set sparse dim to this. assumes that it is sparse
increase_sparse_dim (int|None) – add this to the dim. assumes that it is sparse
- class returnn.tf.layers.basic.ConvLayer(filter_size, padding, strides=1, dilation_rate=1, groups=1, input_expand_dims=0, input_add_feature_dim=False, input_split_feature_dim=None, in_dim=None, in_spatial_dims=None, n_out=None, out_dim=None, out_spatial_dims=None, auto_use_channel_first=<class 'returnn.util.basic.NotSpecified'>, with_bias=<class 'returnn.util.basic.NotSpecified'>, activation=None, forward_weights_init='glorot_uniform', bias_init=0.0, filter=None, filter_perm=None, bias=None, use_time_mask=False, pad_seq_len_to_power=None, **kwargs)[source]¶
A generic convolution layer which supports 1D, 2D and 3D convolution. Pooling can be done in the separate “pool” layer.
- Parameters:
filter_size (Sequence[Dim]|Sequence[int]) – (width,), (height,width) or (depth,height,width) for 1D/2D/3D conv. The input data ndim must match, or you can add dimensions via input_expand_dims or input_add_feature_dim. It will automatically swap the batch-dim to the first axis of the input data.
padding (str|int|Sequence[int]) – “same”, “valid” or “same_static”. “same_static” is calculated differently depending on whether an axis is static or dynamic. For static axes, “same_static” padding is the same as “same” padding, i.e. filter_size - 1 - (T + strides - 1) % strides. For dynamic axes, “same_static” calculates the total padding size as filter_size - 1, i.e. it is independent of the length T of the axis and the striding. For dynamic axes, to avoid skipping any frames on the right, we set left_padding = (filter_size - strides) // 2.
strides (int|Sequence[int]) – strides for the spatial dims, i.e. length of this tuple should be the same as filter_size, or a single int.
dilation_rate (int|Sequence[int]) – dilation for the spatial dims
groups (int) – grouped convolution
in_dim (Dim|None)
in_spatial_dims (Sequence[Dim|str]|None)
n_out (int|None) – number of outgoing features
out_dim (Dim|None)
out_spatial_dims (Sequence[Dim]|None)
input_expand_dims (int) – number of spatial dims to add to the input
input_add_feature_dim (bool) – will add a dim at the end and use input-feature-dim == 1, and use the original input feature-dim as a spatial dim.
input_split_feature_dim (None|int) – if set, like input_add_feature_dim it will add a new feature dim which is of value input_split_feature_dim, and the original input feature dim will be divided by input_split_feature_dim, thus it must be a multiple of that value.
auto_use_channel_first (bool|NotSpecified) – convert the input to NCHW or not
with_bias (bool|NotSpecified) – if True, will add a bias to the output features. True by default since behavior version 10.
activation (None|str) – if set, will apply this function at the end
filter (LayerBase|None) – if given, will not create an own parameter, but use this as the filter
filter_perm (dict[str,str]|None) – transposes the filter (input filter as layer)
bias (LayerBase|None) – if given, will not create an own parameter, but use this as the bias
use_time_mask (bool)
pad_seq_len_to_power (Optional[float]) – pad the sequence length to a power of the given number, to reduce the number of different sequence lengths. See https://github.com/rwth-i6/returnn/issues/1450 and https://github.com/tensorflow/tensorflow/issues/62441.
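Example (a sketch; assuming the usual net-dict class name "conv"; the layer name and dims are made up):
# hypothetical layer name and dims; "conv" is the assumed class name for ConvLayer
"conv0": {"class": "conv", "from": "data", "filter_size": (3, 3), "padding": "same", "strides": 1, "n_out": 32, "activation": "relu", "input_add_feature_dim": True}
Here input_add_feature_dim turns an input of shape [B,T,F] into [B,T,F,1], so F becomes a second spatial dim and the convolution is 2D, as described for that option above.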
- classmethod set_output_dim_tags(output, num_batch_dims, in_spatial_dims, out_spatial_dims, filter_size, strides, dilation_rate, padding)[source]¶
- classmethod transform_input(input_data, network, in_dim=None, in_spatial_dims=None, input_expand_dims=0, input_split_feature_dim=None, input_add_feature_dim=False, use_time_mask=False, mask_value: float = 0.0)[source]¶
- Parameters:
input_data (Data)
network (returnn.tf.network.TFNetwork)
in_dim (Dim|None)
in_spatial_dims (list[Dim|str]|None)
input_expand_dims (int) – number of spatial dims to add to the input
input_split_feature_dim (None|int) – if set, like input_add_feature_dim it will add a new feature dim which is of value input_split_feature_dim, and the original input feature dim will be divided by input_split_feature_dim, thus it must be a multiple of that value.
input_add_feature_dim (bool) – will add a dim at the end and use input-feature-dim == 1, and use the original input feature-dim as a spatial dim.
use_time_mask (bool)
mask_value – when
use_time_mask
is used, what value to use for the mask
- Returns:
(transformed input, num batch dims). all batch dims are at the front
- Return type:
(Data, int)
- classmethod get_input_placeholder_with_same_static_padding(input_data: Tensor, num_batch_dims: int, filter_size: Sequence[int], strides: Sequence[int], out_batch_feature_major: bool) Tensor [source]¶
Returns the placeholder of input_data with same_static padding applied to it.
- Parameters:
input_data – [Batch…, Spatial…, Feature] or [Batch…, Feature, Spatial…]
num_batch_dims
filter_size
strides
out_batch_feature_major
- classmethod get_input_placeholder_with_int_padding(input_data: Tensor, *, num_batch_dims: int, out_batch_feature_major: bool, padding: int | Sequence[int], pad_value: float = 0.0) Tensor [source]¶
Returns the placeholder of input_data with the given integer padding applied to it.
- Parameters:
input_data – [Batch…, Spatial…, Feature] or [Batch…, Feature, Spatial…]
num_batch_dims
out_batch_feature_major
padding
pad_value
- classmethod get_out_data_from_opts(name, sources, network, filter_size, padding, strides=1, dilation_rate=1, input_expand_dims=0, input_add_feature_dim=False, input_split_feature_dim=None, in_dim=None, in_spatial_dims=None, n_out=None, out_dim=None, out_spatial_dims=None, auto_use_channel_first=<class 'returnn.util.basic.NotSpecified'>, **kwargs)[source]¶
- Parameters:
name (str)
sources (Sequence[LayerBase])
network (returnn.tf.network.TFNetwork)
filter_size (Sequence[int|Dim])
padding (str|int|Sequence[int])
strides (int|Sequence[int])
dilation_rate (int|Sequence[int])
input_expand_dims (int) – number of spatial dims to add to the input
input_add_feature_dim (bool)
input_split_feature_dim (None|int)
in_dim (Dim|None)
in_spatial_dims (Sequence[Dim|str]|None)
n_out (int|None) – number of outgoing features
out_dim (Dim|None)
out_spatial_dims (Sequence[Dim]|None)
auto_use_channel_first (bool|NotSpecified)
- Return type:
Data
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer
- class returnn.tf.layers.basic.PoolLayer(mode, pool_size, padding='VALID', dilation_rate=1, strides=None, in_dim=None, in_spatial_dims=None, out_dim=None, out_spatial_dims=None, use_channel_first=<class 'returnn.util.basic.NotSpecified'>, use_time_mask=False, **kwargs)[source]¶
A generic N-D pooling layer. This would usually be done after a convolution for down-sampling.
- Parameters:
mode (str) – “max” or “avg”
pool_size (Sequence[int]) – shape of the window of each reduce
padding (str|int|Sequence[int]) – “same”, “valid” or “same_static”. “same_static” is calculated differently depending on whether an axis is static or dynamic. For static axes, “same_static” padding is the same as “same” padding, i.e. filter_size - 1 - (T + strides - 1) % strides. For dynamic axes, “same_static” calculates the total padding size as filter_size - 1, i.e. it is independent of the length T of the axis and the striding. For dynamic axes, to avoid skipping any frames on the right, we set left_padding = (filter_size - strides) // 2.
dilation_rate (Sequence[int]|int)
strides (Sequence[int]|int|None) – in contrast to tf.nn.pool, the default (if it is None) will be set to pool_size
in_dim (Dim|None)
in_spatial_dims (Sequence[Dim|str]|None)
out_dim (Dim|None)
out_spatial_dims (Sequence[Dim]|None)
use_channel_first (bool|NotSpecified) – if set, will transform input to NCHW format
use_time_mask (bool)
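Example (a sketch; assuming the usual net-dict class name "pool", applied after the conv sketch above; names are made up):
# hypothetical layer names; "pool" is the assumed class name for PoolLayer
"pool0": {"class": "pool", "from": "conv0", "mode": "max", "pool_size": (2, 2), "padding": "same"}
With strides not given, it defaults to pool_size, i.e. 2x down-sampling per spatial dim.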
- classmethod get_out_data_from_opts(name, sources, network, pool_size, strides=None, dilation_rate=1, padding='VALID', in_dim=None, in_spatial_dims=None, out_dim=None, out_spatial_dims=None, use_channel_first=<class 'returnn.util.basic.NotSpecified'>, **kwargs)[source]¶
- Parameters:
name (str)
sources (list[LayerBase])
network (returnn.tf.network.TFNetwork)
pool_size (Sequence[int])
strides (Sequence[int]|int)
dilation_rate (int|Sequence[int])
padding (str|int|Sequence[int])
in_dim (Dim|None)
in_spatial_dims (Sequence[Dim|str]|None)
out_dim (Dim|None)
out_spatial_dims (Sequence[Dim]|None)
use_channel_first (bool|NotSpecified)
- Return type:
Data
- class returnn.tf.layers.basic.DctLayer(type=2, n=None, norm=None, **kwargs)[source]¶
Layer to perform DCT. Wraps tf.signal.dct(). For further documentation on the input arguments, refer to https://www.tensorflow.org/api_docs/python/tf/signal/dct
- Parameters:
type (int) – DCT type to perform. Must be 1, 2, 3, or 4
n (int|None) – length of the transform
norm (str|None) – normalization to apply. Must be None or “ortho”
- class returnn.tf.layers.basic.TransposedConvLayer(filter_size, strides=None, padding='same', remove_padding=0, output_padding=None, in_dim=None, in_spatial_dims=None, out_dim=None, out_spatial_dims=None, with_bias=True, activation=None, forward_weights_init='glorot_uniform', bias_init=0.0, filter=None, filter_perm=None, bias=None, use_time_mask=False, **kwargs)[source]¶
Transposed convolution, sometimes also called deconvolution. See tf.nn.conv2d_transpose() (currently we support 1D/2D).
- Parameters:
filter_size (list[int])
strides (list[int]|None) – specifies the upscaling. by default, same as filter_size
padding (str) – “same” or “valid”
remove_padding (list[int]|int)
output_padding (list[int|None]|int|None)
in_dim (Dim|None)
in_spatial_dims (list[Dim|str]|None)
out_dim (Dim|None)
out_spatial_dims (list[Dim]|None)
with_bias (bool) – whether to add a bias. enabled by default.
activation (str|None)
forward_weights_init
bias_init
filter (LayerBase|None) – if given, will not create an own parameter, but use this as the filter
filter_perm (dict[str,str]|None) – transposes the filter (input filter as layer)
bias (LayerBase|None) – if given, will not create an own parameter, but use this as the bias
use_time_mask (bool)
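Example (a sketch; assuming the usual net-dict class name "transposed_conv"; the layer name and dims are made up):
# hypothetical layer name and dims; "transposed_conv" is the assumed class name
"upsample": {"class": "transposed_conv", "from": "encoder", "filter_size": [2], "strides": [2], "padding": "same", "n_out": 256}
With strides equal to filter_size (which is also the default), this performs a 2x upsampling in time.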
- static deconv_output_length(input_length, filter_size, padding, output_padding=None, stride=0, dilation=1, out_dim=None)[source]¶
Determines output length of a transposed convolution given input length. Copied from conv_utils.deconv_output_length, adapted with simplification.
Also see ConvLayer.calc_out_dim().
- Parameters:
- Returns:
The output length (integer)
- Return type:
T
- classmethod get_out_data_from_opts(name, sources, network, filter_size, strides=None, padding='same', remove_padding=0, output_padding=None, n_out=None, out_dim=None, out_spatial_dims=None, in_dim=None, in_spatial_dims=None, **kwargs)[source]¶
- Parameters:
name (str)
sources (list[LayerBase])
network (returnn.tf.network.TFNetwork)
filter_size (list[int])
strides (list[int]|None)
padding (str)
remove_padding (list[int]|int)
output_padding (list[int|None]|int|None)
n_out (int|None) – number of outgoing features
out_dim (Dim|None)
out_spatial_dims (list[Dim]|None)
in_dim (Dim|None)
in_spatial_dims (list[Dim|str]|None)
- Return type:
Data
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer
- class returnn.tf.layers.basic.ReduceLayer(mode, axes=None, axis=None, keep_dims=False, enforce_batch_dim_axis=None, use_time_mask=None, **kwargs)[source]¶
This reduces some axis by using e.g. “sum” or “max”. It’s basically a wrapper around tf.reduce_sum or tf.reduce_max.
- Parameters:
mode (str) – “sum” or “max”, “argmin”, “min”, “argmax”, “mean”, “logsumexp”
axes (Sequence[Dim|str]) – one axis or multiple axes to reduce. It accepts the special tokens "B"|"batch", "spatial", "spatial_except_time", or "F"|"feature", and it is strongly recommended to use some of these symbolic names. See Data.get_axes_from_description().
axis (Dim|str) – for compatibility, can be used instead of axes
keep_dims (bool) – if dimensions should be kept (will be 1)
enforce_batch_dim_axis (int|None) – will swap the batch-dim-axis of the input with the given axis. e.g. 0: will convert the input into batch-major format if not already like that. Note that this is still not enough in some cases, e.g. when the other axes are also not as expected. The strong recommendation is to use a symbolic axis description.
use_time_mask (bool) – if we reduce over the time-dim axis, use the seq len info. By default, in that case, it will be True.
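Example (a sketch; the net-dict class name "reduce" is the same as in the LossLayer example further below; layer names are made up):
# hypothetical source layer "encoder"
"enc_mean": {"class": "reduce", "mode": "mean", "axis": "T", "from": "encoder"}
Because the reduction is over the time axis, the sequence-length mask is applied by default (use_time_mask).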
- classmethod reduce(input_data, mode, axes=None, keep_dims=False, enforce_batch_dim_axis=None, use_time_mask=None)[source]¶
- Parameters:
input_data (Data)
mode (str) – “sum” or “max”, “argmin”, “min”, “argmax”, “mean”, “logsumexp”
axes (int|list[int]|str) – one axis or multiple axes to reduce. It accepts the special tokens "B"|"batch", "spatial", "spatial_except_time", or "F"|"feature", and it is strongly recommended to use some of these symbolic names. See Data.get_axes_from_description().
keep_dims (bool) – if dimensions should be kept (will be 1)
enforce_batch_dim_axis (int) – will swap the batch-dim-axis of the input with the given axis. e.g. 0: will convert the input into batch-major format if not already like that. Note that this is still not enough in some cases, e.g. when the other axes are also not as expected. The strong recommendation is to use a symbolic axis description.
use_time_mask (bool) – if we reduce over the time-dim axis, use the seq len info. By default, in that case, it will be True.
- Return type:
tf.Tensor
- classmethod need_enforce_batch_dim_axis(axes)[source]¶
- Parameters:
axes (int|list[int]|str|Dim)
- Returns:
whether any integer is in axes, in which case we need a fixed dimension layout
- Return type:
bool
- classmethod get_axes(axis, input_data)[source]¶
- Parameters:
axis – see self.__init__()
input_data (Data)
- Returns:
list of axes
- Return type:
list[int]
- classmethod get_out_data_from_opts(name, sources, mode='', axes=None, axis=None, keep_dims=False, enforce_batch_dim_axis=None, **kwargs)[source]¶
- Parameters:
name (str)
sources (list[LayerBase])
mode (str) – (default here “” because other code uses this function)
axes (str|list[str]|None)
axis (str|None)
keep_dims (bool)
enforce_batch_dim_axis (int|None)
- Return type:
Data
- class returnn.tf.layers.basic.ReduceOutLayer(mode, num_pieces, out_dim=None, **kwargs)[source]¶
Combination of SplitDimsLayer applied to the feature dim and ReduceLayer applied to the resulting feature dim. This can e.g. be used to do maxout.
- Parameters:
mode (str) – “sum” or “max” or “mean”
num_pieces (int) – how many elements to reduce. The output dimension will be input.dim // num_pieces.
out_dim (Dim|None)
- class returnn.tf.layers.basic.SqueezeLayer(axis, enforce_batch_dim_axis=None, allow_no_op=False, **kwargs)[source]¶
Removes an axis with dimension 1. This is basically a wrapper around tf.squeeze.
- Parameters:
axis (Dim|int|list[int]|str) – one axis or multiple axes to squeeze. this is counted with batch-dim, which by default is axis 0 (see enforce_batch_dim_axis). it also accepts the special tokens "B"|"batch", "spatial", "spatial_except_time", or "F"|"feature"
enforce_batch_dim_axis (int|None)
allow_no_op (bool)
- class returnn.tf.layers.basic.StackLayer(axis=None, out_spatial_dim=None, **kwargs)[source]¶
Stacks multiple inputs together using tf.stack(). This creates a new dimension for the stack.
For concatenation (in feature dimension), see CopyLayer.
- Parameters:
axis (int|None) – new axis. If not given, will use Data.get_default_new_axis_for_dim_tag(<spatial>), i.e. some reasonable default for a new spatial axis.
out_spatial_dim (Dim|None)
- class returnn.tf.layers.basic.WeightedSumLayer(axes, padding=None, size=None, keep_dims=None, **kwargs)[source]¶
Calculates a weighted sum, either over a complete axis of fixed dimension, or over some window. Can also do that for multiple axes. The weights are a trainable parameter matrix. Similar would be to use ElemwiseProdLayer and ReduceLayer, or just a DotLayer with a VariableLayer. See also LinearLayer.
- Parameters:
axes (str|list[str]) – the axes to do the weighted-sum over
padding (str) – “valid” or “same”, in case of keep_dims=True
size (None|tuple[int]) – the kernel size. If not given, the axes must be of fixed dimension, and we will use keep_dims=False and padding="valid" by default. Otherwise, if given, you must also provide padding, and keep_dims=True is the default.
keep_dims (bool) – if False, the axes will be squeezed away. see also size.
- class returnn.tf.layers.basic.ElemwiseProdLayer(axes, size=None, **kwargs)[source]¶
Element-wise product in some axes. Microsoft calls this "static attention", in Deep Conv. NN with Layer-wise Context Expansion and Attention (LACE). The matrix/tensor to be used for the product is given as a trainable parameter. See also LinearLayer.
- Parameters:
axes (str|list[str]) – e.g. “spatial”, but all those axes must be of fixed dimension
size (tuple[int]) – for double-checking, you can explicitly provide the size
- class returnn.tf.layers.basic.PrefixInTimeLayer(axis='T', out_dim=None, prefix=0.0, repeat=1, size_base=None, **kwargs)[source]¶
Adds some prefix in the time dimension. This is kind of the reverse of what SliceNdLayer does. Also see PadLayer for static dimensions, and PostfixInTimeLayer.
- Parameters:
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str]) – will modify inplace
network (returnn.tf.network.TFNetwork)
get_layer (((str) -> LayerBase)) – function to get or construct another layer
- class returnn.tf.layers.basic.PostfixInTimeLayer(axis='T', out_dim=None, postfix=0.0, repeat=1, **kwargs)[source]¶
Adds some postfix in the time dimension. Also see PrefixInTimeLayer.
- Parameters:
- classmethod get_out_data_from_opts(name, sources, axis='T', out_dim=None, postfix=0.0, repeat=1, **kwargs)[source]¶
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer
- class returnn.tf.layers.basic.TimeChunkingLayer(chunk_size, chunk_step, axis='T', out_dim=None, **kwargs)[source]¶
Performs chunking in time. See returnn.tf.native_op.chunk(). See also WindowLayer and TimeUnChunkingLayer. It's very similar to WindowLayer, but we have this case more optimized, and it also modifies the batch dim. The output is of shape (chunk_size, n_batch * n_chunks, …).
- Parameters:
- class returnn.tf.layers.basic.TimeUnChunkingLayer(chunking_layer, **kwargs)[source]¶
Performs un-chunking in time, i.e. it reverts TimeChunkingLayer. See TFNativeOp.chunk().
- Parameters:
chunking_layer (TimeChunkingLayer)
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer
- class returnn.tf.layers.basic.DotLayer(reduce=<class 'returnn.util.basic.NotSpecified'>, red1=<class 'returnn.util.basic.NotSpecified'>, red2=<class 'returnn.util.basic.NotSpecified'>, var1=<class 'returnn.util.basic.NotSpecified'>, var2=<class 'returnn.util.basic.NotSpecified'>, add_var2_if_empty=<class 'returnn.util.basic.NotSpecified'>, use_mask: bool = True, debug=False, **kwargs)[source]¶
This performs a dot-product of two sources. The underlying matmul expects shapes (shared…, I, J) * (shared…, J, K) -> (shared…, I, K). We say that J is the axis to be reduced, I is the var-dim of source 1, and K is the var-dim of source 2. I, J, K can also be multiple axes from the sources. The var-dims don’t need to exist. All other axes (shared…) are expected to match.
You should try to avoid having the same dims in both sources when they are not reduced, such that you would end up having some dim twice in the output, e.g. (shared…, I, I). You should avoid this because the dim order should never matter (https://github.com/rwth-i6/returnn/wiki/RETURNN-principles). If you need to perform such an operation, you can use ReinterpretDataLayer to introduce a new dim tag.
The reduce dim can also be the sparse dim of one of the sources. In this case, it behaves like GatherLayer.
- Parameters:
reduce (str|Dim|tuple[str|Dim]|list[str|Dim]) – reduce axes of both sources
red1 (str|Dim|tuple[str|Dim]|list[str|Dim]) – reduce axes of first source
red2 (str|Dim|tuple[str|Dim]|list[str|Dim]) – reduce axes of second source
var1 (str|Dim|tuple[str|Dim]|list[str|Dim]|None) – var axes of first source
var2 (str|Dim|tuple[str|Dim]|list[str|Dim]|None) – var axes of second source
add_var2_if_empty (bool) – if var2=None, add dim=1 at the end
use_mask – If the reduction is over dynamic axes, to get the correct sum reduction, we need to apply masking to one of the inputs. This is done automatically. By disabling this flag, this would be disabled.
debug (bool) – will print debug shapes, etc.
- Earlier defaults: red1=-1, red2=-2, var1=-2, var2=-1, add_var2_if_empty=True. However, these defaults are problematic for multiple reasons (e.g. they rely on integer axis indices). See https://github.com/rwth-i6/returnn/issues/627 for details.
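Example (a sketch; assuming the usual net-dict class name "dot"; the source layers and their dims are hypothetical, e.g. "query" of shape [B,F] and "keys" of shape [B,T,F] sharing the same feature dim tag):
# hypothetical layers "query" and "keys" sharing the same feature dim tag
"energy": {"class": "dot", "from": ["query", "keys"], "reduce": "F", "var1": None, "var2": "T"}
This reduces the shared feature dim; the output has shape [B,T] (the batch dim is shared, source 1 contributes no var dim, and the time axis is the var dim of source 2).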
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str]) – will modify inplace
network (returnn.tf.network.TFNetwork)
get_layer (((str) -> LayerBase)) – function to get or construct another layer
- classmethod get_out_data_from_opts(name, sources, reduce=<class 'returnn.util.basic.NotSpecified'>, red1=<class 'returnn.util.basic.NotSpecified'>, red2=<class 'returnn.util.basic.NotSpecified'>, var1=<class 'returnn.util.basic.NotSpecified'>, var2=<class 'returnn.util.basic.NotSpecified'>, add_var2_if_empty=<class 'returnn.util.basic.NotSpecified'>, **kwargs)[source]¶
- Parameters:
name (str)
sources (list[LayerBase])
reduce (str|Dim|tuple[str|Dim]|list[str|Dim]) – reduce axes of both sources
red1 (str|Dim|tuple[str|Dim]|list[str|Dim]) – reduce axes of first source
red2 (str|Dim|tuple[str|Dim]|list[str|Dim]) – reduce axes of second source
var1 (str|Dim|tuple[str|Dim]|list[str|Dim]|None) – var axes of first source
var2 (str|Dim|tuple[str|Dim]|list[str|Dim]|None) – var axes of second source
add_var2_if_empty (bool)
- Return type:
Data
- class returnn.tf.layers.basic.ShiftAxisLayer(axis, amount, pad=True, pad_value=0, adjust_size_info=True, **kwargs)[source]¶
Shifts the elements along an axis by slicing and optional padding. This layer may change the axis dimension.
This name might be confusing: no axis will be shifted here (see SwapAxesLayer for that). Also see SliceLayer.
- Parameters:
axis (str|Dim|int) – single axis to shift
amount (int) – number of elements to shift (<0 for left-shift, >0 for right-shift)
pad (bool) – preserve shape by padding
pad_value (int|float|bool) – padding value
adjust_size_info (bool) – whether to adjust the size_placeholder
- class returnn.tf.layers.basic.ResizeLayer(factor, axis, out_dim=None, kind='nn', fill_value=None, fill_dropout=None, **kwargs)[source]¶
Resizes the input, i.e. upsampling or downsampling. Supports different kinds, such as linear interpolation or nearest-neighbor.
- Parameters:
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer ((str)->LayerBase)
- class returnn.tf.layers.basic.CombineDimsLayer(**kwargs)[source]¶
Combines multiple dimensions. See also MergeDimsLayer. This is deprecated in favor of MergeDimsLayer.
- Parameters:
axes (int|list[int]|str) – one axis or multiple axes to reduce. this is counted with batch-dim, which by default is axis 0 (see enforce_batch_dim_axis). it also accepts the special tokens "B"|"batch", "spatial", "spatial_except_time", or "F"|"feature"
- class returnn.tf.layers.basic.RemoveLayer(symbol, axis='T', out_dim=None, **kwargs)[source]¶
Currently, assumes sparse data, and removes a specific symbol from the data.
It is recommended to use MaskedComputationLayer in combination with e.g. a CompareLayer instead, as this provides more flexibility.
- Parameters:
- class returnn.tf.layers.basic.CombineLayer(kind, sources, allow_broadcast_all_sources=<class 'returnn.util.basic.NotSpecified'>, activation=None, with_bias=False, eval=None, eval_locals=None, eval_for_output_loss=False, **kwargs)[source]¶
Applies a binary operation, such as addition, to all sources while accumulating the partial results. In the first step, the binary operation is performed on the first two sources. After the first step, the previous result is always the left-hand operand.
Its basic working is similar to the reduce function used in functional programming. Also see ActivationLayer, or CompareLayer.
- Parameters:
kind (str) – currently accepted values are average, add, sub, mul, truediv, floordiv, mod, pow, maximum, minimum, logical_and, logical_or, squared_difference, or eval, or any function in the tf.math or tf namespace.
sources (list[LayerBase])
allow_broadcast_all_sources (bool|NotSpecified) – allow broadcasting for all sources. e.g. shape [A] + [B] -> shape [A,B]. by default disabled, and there must be some source with all dims.
activation (str|None) – if provided, activation function to apply, e.g. “tanh” or “relu”
with_bias (bool) – if given, will add a trainable bias tensor
eval (str|callable) – for kind=”eval”, will eval this string. or function. see
_op_kind_eval()
eval_locals (dict[str]|None) – locals for eval
eval_for_output_loss (bool) – will do the same eval on layer.output_loss
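Example (a sketch; assuming the usual net-dict class name "combine"; layer names are made up):
# hypothetical layer names
"added": {"class": "combine", "kind": "add", "from": ["layer1", "layer2"]}
"mixed": {"class": "combine", "kind": "eval", "from": ["layer1", "layer2"], "eval": "0.5 * source(0) + 0.5 * source(1)"}
The second entry uses kind="eval" with an eval string in the same source(i) style as the LossLayer example further below.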
- classmethod get_out_data_from_opts(network, sources, eval_locals=None, n_out=<class 'returnn.util.basic.NotSpecified'>, out_type=None, allow_broadcast_all_sources=<class 'returnn.util.basic.NotSpecified'>, out_shape=None, **kwargs)[source]¶
- Parameters:
network (returnn.tf.network.TFNetwork)
sources (list[LayerBase])
eval_locals (dict[str]|None) – locals for eval; will also be passed to out_type if out_type is a function
n_out (int|None|NotSpecified)
allow_broadcast_all_sources (bool|NotSpecified)
out_type (dict[str]|None|(()->Data))
out_shape (set[Dim|_MarkedDim]|tuple|list|None) – verifies the output shape (dim tags)
- Return type:
Data
- class returnn.tf.layers.basic.EvalLayer(eval, **kwargs)[source]¶
Evaluates some string. The CombineLayer provides this functionality, thus this is just a special case of it. Also see ActivationLayer, or CompareLayer.
The output type is defined as a broadcasted extension of all sources. You can overwrite it by (partially) specifying out_type. out_type can also be a generic Python function, returning a Data instance.
- Parameters:
eval (str) – will eval this string. see
_op_kind_eval()
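Example (a sketch; the net-dict class name "eval" and the source(i) convention are the same as in the LossLayer example further below; layer names are made up):
# same squared-error computation as in the explicit-loss example under LossLayer below
"se": {"class": "eval", "from": ["output", "data:classes"], "eval": "(source(0) - source(1)) ** 2"}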
- class returnn.tf.layers.basic.CompareLayer(kind='equal', value=None, allow_broadcast_all_sources=<class 'returnn.util.basic.NotSpecified'>, **kwargs)[source]¶
Compares element-wise the tokens of all input sequences among themselves and/or with a specified given value. The comparisons are performed in a chain according to the order in which they are listed.
Example:
{"class": "compare", "from": ["i1", "i2"], "value": val, "kind": "less"}
computes i1 < i2 < val and it is true only if the whole chain of operations is true. The final result is the logical “and” of all comparisons. Note that value is the last element to be compared to.
A common example usage is the end layer in a rec subnetwork to specify the stopping criterion, e.g. the last generated token is equal to the end-of-sentence token:
"output": {"class": "rec", "from": [], "unit": { . . . "end": {"class": "compare", "from": "output", "value": end_of_sentence_id} }, "target": "classes0"}
- Parameters:
kind (str) – which comparison operation to use, e.g. “equal”, “greater”, “less” or other supported TF comparison ops
value (float|int|None) – if specified, will also compare to this
allow_broadcast_all_sources (bool|NotSpecified) – allow broadcasting for all sources. e.g. shape [A] + [B] -> shape [A,B]. by default disabled, and there must be some source with all dims.
- classmethod get_out_data_from_opts(sources, allow_broadcast_all_sources=<class 'returnn.util.basic.NotSpecified'>, n_out=<class 'returnn.util.basic.NotSpecified'>, out_type=None, out_shape=None, **kwargs)[source]¶
- Parameters:
sources (list[LayerBase])
allow_broadcast_all_sources (bool|NotSpecified)
n_out (int|None|NotSpecified)
out_type (dict[str]|None)
out_shape (dict[str]|None)
- Return type:
Data
- class returnn.tf.layers.basic.SwitchLayer(condition, true_from, false_from, **kwargs)[source]¶
Wrapper around tf.where() (or more generically returnn.tf.util.basic.where_bc()), or statically choose a single source if the condition is a callable (…)->bool. (tf.cond is not useful here, as the sources would have been already constructed and computed.)
This layer is also useful for applying any kind of generic masking to the frames. E.g. one could have a layer called "mask" computing a boolean mask for the values stored in another layer "input". Then use this layer with condition="mask", true_from="input", false_from=mask_value, to mask out all frames where the mask is false with the mask_value.
See also CondLayer. See also SeqLenMaskLayer if you just want to mask using the sequence lengths.
- Parameters:
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str]) – will modify inplace
network (returnn.tf.network.TFNetwork)
get_layer (((str) -> LayerBase)) – function to get or construct another layer
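Example for the masking use case described above (a sketch; assuming the usual net-dict class names "compare" and "switch"; layer names are made up):
# hypothetical layer "scores"; keep frames where scores > 0, otherwise use 0.0
"mask": {"class": "compare", "from": "scores", "value": 0.0, "kind": "greater"}
"masked": {"class": "switch", "condition": "mask", "true_from": "scores", "false_from": 0.0}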
- class returnn.tf.layers.basic.CondLayer(condition, true_layer, false_layer, _condition_network=None, _true_layer_network=None, _false_layer_network=None, _extra_out=None, **kwargs)[source]¶
See also SwitchLayer, which uses tf.where(). Here, we use tf.cond instead. I.e. the condition has to be a scalar bool, and only the corresponding true/false branch is computed.
true_layer/false_layer are layer dicts, which are in the same name scope as this layer; however, they are in the corresponding control flow context (tf.cond).
You can use SubnetworkLayer inside to embed any more complex logic.
There can be more than one output via sub-layers. Specifically, it will make all names from get_available_sub_layer_names() available. In SubnetworkLayer, these are all the output layers in the sub-network.
- Parameters:
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer ((str)->LayerBase)
- classmethod get_out_data_from_opts(true_layer, false_layer, name, network, **kwargs)[source]¶
- Parameters:
true_layer (LayerBase|dict[str])
false_layer (LayerBase|dict[str])
name (str)
network (returnn.tf.network.TFNetwork)
- Return type:
Data
- classmethod get_available_sub_layer_names(parent_layer_kwargs)[source]¶
- Parameters:
parent_layer_kwargs (dict[str])
- Return type:
list[str]
- classmethod get_sub_layer_out_data_from_opts(layer_name, parent_layer_kwargs)[source]¶
- Parameters:
layer_name (str) – name of the sub_layer (right part of ‘/’ separated path)
parent_layer_kwargs (dict[str]) – kwargs for the parent layer (as kwargs in cls.get_out_data_from_opts())
- Returns:
Data template, class type of sub-layer, layer opts (transformed)
- Return type:
(Data, type, dict[str])|None
- class returnn.tf.layers.basic.TopKLayer(axis, k, k_dim=None, sorted=True, **kwargs)[source]¶
Basically wraps tf.nn.top_k.
Directly returns the top_k values. The indices are accessible via the “indices” sub-layer.
For an input [B,D] with axis=D, the output and indices values are shape [B,K].
It's somewhat similar to ReduceLayer with max and argmax. The axis dim is reduced and then a new dim for K is added.
Axis can also cover multiple axes, such as [beam,classes]. In that case, there is not a single "indices" sub-layer, but sub-layers "indices0" .. "indices{N-1}" corresponding to each axis, in the same order.
All other axes are treated as batch dims.
- Parameters:
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer
- get_sub_layer(layer_name)[source]¶
- Parameters:
layer_name (str) – sub layer name
- Return type:
LayerBase|None
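Example (a sketch; assuming the usual net-dict class names "top_k" and "copy"; layer names are made up):
# hypothetical source "log_prob"
"topk": {"class": "top_k", "from": "log_prob", "axis": "F", "k": 8, "sorted": True}
"topk_indices": {"class": "copy", "from": "topk/indices"}
The layer output is the top-k values; the corresponding indices are accessed via the "indices" sub-layer, as described above.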
- class returnn.tf.layers.basic.SearchSortedLayer(sorted_sequence, values, axis='T', side='left', **kwargs)[source]¶
Basically wraps tf.searchsorted().
Takes a tensor sorted_sequence that is sorted along one axis, and a tensor values. Will compute an output tensor with the same axes as values, where each entry is the index of the value within the sorted sequence. All (batch) axes of sorted_sequence except for the axis it is sorted along must be present in values.
- Parameters:
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str]) – will modify inplace
network (returnn.tf.network.TFNetwork)
get_layer (((str) -> LayerBase)) – function to get or construct another layer
- classmethod get_out_data_from_opts(sorted_sequence, values, axis, name, network, **kwargs)[source]¶
- Parameters:
sorted_sequence (LayerBase)
values (LayerBase) – search values
axis (str) – the axis along which sorted_sequence is sorted
name (str)
network (returnn.tf.network.TFNetwork)
- Return type:
Data
- class returnn.tf.layers.basic.SubnetworkLayer(subnetwork, _subnet, _output, concat_sources=True, load_on_init=None, dropout=0, dropout_noise_shape=None, _parent_layer_cache=None, _from=None, **kwargs)[source]¶
You can define a whole subnetwork as a single layer by this class.
The subnetwork will be specified by a dict[str,dict[str]], just like a normal network is specified in the config.
The "output" layer of the subnetwork will be the output of this subnetwork-layer.
- With concat_sources=True (default), the input to this layer will be represented as "data:data" or simply "data" in the subnetwork,
- otherwise with concat_sources=False, the input to this layer will be represented as "data:input_layer_name" and also as "data:0" to "data:<n-1>" for n inputs, for each input, in the subnetwork. The first input will also be simply available as "data:data"/"data".
- Parameters:
subnetwork (dict[str,dict]) – subnetwork as dict (JSON content). must have an "output" layer
concat_sources (bool) – if we concatenate all sources into one, like it is standard for most other layers
load_on_init (str|dict[str]|None) – if provided, for parameter initialization, we will load the given model file. see CustomCheckpointLoader.
dropout (float) – will be applied if train_flag is set
dropout_noise_shape (tuple|list|dict|None)
_parent_layer_cache (dict[str,LayerBase]|None)
_subnet (returnn.tf.network.Subnetwork)
_output (LayerBase)
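Example (a sketch; assuming the usual net-dict class names "subnetwork" and "linear"; layer names and dims are made up):
# hypothetical layer names and dims
"ff": {"class": "subnetwork", "from": "encoder", "subnetwork": {
    "hidden": {"class": "linear", "activation": "relu", "n_out": 512, "from": "data"},
    "output": {"class": "linear", "activation": None, "n_out": 256, "from": "hidden"},
}}
With concat_sources=True (the default), the input "encoder" is available as "data" inside the subnetwork, and the "output" sub-layer defines the output of "ff".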
- classmethod get_out_data_from_opts(n_out=<class 'returnn.util.basic.NotSpecified'>, out_type=None, **kwargs)[source]¶
- Parameters:
n_out (int|None|NotSpecified)
out_type (dict[str]|None)
- Return type:
Data
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer
- classmethod get_sub_layer_out_data_from_opts(layer_name, parent_layer_kwargs)[source]¶
- Parameters:
layer_name (str) – name of the sub_layer (right part of ‘/’ separated path)
parent_layer_kwargs (dict[str]) – kwargs for the parent layer (as kwargs in cls.get_out_data_from_opts())
- Returns:
Data template, class type of sub-layer, layer opts (transformed)
- Return type:
(Data, type, dict[str])|None
- classmethod cls_get_sub_network(name, network, layer_desc)[source]¶
- Parameters:
name (str)
network (returnn.tf.network.TFNetwork)
layer_desc (dict[str])
- Return type:
returnn.tf.network.Subnetwork|None
- get_sub_layer(layer_name)[source]¶
- Parameters:
layer_name (str) – name of the sub_layer (right part of ‘/’ separated path)
- Returns:
the sub_layer addressed in layer_name or None if no sub_layer exists
- Return type:
LayerBase|None
- classmethod get_available_sub_layer_names(parent_layer_kwargs)[source]¶
- Parameters:
parent_layer_kwargs (dict[str])
- Return type:
list[str]
- get_dep_layers()[source]¶
- Returns:
list of layers this layer depends on. normally this is just self.sources but e.g. the attention layer in addition has a base, etc.
- Return type:
list[LayerBase]
- get_last_hidden_state(key)[source]¶
- Parameters:
key (int|str|None) – also the special key “*”
- Return type:
tf.Tensor|None
- classmethod get_rec_initial_extra_outputs(batch_dim, rec_layer, encapsulate=False, **kwargs)[source]¶
- Parameters:
batch_dim (tf.Tensor) – for this layer, might be with beam
rec_layer (returnn.tf.layers.rec.RecLayer)
encapsulate (bool)
- Return type:
dict[str,tf.Tensor]
- classmethod get_rec_initial_extra_outputs_shape_invariants(rec_layer, encapsulate=False, **kwargs)[source]¶
- Parameters:
rec_layer (returnn.tf.layers.rec.RecLayer)
encapsulate (bool)
- Returns:
optional shapes for the tensors by get_rec_initial_extra_outputs
- Return type:
dict[str,tf.TensorShape]
- class returnn.tf.layers.basic.TrainFlagLayer(**kwargs)[source]¶
Returns the train flag (bool scalar) of the current network.
Usually the arguments, when specified in the network dict, are going through transform_config_dict(), before they are passed to here. See TFNetwork.construct_from_dict().
- Parameters:
name (str)
network (returnn.tf.network.TFNetwork)
output (Data) – Set a specific output instead of using
get_out_data_from_opts()
n_out (NotSpecified|None|int) – output dim
out_dim (returnn.tensor.Dim|None) – output feature dim tag
out_type (dict[str]) – kwargs for Data class. more explicit than n_out.
out_shape (set[returnn.tensor.Dim|returnn.tf.util.data._MarkedDim]|tuple|list|None) – verifies the output shape (dim tags). See Data.verify_out_shape().
sources (list[LayerBase]) – via self.transform_config_dict()
in_dim (returnn.tensor.Dim|None) – input feature dim tag
target (str|list[str]|None) – if some loss is set, this is the target data-key, i.e. network.extern_data.get_data(target). alternatively, this also can be a layer name.
_target_layers (dict[str,LayerBase]|None) – if target.startswith(“layer:”), then this is target -> layer
size_target (str|None) – like target but this is only used to set our output size in case of training
loss (Loss|None) – via transform_config_dict(). Every layer can have one loss (of type Loss), or no loss. In the net dict, it is specified as a string. In TFNetwork, all losses from all layers will be collected. That is what TFUpdater.Updater will use for training.
reuse_params (ReuseParams|None) – if given, will optionally reuse the params. see self.var_creation_scope(). See also the name_scope option as an alternative.
name_scope (str|None) – If set, uses this custom (relative) name scope. If it starts with a "/", it will be the absolute name scope. It should not end with a "/". It can be empty, in which case it will not consume a new name scope. This can also be used for parameter sharing. The default is the layer name in most cases, but this logic is in get_absolute_name_scope_prefix() and TFNetwork.layer_creation_scope().
param_device (str|None) – e.g. "CPU", etc. any valid name for tf.device. see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/util/device_name_utils.h
L2 (float|None) – for constraints
darc1 (float|None) – for constraints. see Generalization in Deep Learning, https://arxiv.org/abs/1710.05468
spatial_smoothing (float|None) – see
returnn.tf.util.basic.spatial_smoothing_energy()
param_variational_noise (float|None) – adds variational noise to the params during training
param_dropout (float|None) – dropout on params (weight dropout) during training
param_dropout_min_ndim (int|None) – if param dropout is enabled, only use it for params whose ndim >= this. E.g. it might make sense to disable it for bias params or scalars, so set param_dropout_min_ndim=2.
updater_opts (dict[str]|None) – accepts similar opts as TFUpdater, e.g. “optimizer”, “learning_rate”, …
is_output_layer (bool|None) – triggers the construction of this layer in the root net. Inside a RecLayer, it triggers the explicit accumulation of all frames. Also see the need_last option.
only_on_eval (bool) – if True, this layer will only be calculated in eval
only_on_search (bool) – if True, this layer will only be calculated when search is done
copy_output_loss_from_source_idx (int|None) – if set, will copy output_loss from this source
batch_norm (bool|dict) – see self.batch_norm()
initial_output (str|float) – used for recurrent layer, see self.get_rec_initial_output()
state – explicitly defines the rec state. initial_state would define the initial state (in the first frame)
need_last (bool) – Inside RecLayer, make sure that we can access the last frame. Similar to is_output_layer, but this is specifically about the last frame, i.e. it does not trigger accumulation.
rec_previous_layer (LayerBase|None) – via the recurrent layer, layer (template) which represents the past of us. You would not explicitly set this in a config. This is set automatically, internally, via RecLayer.
encapsulate (bool) – mostly relevant for SubnetworkLayer and similar: If True, all sub layers will be created, and covered in functions like get_rec_initial_extra_outputs(), and the logic in cls_get_sub_network() will not be used. If False, the logic in cls_get_sub_network() will be used.
collocate_with (list[str]|None) – in the rec layer, collocate with the specified other layers
trainable (bool) – whether the parameters of this layer will be trained. Default is True. However, if this is inside a subnetwork, all the parent layers must be set to trainable, otherwise the parameters will not be trainable.
custom_param_importer (str|callable|None) – used by
set_param_values_by_dict()
register_as_extern_data (str|None) – registers output in network.extern_data
control_dependencies_on_output (None|((LayerBase)->list[tf.Operation])) – This is mostly to perform some checks after the layer output has been computed, before the layer output is used anywhere else. There is also the IdentityLayer with the option control_dependencies.
debug_print_layer_output (None|bool|dict[str]) – same as global config option but per layer
_name (str) – just for internal construction, should be the same as
name
_network (returnn.tf.network.TFNetwork) – just for internal construction, should be the same as
network
_src_common_search_choices (None|SearchChoices) – set via
SearchChoices.translate_to_common_search_beam()
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str]) – will modify inplace
network (returnn.tf.network.TFNetwork)
get_layer
- class returnn.tf.layers.basic.GlobalTrainStepLayer(**kwargs)[source]¶
Returns the global train step (int64 scalar).
Usually the arguments, when specified in the network dict, are going through transform_config_dict(), before they are passed to here. See TFNetwork.construct_from_dict().
- Parameters:
Same generic layer options (name, network, output, n_out, out_dim, out_type, out_shape, sources, loss, etc.) as listed above for TrainFlagLayer, inherited from LayerBase.
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str]) – will modify inplace
network (returnn.tf.network.TFNetwork)
get_layer
- class returnn.tf.layers.basic.AccumulateMeanLayer(exp_average, axes='bt', initial_value=None, is_prob_distribution=None, **kwargs)[source]¶
Accumulates the mean of the input (in training), over batch-dim and time-dim by default. It's similar to ReduceLayer.
- Parameters:
exp_average (float) – momentum in exponential average calculation
axes (int|list[str]|str) – the axes to reduce. must contain batch and time.
initial_value (float) – how to initialize the variable which accumulates the mean
is_prob_distribution (bool) – if provided, better default for initial_value
- class returnn.tf.layers.basic.LossLayer(loss_, target_=None, use_error=False, **kwargs)[source]¶
This layer wraps a Loss calculation as a layer. I.e. the loss will be calculated and returned by the layer. But this loss will not be used as a loss by the updater. If you want to use it as a loss, you can use the AsIsLoss, i.e. write "loss": "as_is".
Note that the loss options for the wrapped loss need to be provided via loss_opts_, and it does not apply any reduce function.
Note: The LossLayer might be deprecated in the future in favor of implementing the losses as actual layers.
If you want to define a loss inside the network, it is recommended to define it explicitly. An example could be:
"se_loss": {"class": "eval", "eval": "(source(0) - source(1)) ** 2", "from": ["output", "data:classes"]}
Followed by an e.g. mean reduce if needed:
"mse_loss": {"class": "reduce", "mode": "mean", "axis": "F", "from": "se_loss"}
loss_ and related params have the postfix _ to distinguish them from the loss options, which are used by the network and updater for training. Some of these (e.g. loss_opts_) are handled in transform_config_dict().
- Parameters:
- get_sub_layer(layer_name)[source]¶
- Parameters:
layer_name (str) – sub layer name
- Return type:
LayerBase|None
- classmethod get_available_sub_layer_names(parent_layer_kwargs)[source]¶
- Parameters:
parent_layer_kwargs (dict[str])
- Return type:
list[str]
- classmethod get_sub_layer_out_data_from_opts(layer_name, parent_layer_kwargs)[source]¶
- Parameters:
layer_name (str) – sub layer name
parent_layer_kwargs (dict[str])
- Returns:
Data template, class type of sub-layer, layer opts (transformed)
- Return type:
(Data, type, dict[str])|None
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer
- class returnn.tf.layers.basic.ForcedAlignmentLayer(align_target, topology, input_type, blank_idx=-1, blank_included=False, **kwargs)[source]¶
Calculates a forced alignment, via Viterbi algorithm.
- Parameters:
align_target (LayerBase)
topology (str) – e.g. “ctc” or “rna” (RNA is CTC without label loop)
input_type (str) – “log_prob” or “prob”
blank_idx (int) – vocab index of the blank symbol
blank_included (bool) – whether blank token of the align target is included in the vocabulary
- classmethod get_sub_layer_out_data_from_opts(layer_name, parent_layer_kwargs)[source]¶
- Parameters:
layer_name (str) – sub layer name
parent_layer_kwargs (dict[str])
- Returns:
Data template, class type of sub-layer, layer opts (transformed)
- Return type:
(Data, type, dict[str])|None
- classmethod get_available_sub_layer_names(parent_layer_kwargs)[source]¶
- Parameters:
parent_layer_kwargs (dict[str])
- Return type:
list[str]
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer
- class returnn.tf.layers.basic.SparseSoftmaxCrossEntropyWithLogitsLayer(logits, targets, axis=None, **kwargs)[source]¶
This is a simple wrapper for tf.nn.sparse_softmax_cross_entropy_with_logits.
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer
- class returnn.tf.layers.basic.CtcLossLayer(logits, targets, logits_normalized=False, blank_index=-1, max_approx=False, **kwargs)[source]¶
Calculates the CTC loss.
Internally, this uses returnn.tf.native_op.ctc_loss(), which is equivalent to tf.nn.ctc_loss but more efficient.
Output is of shape [B].
- Parameters:
logits (LayerBase) – (before softmax). shape [B,T,D]
targets (LayerBase) – sparse. shape [B,T]
logits_normalized (bool) – whether the logits are already normalized (e.g. via log-softmax)
blank_index (int) – vocab index of the blank symbol
max_approx (bool) – if True, use max instead of sum over alignments (max approx, Viterbi)
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer
- class returnn.tf.layers.basic.FastBaumWelchLayer(align_target, align_target_key=None, ctc_opts=None, sprint_opts=None, input_type='log_prob', tdp_scale=1.0, am_scale=1.0, min_prob=0.0, staircase_seq_len_source=None, **kwargs)[source]¶
Calls fast_baum_welch() or fast_baum_welch_by_sprint_automata(). We expect that our input is +log scores, e.g. use log-softmax.
- Parameters:
align_target (str) – e.g. “sprint”, “ctc” or “staircase”
align_target_key (str|None) – e.g. “classes”, used for e.g. align_target “ctc”
ctc_opts (dict[str]) – used for align_target “ctc”
sprint_opts (dict[str]) – used for Sprint (RASR) for align_target “sprint”
input_type (str) – “log_prob” or “prob”
tdp_scale (float)
am_scale (float)
min_prob (float) – clips the minimum prob (value in [0,1])
staircase_seq_len_source (LayerBase|None)
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer
- class returnn.tf.layers.basic.GradientLayer(y: LayerBase, x: LayerBase, **kwargs)[source]¶
Calculates the gradient of y w.r.t. x.
- Parameters:
y
x
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer
- class returnn.tf.layers.basic.SyntheticGradientLayer(gradient, meta_loss_scale=1.0, **kwargs)[source]¶
This is a generalized way to be able to replace the true gradient with any kind of predicted gradient. This enables implementing the idea from here:
Decoupled Neural Interfaces using Synthetic Gradients, https://arxiv.org/abs/1608.05343
- Parameters:
gradient (LayerBase)
meta_loss_scale (float)
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer
- class returnn.tf.layers.basic.TikhonovRegularizationLayer(meta_loss_scale=1.0, **kwargs)[source]¶
Adds the Tikhonov regularization as a meta-loss (see returnn.tf.util.basic.MetaLosses).
- Parameters:
meta_loss_scale (float)
- class returnn.tf.layers.basic.FramewiseStatisticsLayer(sil_label_idx, histogram_num_bins=20, **kwargs)[source]¶
Collects various statistics (such as FER, etc) on the sources. The tensors will get stored in self.stats which will be collected by TFEngine.
Usually the arguments, when specified in the network dict, are going through
transform_config_dict()
, before they are passed to here. SeeTFNetwork.construct_from_dict()
.- Parameters:
name (str)
network (returnn.tf.network.TFNetwork)
output (Data) – Set a specific output instead of using
get_out_data_from_opts()
n_out (NotSpecified|None|int) – output dim
out_dim (returnn.tensor.Dim|None) – output feature dim tag
out_type (dict[str]) – kwargs for Data class. more explicit than n_out.
out_shape (set[returnn.tensor.Dim|returnn.tf.util.data._MarkedDim]|tuple|list|None) – verifies the output shape (dim tags). See
Data.verify_out_shape()
.sources (list[LayerBase]) – via self.transform_config_dict()
in_dim (returnn.tensor.Dim|None) – input feature dim tag
target (str|list[str]|None) – if some loss is set, this is the target data-key, i.e. network.extern_data.get_data(target). alternatively, this also can be a layer name.
_target_layers (dict[str,LayerBase]|None) – if target.startswith(“layer:”), then this is target -> layer
size_target (str|None) – like target but this is only used to set our output size in case of training
loss (Loss|None) – via
transform_config_dict()
. Every layer can have one loss (of typeLoss
), or none loss. In the net dict, it is specified as a string. InTFNetwork
, all losses from all layers will be collected. That is whatTFUpdater.Updater
will use for training.reuse_params (ReuseParams|None) – if given, will opt reuse the params. see
self.var_creation_scope()
. See also thename_scope
option as an alternative.name_scope (str|None) – If set, uses this custom (relative) name scope. If it starts with a “/”, it will be the absolute name scope. It should not end with a “/”. It can be empty, in which case it will not consume a new name scope. This can also be used for parameter sharing. The default is the layer name in most cases, but this logic is in
get_absolute_name_scope_prefix()
andTFNetwork.layer_creation_scope()
.param_device (str|None) – e.g. “CPU”, etc. any valid name for tf.device. see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/util/device_name_utils.h
L2 (float|None) – for constraints
darc1 (float|None) – for constraints. see Generalization in Deep Learning, https://arxiv.org/abs/1710.05468
spatial_smoothing (float|None) – see
returnn.tf.util.basic.spatial_smoothing_energy()
param_variational_noise (float|None) – adds variational noise to the params during training
param_dropout (float|None) – dropout on params (weight dropout) during training
param_dropout_min_ndim (int|None) – if param dropout is enabled, only use if for params whose ndim >= this. E.g. it might make sense to disable it for bias params or scalars, so set param_dropout_min_ndim=2.
updater_opts (dict[str]|None) – accepts similar opts as TFUpdater, e.g. “optimizer”, “learning_rate”, …
is_output_layer (bool|None) – triggers the construction of this layer in the root net. Inside a
RecLayer
, it triggers the explicit accumulation of all frames. Also see theneed_last
option.only_on_eval (bool) – if True, this layer will only be calculated in eval
only_on_search (bool) – if True, this layer will only be calculated when search is done
copy_output_loss_from_source_idx (int|None) – if set, will copy output_loss from this source
batch_norm (bool|dict) – see self.batch_norm()
initial_output (str|float) – used for recurrent layer, see self.get_rec_initial_output()
state – explicitly defines the rec state. initial_state would define the initial state (in the first frame)
need_last (bool) – Inside
RecLayer
, make sure that we can access the last frame. Similar to is_output_layer, but this is specifically about the last frame, i.e. it does not trigger accumulation.rec_previous_layer (LayerBase|None) – via the recurrent layer, a layer (template) which represents our own past. You would not explicitly set this in a config. This is set automatically, internally, via
RecLayer
.encapsulate (bool) –
mostly relevant for SubnetworkLayer and similar: If True, all sub layers will be created,
and covered in functions like
get_rec_initial_extra_outputs()
, and the logic in cls_get_sub_network()
will not be used. If False, the logic in
cls_get_sub_network()
will be used.collocate_with (list[str]|None) – in the rec layer, collocate with the specified other layers
trainable (bool) – whether the parameters of this layer will be trained. Default is True. However, if this is inside a subnetwork, all the parent layers must be set to trainable, otherwise the parameters will not be trainable.
custom_param_importer (str|callable|None) – used by
set_param_values_by_dict()
register_as_extern_data (str|None) – registers output in network.extern_data
control_dependencies_on_output (None|((LayerBase)->list[tf.Operation])) – This is mostly to perform some checks after the layer output has been computed, before the layer output is used anywhere else. There is also the
IdentityLayer
with the optioncontrol_dependencies
.debug_print_layer_output (None|bool|dict[str]) – same as global config option but per layer
_name (str) – just for internal construction, should be the same as
name
_network (returnn.tf.network.TFNetwork) – just for internal construction, should be the same as
network
_src_common_search_choices (None|SearchChoices) – set via
SearchChoices.translate_to_common_search_beam()
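Illustrative example (a minimal sketch): several of the common options above as they would appear in a network dict. The layer class names "linear"/"softmax" and the loss name "ce" are the usual RETURNN registrations; all other names are made up for illustration.
network = {
    "encoder": {"class": "linear", "activation": "tanh", "n_out": 512, "from": "data",
                "L2": 0.0001, "trainable": True},       # L2 constraint, explicitly trainable
    "output": {"class": "softmax", "from": "encoder", "target": "classes",
               "loss": "ce",                            # loss specified as a string in the net dict
               "is_output_layer": True},                # force construction in the root net
}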
- class returnn.tf.layers.basic.PrintLayer(summarize=99, extra_print_args=(), **kwargs)[source]¶
Prints the sources to console/log, via
returnn.tf.util.basic.py_print()
.- Parameters:
summarize (int|None) – passed to
py_print()
extra_print_args (list|tuple)
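A minimal sketch of using this layer in a network dict (assuming "print" is the registered layer class name; the layer names are illustrative):
"debug_print_encoder": {"class": "print", "from": "encoder", "summarize": 20}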
- class returnn.tf.layers.basic.HDFDumpLayer(filename, extra=None, dump_whole_batches=False, labels=None, extend_existing_file=False, dump_per_run=False, **kwargs)[source]¶
Dumps into HDF file, compatible to
HDFDataset
.The HDF will be written to disk under the specified filename, if there was no error, by default at graph reset, via
TFNetwork.register_graph_reset_callback()
. Or, with dump_per_run, after the dataset iteration run loop, via TFNetwork.register_run_finished_callback()
.Common usage would be to add this to your network with “is_output_layer”: True, such that you don’t need to make other layers depend on it (see the sketch after the parameter list below).
It currently uses
SimpleHDFWriter
internally.- Parameters:
filename (str|(()->str))
extra (None|dict[str,LayerBase])
dump_whole_batches (bool) – dumps the whole batch as a single sequence into the HDF
labels (list[str]|None)
extend_existing_file (bool) – True also means we expect that it exists
dump_per_run (bool) – write via
TFNetwork.register_run_finished_callback()
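A minimal sketch of the common usage described above (assuming "hdf_dump" is the registered layer class name; layer and file names are illustrative):
"dump_encoder": {
    "class": "hdf_dump", "from": "encoder",
    "filename": "encoder_outputs.hdf",
    "is_output_layer": True,  # so no other layer needs to depend on it
}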
- classmethod get_out_data_from_opts(name, sources, **kwargs)[source]¶
- Parameters:
name (str)
sources (list[LayerBase])
- Return type:
Data
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str]) – will modify inplace
network (returnn.tf.network.TFNetwork)
get_layer (((str) -> LayerBase)) – function to get or construct another layer
- class returnn.tf.layers.basic.ImageSummaryLayer(max_outputs=3, **kwargs)[source]¶
Creates image summaries which can be viewed in TensorBoard. This layer expects the source to have the shape (T-decoder, T-encoder, B, 1).
- Parameters:
max_outputs – number of images to generate per step
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str]) – will modify inplace, the loss_opts
network (returnn.tf.network.TFNetwork)
get_layer (((str) -> LayerBase)) – function to get or construct another layer
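A minimal sketch, e.g. to inspect attention weights of shape (T-decoder, T-encoder, B, 1) in TensorBoard (assuming "image_summary" is the registered layer class name; the source layer name is illustrative):
"att_weights_summary": {"class": "image_summary", "from": "att_weights",
                        "max_outputs": 3,
                        "is_output_layer": True}  # force construction, nothing else depends on it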
- class returnn.tf.layers.basic.CrossEntropyLoss(input_type='prob', focal_loss_factor=0.0, label_smoothing=0.0, label_smoothing_gaussian=False, debug_dump=False, safe_log_opts=None, use_fused=True, fake_upper_bound=None, **kwargs)[source]¶
Cross-Entropy loss. Basically -sum(target * log(output)).
- Parameters:
input_type (str) – “prob” (default) or “logits”
focal_loss_factor (float) – see https://arxiv.org/abs/1708.02002. 0 means disabled
label_smoothing (float) – 0.1 is a common default. see
returnn.tf.util.basic.smoothing_cross_entropy()
label_smoothing_gaussian (bool) – see
returnn.tf.util.basic.smoothing_cross_entropy()
debug_dump (bool)
safe_log_opts (dict[str]) – passed to
safe_log()
use_fused (bool) – if possible, use fused opts
fake_upper_bound (float|None) – uses
returnn.tf.util.basic.minimum_with_identity_grad()
. I.e. you will see a finite loss, but we use the original gradient (which should be safe).
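A minimal sketch of attaching this loss to an output layer (assuming "ce" is the registered loss name; layer and data-key names are illustrative):
"output": {"class": "softmax", "from": "encoder", "target": "classes",
           "loss": "ce",
           "loss_opts": {"label_smoothing": 0.1}}  # see label_smoothing above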
- class returnn.tf.layers.basic.BinaryCrossEntropyLoss(pos_weight=None, **kwargs)[source]¶
Binary cross entropy. We expect the output as logits, not in probability space! Per frame: -mean(target * log(sigmoid(output)) + (1 - target) * log(1 - sigmoid(output)))
- Parameters:
pos_weight (float|None) – weight of positive labels, see tf.nn.weighted_cross_entropy_with_logits.
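Since the output is expected as logits, a typical setup uses a linear layer without activation (a minimal sketch, assuming "bin_ce" is the registered loss name; layer and data-key names are illustrative):
"output": {"class": "linear", "activation": None, "n_out": 1, "from": "encoder",
           "target": "labels", "loss": "bin_ce"}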
- class returnn.tf.layers.basic.GenericCELoss(**kwargs)[source]¶
Some generalization of cross entropy.
- Parameters:
base_network (returnn.tf.network.TFNetwork)
use_flatten_frames (bool) – will use
returnn.tf.util.basic.flatten_with_seq_len_mask()
use_normalized_loss (bool) – the loss used in optimization will be normalized
custom_norm_factor (float|function|None) – The standard norm factor is 1/sum(target_seq_len) if the target has a time-axis, or 1/sum(output_seq_len) if there is no target and the output has a time-axis, or 1 otherwise. (See
Loss.init()
for details.) This is used for proper normalization of accumulated loss/error per epoch and also proper normalization per batch for reporting, no matter if use_normalized_loss is True or False. If you want to change this norm factor, you can set this. As a function, it takes (self=self, output=output, layer=layer) and returns a float scalar.custom_inv_norm_factor (LayerBase|None) – inverse of custom_norm_factor. Here we allow to pass a layer. Here we also allow to pass any shape and it will automatically be reduced via sum. So you could simply pass target_seq_len directly here. Basically, for all reporting, it uses sum(loss) * sum(custom_inv_norm_factor).
scale (float) – additional scale factor for the loss
_check_output_before_softmax (bool|None)
- class returnn.tf.layers.basic.CtcLoss(target_collapse_repeated=False, auto_clip_target_len=False, output_in_log_space=False, beam_width=100, ctc_opts=None, use_native=False, use_viterbi=False, **kwargs)[source]¶
Connectionist Temporal Classification (CTC) loss. Basically a wrapper around tf.nn.ctc_loss.
- Parameters:
target_collapse_repeated (bool) – like preprocess_collapse_repeated option for CTC. used for sparse_labels().
auto_clip_target_len (bool) – see self._get_target_sparse_labels().
output_in_log_space (bool) – False -> output expected in prob space. see self.get_output_logits
beam_width (int) – used in eval
ctc_opts (dict[str]|None) – other kwargs used for tf.nn.ctc_loss
use_native (bool) – use our native implementation (
TFNativeOp.ctc_loss()
)use_viterbi (bool) – instead of full-sum, use only best path (via
ctc_loss_viterbi()
)
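A minimal sketch (assuming "ctc" is the registered loss name; layer and data-key names are illustrative):
"output": {"class": "softmax", "from": "encoder", "target": "classes",
           "loss": "ctc",
           "loss_opts": {"beam_width": 100, "use_native": True}}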
- class returnn.tf.layers.basic.EditDistanceLoss(debug_print=False, label_map=None, ctc_decode=False, output_in_log_space=False, **kwargs)[source]¶
Note that this loss is not differentiable, thus it’s only for keeping statistics.
- Parameters:
debug_print (bool) – will tf.Print the sequence
label_map (dict[int,int]|None) – before calculating the edit-distance, will apply this map
ctc_decode (bool) – True -> expects dense output and does CTC decode, False -> expects sparse labels in output
output_in_log_space (bool) – False -> dense output expected in prob space. see self.get_output_logits
- init(output, output_with_activation=None, target=None, **kwargs)[source]¶
- Parameters:
output (Data) – generated output
output_with_activation (OutputWithActivation|None)
target (Data) – reference target from dataset
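Since this loss is not differentiable, it is typically attached to a layer that produces sparse label sequences, purely for reporting. A minimal sketch, assuming "edit_distance" is the registered loss name and a "decide" layer (from the rec layer family) is used after beam search; all names are illustrative:
"decision": {"class": "decide", "from": "output", "target": "classes",
             "loss": "edit_distance",
             "only_on_search": True}  # reported only, does not contribute a gradient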
- class returnn.tf.layers.basic.BleuLoss(**kwargs)[source]¶
Note that this loss is not differentiable, thus it’s only for keeping statistics. Also, BLEU is a score, i.e. the higher, the better. Thus, to interpret it as a loss or error, we take the negative value.
- Parameters:
base_network (returnn.tf.network.TFNetwork)
use_flatten_frames (bool) – will use
returnn.tf.util.basic.flatten_with_seq_len_mask()
use_normalized_loss (bool) – the loss used in optimization will be normalized
custom_norm_factor (float|function|None) – The standard norm factor is 1/sum(target_seq_len) if the target has a time-axis, or 1/sum(output_seq_len) if there is no target and the output has a time-axis, or 1 otherwise. (See
Loss.init()
for details.) This is used for proper normalization of accumulated loss/error per epoch and also proper normalization per batch for reporting, no matter if use_normalized_loss is True or False. If you want to change this norm factor, you can set this. As a function, it takes (self=self, output=output, layer=layer) and returns a float scalar.custom_inv_norm_factor (LayerBase|None) – inverse of custom_norm_factor. Here we allow to pass a layer. Here we also allow to pass any shape and it will automatically be reduced via sum. So you could simply pass target_seq_len directly here. Basically, for all reporting, it uses sum(loss) * sum(custom_inv_norm_factor).
scale (float) – additional scale factor for the loss
_check_output_before_softmax (bool|None)
- init(output, output_with_activation=None, target=None, **kwargs)[source]¶
- Parameters:
output (Data) – generated output
output_with_activation (OutputWithActivation|None)
target (Data) – reference target from dataset
- class returnn.tf.layers.basic.ExpectedLoss(loss, loss_kind, norm_scores=True, norm_scores_stop_gradient=True, divide_beam_size=True, subtract_average_loss=True, loss_correction_grad_only=False, **kwargs)[source]¶
This loss uses another loss's error or value and, given the search beam scores, calculates the expected loss. Sometimes also called minimum Bayes risk.
- Parameters:
loss (Loss)
loss_kind (str) – “error” or “value”. whether to use loss.get_error() or loss.get_value()
norm_scores (bool)
norm_scores_stop_gradient (bool)
divide_beam_size (bool)
subtract_average_loss (bool)
loss_correction_grad_only (bool)
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer
- class returnn.tf.layers.basic.DeepClusteringLoss(embedding_dimension, nr_of_sources, **kwargs)[source]¶
Cost function used for deep clustering as described in [Hershey & Chen+, 2016]: “Deep clustering: Discriminative embeddings for segmentation and separation”
- Parameters:
embedding_dimension (int)
nr_of_sources (int)
- class returnn.tf.layers.basic.L1Loss(base_network, use_flatten_frames=True, use_normalized_loss=False, custom_norm_factor=None, custom_inv_norm_factor=None, scale=1.0, _check_output_before_softmax=None)[source]¶
L1-distance loss. Basically sum(abs(target - output)).
- Parameters:
base_network (returnn.tf.network.TFNetwork)
use_flatten_frames (bool) – will use
returnn.tf.util.basic.flatten_with_seq_len_mask()
use_normalized_loss (bool) – the loss used in optimization will be normalized
custom_norm_factor (float|function|None) – The standard norm factor is 1/sum(target_seq_len) if the target has a time-axis, or 1/sum(output_seq_len) if there is no target and the output has a time-axis, or 1 otherwise. (See
Loss.init()
for details.) This is used for proper normalization of accumulated loss/error per epoch and also proper normalization per batch for reporting, no matter if use_normalized_loss is True or False. If you want to change this norm factor, you can set this. As a function, it takes (self=self, output=output, layer=layer) and returns a float scalar.custom_inv_norm_factor (LayerBase|None) – inverse of custom_norm_factor. Here we allow to pass a layer. Here we also allow to pass any shape and it will automatically be reduced via sum. So you could simply pass target_seq_len directly here. Basically, for all reporting, it uses sum(loss) * sum(custom_inv_norm_factor).
scale (float) – additional scale factor for the loss
_check_output_before_softmax (bool|None)
- class returnn.tf.layers.basic.MeanSquaredError(base_network, use_flatten_frames=True, use_normalized_loss=False, custom_norm_factor=None, custom_inv_norm_factor=None, scale=1.0, _check_output_before_softmax=None)[source]¶
The generic mean squared error loss function
- Parameters:
base_network (returnn.tf.network.TFNetwork)
use_flatten_frames (bool) – will use
returnn.tf.util.basic.flatten_with_seq_len_mask()
use_normalized_loss (bool) – the loss used in optimization will be normalized
custom_norm_factor (float|function|None) – The standard norm factor is 1/sum(target_seq_len) if the target has a time-axis, or 1/sum(output_seq_len) if there is no target and the output has a time-axis, or 1 otherwise. (See
Loss.init()
for details.) This is used for proper normalization of accumulated loss/error per epoch and also proper normalization per batch for reporting, no matter if use_normalized_loss is True or False. If you want to change this norm factor, you can set this. As a function, it takes (self=self, output=output, layer=layer) and returns a float scalar.custom_inv_norm_factor (LayerBase|None) – inverse of custom_norm_factor. Here we allow to pass a layer. Here we also allow to pass any shape and it will automatically be reduced via sum. So you could simply pass target_seq_len directly here. Basically, for all reporting, it uses sum(loss) * sum(custom_inv_norm_factor).
scale (float) – additional scale factor for the loss
_check_output_before_softmax (bool|None)
- class returnn.tf.layers.basic.MeanL1Loss(base_network, use_flatten_frames=True, use_normalized_loss=False, custom_norm_factor=None, custom_inv_norm_factor=None, scale=1.0, _check_output_before_softmax=None)[source]¶
Like MSE loss, but with absolute difference
- Parameters:
base_network (returnn.tf.network.TFNetwork)
use_flatten_frames (bool) – will use
returnn.tf.util.basic.flatten_with_seq_len_mask()
use_normalized_loss (bool) – the loss used in optimization will be normalized
custom_norm_factor (float|function|None) – The standard norm factor is 1/sum(target_seq_len) if the target has a time-axis, or 1/sum(output_seq_len) if there is no target and the output has a time-axis, or 1 otherwise. (See
Loss.init()
for details.) This is used for proper normalization of accumulated loss/error per epoch and also proper normalization per batch for reporting, no matter if use_normalized_loss is True or False. If you want to change this norm factor, you can set this. As a function, it takes (self=self, output=output, layer=layer) and returns a float scalar.custom_inv_norm_factor (LayerBase|None) – inverse of custom_norm_factor. Here we allow to pass a layer. Here we also allow to pass any shape and it will automatically be reduced via sum. So you could simply pass target_seq_len directly here. Basically, for all reporting, it uses sum(loss) * sum(custom_inv_norm_factor).
scale (float) – additional scale factor for the loss
_check_output_before_softmax (bool|None)
- class returnn.tf.layers.basic.ExternSprintLoss(sprint_opts, **kwargs)[source]¶
The loss is calculated by an external Sprint instance.
- Parameters:
sprint_opts (dict[str])
- class returnn.tf.layers.basic.FastBaumWelchLoss(sprint_opts, tdp_scale=1.0, **kwargs)[source]¶
The loss is calculated via
fast_baum_welch()
. The automata are created by an external Sprint instance.- Parameters:
sprint_opts (dict[str])
- class returnn.tf.layers.basic.ViaLayerLoss(error_signal_layer=None, align_layer=None, loss_wrt_to_act_in=False, **kwargs)[source]¶
The loss error signal and loss value are defined as the output of another layer. That way, you can define any custom loss. This could e.g. be used together with the fast_bw layer. A sketch follows the parameter list below.
This is a more custom variant of
AsIsLoss
, which simply takes the output of a layer as loss without redefining the error signal (gradient).- Parameters:
error_signal_layer (LayerBase)
align_layer (LayerBase)
loss_wrt_to_act_in (bool|str) – if True, we expect that the given output_with_activation is set, and the given error signal is w.r.t. the input of the specific activation function. A common example is the input to the softmax function, where the gradient is much more stable to define, e.g. y - z instead of y/z for cross entropy. If you specify a str, e.g. “softmax” or “log_softmax”, there is an additional check that the used activation function is really that one.
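A minimal sketch (assuming "via_layer" is the registered loss name; the layer providing the alignment/error signal and all other names are illustrative):
"output_with_custom_loss": {
    "class": "copy", "from": "output",
    "loss": "via_layer",
    "loss_opts": {"align_layer": "fast_bw",           # layer providing the soft alignment
                  "loss_wrt_to_act_in": "softmax"},   # error signal w.r.t. the softmax input
    "is_output_layer": True,
}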
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str]) – will modify inplace, the loss_opts
network (returnn.tf.network.TFNetwork)
get_layer (((str) -> LayerBase)) – function to get or construct another layer
- class returnn.tf.layers.basic.AsIsLoss(as_error=False, **kwargs)[source]¶
Use the output as-is as the loss.
Also see
ViaLayerLoss
which additionally allows defining a custom error signal (gradient).- Parameters:
as_error (bool) – if True, use the output as error, otherwise (default) use the output as loss value. Error is purely for reporting, loss value is used for the optimizer as well (when scale != 0).
- class returnn.tf.layers.basic.SearchScoreLoss(base_network, use_flatten_frames=True, use_normalized_loss=False, custom_norm_factor=None, custom_inv_norm_factor=None, scale=1.0, _check_output_before_softmax=None)[source]¶
Use the scores from
SearchChoices
.- Parameters:
base_network (returnn.tf.network.TFNetwork)
use_flatten_frames (bool) – will use
returnn.tf.util.basic.flatten_with_seq_len_mask()
use_normalized_loss (bool) – the loss used in optimization will be normalized
custom_norm_factor (float|function|None) – The standard norm factor is 1/sum(target_seq_len) if the target has a time-axis, or 1/sum(output_seq_len) if there is no target and the output has a time-axis, or 1 otherwise. (See
Loss.init()
for details.) This is used for proper normalization of accumulated loss/error per epoch and also proper normalization per batch for reporting, no matter if use_normalized_loss is True or False. If you want to change this norm factor, you can set this. As a function, it takes (self=self, output=output, layer=layer) and returns a float scalar.custom_inv_norm_factor (LayerBase|None) – inverse of custom_norm_factor. Here we allow to pass a layer. Here we also allow to pass any shape and it will automatically be reduced via sum. So you could simply pass target_seq_len directly here. Basically, for all reporting, it uses sum(loss) * sum(custom_inv_norm_factor).
scale (float) – additional scale factor for the loss
_check_output_before_softmax (bool|None)
- class returnn.tf.layers.basic.SamplingBasedLoss(num_sampled=128, num_splits=1, sampler='log_uniform', nce_loss=False, use_full_softmax=False, remove_accidental_hits=None, sampler_args=None, nce_log_norm_term=0.0, **kwargs)[source]¶
Implements two sampling-based losses: sampled softmax (default) and noise contrastive estimation. https://www.tensorflow.org/api_docs/python/tf/nn/sampled_softmax_loss. https://www.tensorflow.org/api_docs/python/tf/nn/nce_loss.
Must be used in an output linear layer with a weight matrix of shape (num_classes, dim). When using ‘log_uniform’ sampler (default), optimal performance is typically achieved with the vocabulary list sorted in decreasing order of frequency (https://www.tensorflow.org/api_docs/python/tf/random/log_uniform_candidate_sampler).
- Parameters:
num_sampled (int) – Number of classes to be sampled. For sampled softmax, this is the number of classes to be used to estimate the sampled softmax. For noise contrastive estimation, this is the number of noise samples.
num_splits (int) – Number of different samples (each with ‘num_sampled’ classes) to be used per batch.
sampler (str) – Specify sampling distribution (“uniform”, “log_uniform”, “learned_unigram” or “fixed_unigram”).
nce_loss (bool) – If True, use noise contrastive estimation loss. Else (default), use the sampled softmax.
use_full_softmax (bool) – If True, compute the full softmax instead of sampling (can be used for evaluation).
remove_accidental_hits (bool|None) – If True, remove sampled classes that equal one of the target classes. If not specified (None), the value is determined based on the chosen objective. For sampled softmax this should be set to True; for NCE the default is False. Set this to True for NCE training if the objective is equal to the sampled logistic loss.
sampler_args (dict[str]) – additional arguments for the candidate sampler. This is most relevant to the fixed_unigram sampler. See https://www.tensorflow.org/api_docs/python/tf/random/fixed_unigram_candidate_sampler for details.
nce_log_norm_term (float) – The logarithm of the constant normalization term for NCE.
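A minimal sketch of a sampled-softmax output layer (assuming "sampling_loss" is the registered loss name; layer names and the vocabulary size are illustrative; note the weight-matrix shape requirement mentioned above):
"output": {"class": "linear", "activation": None, "n_out": 100000, "from": "lstm",
           "target": "classes",
           "loss": "sampling_loss",
           "loss_opts": {"num_sampled": 8192,
                         "sampler": "log_uniform",  # works best with a frequency-sorted vocabulary
                         "nce_loss": False}}        # False -> sampled softmax (default)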
- class returnn.tf.layers.basic.TripletLoss(margin, multi_view_training=False, **kwargs)[source]¶
Triplet loss: loss = max(margin + d(x_a, x_s) - d(x_a, x_d), 0.0).
Triplet loss is used for metric learning in a siamese/triplet network. It should be used as part of a CopyLayer with 3 inputs corresponding to x_a, x_s and x_d.
Here we assume that x_a are anchor samples and x_s are samples where, at each position i in a minibatch, x_ai and x_si belong to the same class, while pairs x_ai and x_di belong to different classes.
In this implementation the number of training examples is increased by extracting all possible same/different pairs within a minibatch.
- Parameters:
base_network (returnn.tf.network.TFNetwork)
use_flatten_frames (bool) – will use
returnn.tf.util.basic.flatten_with_seq_len_mask()
use_normalized_loss (bool) – the loss used in optimization will be normalized
custom_norm_factor (float|function|None) – The standard norm factor is 1/sum(target_seq_len) if the target has a time-axis, or 1/sum(output_seq_len) if there is no target and the output has a time-axis, or 1 otherwise. (See
Loss.init()
for details.) This is used for proper normalization of accumulated loss/error per epoch and also proper normalization per batch for reporting, no matter if use_normalized_loss is True or False. If you want to change this norm factor, you can set this. As a function, it takes (self=self, output=output, layer=layer) and returns a float scalar.custom_inv_norm_factor (LayerBase|None) – inverse of custom_norm_factor. Here we allow to pass a layer. Here we also allow to pass any shape and it will automatically be reduced via sum. So you could simply pass target_seq_len directly here. Basically, for all reporting, it uses sum(loss) * sum(custom_inv_norm_factor).
scale (float) – additional scale factor for the loss
_check_output_before_softmax (bool|None)
- init(output, output_with_activation=None, target=None, **kwargs)[source]¶
- Parameters:
output (Data) – generated output
output_with_activation (OutputWithActivation|None)
target (Data) – reference target from dataset
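A minimal sketch following the description above, with a CopyLayer over the three encodings (assuming "triplet_loss" is the registered loss name; layer names are illustrative):
"triplet": {"class": "copy", "from": ["anchor_enc", "same_enc", "diff_enc"],
            "loss": "triplet_loss",
            "loss_opts": {"margin": 0.2},
            "is_output_layer": True}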
- returnn.tf.layers.basic.auto_register_layer_classes(vars_values)[source]¶
Example usage:
from returnn.tf.layers.basic import auto_register_layer_classes
auto_register_layer_classes('extern_private/your_stuff/CoolThingy.py')
- Parameters:
vars_values (list|types.ModuleType|str) – e.g. use list(globals().values()). str is considered as a module-filename
- Returns:
nothing
- returnn.tf.layers.basic.register_layer_class(layer_class)[source]¶
Registers a layer class such that it can be used in network construction.
- Parameters:
layer_class (type[LayerBase])
- Returns:
nothing
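A minimal sketch of defining and registering a custom layer (the class, its option "scale", and the registered name "my_scale" are illustrative, not part of the API):
from returnn.tf.layers.base import LayerBase
from returnn.tf.layers.basic import get_concat_sources_data_template, register_layer_class

class MyScaleLayer(LayerBase):
    """Illustrative custom layer: scales its (single) input by a constant factor."""
    layer_class = "my_scale"  # the name to use as "class" in the network dict

    def __init__(self, scale=2.0, **kwargs):
        super(MyScaleLayer, self).__init__(**kwargs)
        self.output.placeholder = self.sources[0].output.placeholder * scale

    @classmethod
    def get_out_data_from_opts(cls, name, sources, **kwargs):
        # same shape/dtype as the (concatenated) input, just a template without a tensor
        return get_concat_sources_data_template(sources, name="%s_output" % name)

register_layer_class(MyScaleLayer)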