Recurrent Layers

Choice Layer

class returnn.tf.layers.rec.ChoiceLayer(beam_size, keep_beams=False, search=<class 'returnn.util.basic.NotSpecified'>, add_to_beam_scores=<class 'returnn.util.basic.NotSpecified'>, input_type='prob', prob_scale=1.0, base_beam_score_scale=1.0, random_sample_scale=0.0, length_normalization=True, length_normalization_exponent=1.0, custom_score_combine=None, source_beam_sizes=None, scheduled_sampling=False, cheating=False, explicit_search_sources=None, **kwargs)[source]

This layer represents a choice to be made in search during inference, such as choosing the top-k outputs from a log-softmax for beam search. During training, this layer can return the true label. This is supposed to be used inside the rec layer. This can be extended in various ways.

The scores are represented in +log space and are accumulated along the search path. Assume that we get input of shape (batch,dim) from a (log-)softmax, and that each batch entry is already a choice from a previous search step. With a beam size of N during search, the output is sparse with shape (batch=N,), together with a score for each beam entry.

In case of multiple sources, this layer computes the top-k combinations of choices. The score of such a combination is the sum of the (log-space) scores of the choices for the individual sources. In this case, the ‘target’ parameter of the layer has to be set to a list of targets corresponding to the sources respectively. Because computing all possible combinations of source scores is costly, the sources are pruned beforehand using the beam sizes set by the ‘source_beam_sizes’ parameter. The choices made for the different sources can be accessed via the sublayers ‘<choice layer name>/out_0’, ‘<choice layer name>/out_1’ and so on. Note that the way scores are combined assumes the sources are independent. If you want to model a dependency, use separate ChoiceLayers and let the input of one depend on the output of the other.
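
Example for a simple decoder with beam search (a minimal sketch; the layer names, the target name "classes", the beam size and initial_output are placeholders chosen for illustration):

"output": {"class": "rec", "from": [], "target": "classes", "unit": {
    # per-step output distribution over the target labels:
    "output_prob": {"class": "softmax", "from": "prev:output", "target": "classes"},
    # pick the ground-truth label in training, or the top beam_size hypotheses in search:
    "output": {"class": "choice", "from": "output_prob", "input_type": "prob",
               "beam_size": 12, "target": "classes", "initial_output": 0},
    # end the sequence when the (assumed) end-of-sequence label 0 is chosen:
    "end": {"class": "compare", "from": "output", "value": 0, "kind": "equal"},
}}

During training (no search), the choice layer returns the ground-truth labels of the target; during search, it keeps the beam_size best hypotheses per step.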

Parameters:
  • beam_size (int) – the outgoing beam size. i.e. our output will be (batch * beam_size, …)

  • keep_beams (bool) – specifies that we keep all beam_in entries, i.e. we only expand the beam, searching over the dim for each incoming beam entry. beam_size must be a multiple of beam_in.

  • search (NotSpecified|bool) – whether to perform search, or use the ground truth (target option). If not specified, it will depend on network.search_flag.

  • add_to_beam_scores (NotSpecified|bool) – whether to add the scores to the beam scores. With search enabled, this is always done (disabling it is not supported). Without search, we can still add the scores of the ground-truth labels to the beam. By default, this is derived from search or network.search_flag, i.e. with the network search flag enabled, the scores are added even when search is disabled here.

  • input_type (str) – “prob”, “log_prob” or “logits”, whether the input is in probability space, log-space, etc. or “regression”, if it is a prediction of the data as-is. If there are several inputs, same format for all is assumed.

  • prob_scale (float) – factor for prob (score in +log space from source)

  • base_beam_score_scale (float) – factor for beam base score (i.e. prev prob scores)

  • random_sample_scale (float) – if >0, will add Gumbel scores. you might want to set base_beam_score_scale=0

  • length_normalization (bool) – during search, normalizes the scores by the sequence length (score_t/len)

  • source_beam_sizes (list[int]|None) – If there are several sources, they are pruned with these beam sizes before combination. If None, ‘beam_size’ is used for all sources. Has to have same length as number of sources.

  • scheduled_sampling (dict|None)

  • cheating (bool|str) – if True, will always add the true target to the beam. If “exclusive”, enables cheating_exclusive. See returnn.tf.util.basic.beam_search().

  • explicit_search_sources (list[LayerBase]|None) – will mark it as an additional dependency. You might use these also in custom_score_combine.

  • custom_score_combine (callable|None)

layer_class: Optional[str] = 'choice'[source]
search_choices: Optional[SearchChoices][source]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace

  • network (returnn.tf.network.TFNetwork)

  • get_layer (((str) -> LayerBase)) – function to get or construct another layer

classmethod get_out_data_from_opts(name, sources, target, network, beam_size, search=<class 'returnn.util.basic.NotSpecified'>, scheduled_sampling=False, cheating=False, **kwargs)[source]
Parameters:
Return type:

Data

get_sub_layer(layer_name)[source]

Used to get outputs in case of multiple targets. For all targets we create a sub-layer that can be referred to as “self.name + ‘/out_’ + index” (e.g. output/out_0). These sub-layers can then be used as input to other layers, e.g. “output_0”: {“class”: “copy”, “from”: [“output/out_0”]}.

Parameters:

layer_name (str) – name of the sub_layer (e.g. ‘out_0’)

Returns:

internal layer that outputs labels for the target corresponding to layer_name

Return type:

InternalLayer|None

classmethod get_available_sub_layer_names(parent_layer_kwargs)[source]
Parameters:

parent_layer_kwargs (dict[str])

Return type:

list[str]

classmethod get_sub_layer_out_data_from_opts(layer_name, parent_layer_kwargs)[source]
Parameters:
  • layer_name (str) – name of the sub_layer (e.g. ‘out_0’), see self.get_sub_layer()

  • parent_layer_kwargs (dict[str]) – kwargs for the parent layer

Returns:

Data template, class type of sub-layer, layer opts (transformed)

Return type:

(Data, type, dict[str])|None

get_dep_layers()[source]
Return type:

list[LayerBase]

classmethod get_rec_initial_output(batch_dim, name, output, rec_layer, initial_output=None, **kwargs)[source]
Parameters:
  • batch_dim (tf.Tensor) – including beam size in beam search

  • name (str) – layer name

  • output (Data) – template

  • rec_layer (returnn.tf.layers.rec.RecLayer)

  • initial_output (str|float|int|tf.Tensor|None)

Return type:

tf.Tensor

post_process_final_rec_vars_outputs(rec_vars_outputs, seq_len)[source]
Parameters:
  • rec_vars_outputs (dict[str,tf.Tensor])

  • seq_len (tf.Tensor) – shape (batch,)

Return type:

dict[str,tf.Tensor]

kwargs: Optional[Dict[str]][source]
output_before_activation: Optional[OutputWithActivation][source]
output_loss: Optional[tf.Tensor][source]
rec_vars_outputs: Dict[str, tf.Tensor][source]
params: Dict[str, tf.Variable][source]
saveable_param_replace: Dict[tf.Variable, Union['tensorflow.python.training.saver.BaseSaverBuilder.SaveableObject', None]][source]
stats: Dict[str, tf.Tensor][source]

Choice Get Beam Scores Layer

class returnn.tf.layers.rec.ChoiceGetBeamScoresLayer(**kwargs)[source]

Gets beam scores from SearchChoices. This requires that the source has search choices.

Note

This layer might be deprecated in the future.
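
Example usage (a minimal sketch; "output" is assumed to be a ChoiceLayer, i.e. a layer with search choices):

"beam_scores": {"class": "choice_get_beam_scores", "from": "output"}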

Usually the arguments, when specified in the network dict, go through transform_config_dict() before they are passed here. See TFNetwork.construct_from_dict().

Parameters:
  • name (str)

  • network (returnn.tf.network.TFNetwork)

  • output (Data) – Set a specific output instead of using get_out_data_from_opts()

  • n_out (NotSpecified|None|int) – output dim

  • out_dim (returnn.tensor.Dim|None) – output feature dim tag

  • out_type (dict[str]) – kwargs for Data class. more explicit than n_out.

  • out_shape (set[returnn.tensor.Dim|returnn.tf.util.data._MarkedDim]|tuple|list|None) – verifies the output shape (dim tags). See Data.verify_out_shape().

  • sources (list[LayerBase]) – via self.transform_config_dict()

  • in_dim (returnn.tensor.Dim|None) – input feature dim tag

  • target (str|list[str]|None) – if some loss is set, this is the target data-key, i.e. network.extern_data.get_data(target). alternatively, this also can be a layer name.

  • _target_layers (dict[str,LayerBase]|None) – if target.startswith(“layer:”), then this is target -> layer

  • size_target (str|None) – like target but this is only used to set our output size in case of training

  • loss (Loss|None) – via transform_config_dict(). Every layer can have one loss (of type Loss), or no loss. In the net dict, it is specified as a string. In TFNetwork, all losses from all layers will be collected. That is what TFUpdater.Updater will use for training.

  • reuse_params (ReuseParams|None) – if given, will optionally reuse the params. See self.var_creation_scope(). See also the name_scope option as an alternative.

  • name_scope (str|None) – If set, uses this custom (relative) name scope. If it starts with a “/”, it will be the absolute name scope. It should not end with a “/”. It can be empty, in which case it will not consume a new name scope. This can also be used for parameter sharing. The default is the layer name in most cases, but this logic is in get_absolute_name_scope_prefix() and TFNetwork.layer_creation_scope().

  • param_device (str|None) – e.g. “CPU”, etc. any valid name for tf.device. see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/util/device_name_utils.h

  • L2 (float|None) – for constraints

  • darc1 (float|None) – for constraints. see Generalization in Deep Learning, https://arxiv.org/abs/1710.05468

  • spatial_smoothing (float|None) – see returnn.tf.util.basic.spatial_smoothing_energy()

  • param_variational_noise (float|None) – adds variational noise to the params during training

  • param_dropout (float|None) – dropout on params (weight dropout) during training

  • param_dropout_min_ndim (int|None) – if param dropout is enabled, only use it for params whose ndim >= this. E.g. it might make sense to disable it for bias params or scalars, so set param_dropout_min_ndim=2.

  • updater_opts (dict[str]|None) – accepts similar opts as TFUpdater, e.g. “optimizer”, “learning_rate”, …

  • is_output_layer (bool|None) – triggers the construction of this layer in the root net. Inside a RecLayer, it triggers the explicit accumulation of all frames. Also see the need_last option.

  • only_on_eval (bool) – if True, this layer will only be calculated in eval

  • only_on_search (bool) – if True, this layer will only be calculated when search is done

  • copy_output_loss_from_source_idx (int|None) – if set, will copy output_loss from this source

  • batch_norm (bool|dict) – see self.batch_norm()

  • initial_output (str|float) – used for recurrent layer, see self.get_rec_initial_output()

  • state – explicitly defines the rec state. initial_state would define the initial state (in the first frame)

  • need_last (bool) – Inside RecLayer, make sure that we can access the last frame. Similar to is_output_layer, but this is specifically about the last frame, i.e. it does not trigger accumulation.

  • rec_previous_layer (LayerBase|None) – via the recurrent layer, layer (template) which represents the past of us. You would not explicitly set this in a config. This is automatically, internally, via RecLayer.

  • encapsulate (bool) – mostly relevant for SubnetworkLayer and similar: If True, all sub layers will be created and covered in functions like get_rec_initial_extra_outputs(), and the logic in cls_get_sub_network() will not be used. If False, the logic in cls_get_sub_network() will be used.

  • collocate_with (list[str]|None) – in the rec layer, collocate with the specified other layers

  • trainable (bool) – whether the parameters of this layer will be trained. Default is True. However, if this is inside a subnetwork, all the parent layers must be set to trainable, otherwise the parameters will not be trainable.

  • custom_param_importer (str|callable|None) – used by set_param_values_by_dict()

  • register_as_extern_data (str|None) – registers output in network.extern_data

  • control_dependencies_on_output (None|((LayerBase)->list[tf.Operation])) – This is mostly to perform some checks after the layer output has been computed, before the layer output is used anywhere else. There is also the IdentityLayer with the option control_dependencies.

  • debug_print_layer_output (None|bool|dict[str]) – same as global config option but per layer

  • _name (str) – just for internal construction, should be the same as name

  • _network (returnn.tf.network.TFNetwork) – just for internal construction, should be the same as network

  • _src_common_search_choices (None|SearchChoices) – set via SearchChoices.translate_to_common_search_beam()

layer_class: Optional[str] = 'choice_get_beam_scores'[source]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace

  • network (returnn.tf.network.TFNetwork)

  • get_layer (((str) -> LayerBase)) – function to get or construct another layer

classmethod get_out_data_from_opts(name, sources, **kwargs)[source]
Parameters:
Return type:

Data

kwargs: Optional[Dict[str]][source]
output_before_activation: Optional[OutputWithActivation][source]
output_loss: Optional[tf.Tensor][source]
rec_vars_outputs: Dict[str, tf.Tensor][source]
search_choices: Optional[SearchChoices][source]
params: Dict[str, tf.Variable][source]
saveable_param_replace: Dict[tf.Variable, Union['tensorflow.python.training.saver.BaseSaverBuilder.SaveableObject', None]][source]
stats: Dict[str, tf.Tensor][source]

Decide Layer

class returnn.tf.layers.rec.DecideLayer(length_normalization=False, **kwargs)[source]

This is kind of the counter-part to the choice layer. It only has an effect in search mode. E.g. assume that the input is of shape (batch * beam, time, dim) and has search_sources set. Then this will output (batch, time, dim), where the beam with the highest score is selected. Thus, it makes a decision based on the scores. It will also convert the data to batch-major mode.
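
Example usage (a minimal sketch; "output" is assumed to be a rec layer containing a ChoiceLayer, and the loss/target settings are placeholders for illustration):

"decision": {"class": "decide", "from": "output", "loss": "edit_distance", "target": "classes"}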

Parameters:

length_normalization (bool) – performed on the beam scores

layer_class: Optional[str] = 'decide'[source]
search_choices: Optional[SearchChoices][source]
classmethod cls_get_search_beam_size(sources, **kwargs)[source]
Parameters:

sources (list[LayerBase])

Return type:

int|None

classmethod decide(src, output=None, owner=None, name=None, length_normalization=False)[source]
Parameters:
  • src (LayerBase) – with search_choices set. e.g. input of shape (batch * beam, time, dim)

  • output (Data|None)

  • owner (LayerBase|None)

  • name (str|None)

  • length_normalization (bool) – performed on the beam scores

Returns:

best beam selected from input, e.g. shape (batch, time, dim)

Return type:

(Data, SearchChoices|None)

classmethod get_out_data_from_opts(name, sources, network, **kwargs)[source]
Parameters:
Return type:

Data

kwargs: Optional[Dict[str]][source]
output_before_activation: Optional[OutputWithActivation][source]
output_loss: Optional[tf.Tensor][source]
rec_vars_outputs: Dict[str, tf.Tensor][source]
params: Dict[str, tf.Variable][source]
saveable_param_replace: Dict[tf.Variable, Union['tensorflow.python.training.saver.BaseSaverBuilder.SaveableObject', None]][source]
stats: Dict[str, tf.Tensor][source]

Get Accumulated Output Layer

class returnn.tf.layers.rec.GetRecAccumulatedOutputLayer(sub_layer, **kwargs)[source]

For RecLayer with a subnet. If some layer is explicitly marked as an additional output layer (via ‘is_output_layer’: True), you can get that subnet layer output via this accessor. Retrieves the accumulated output.

Note that this functionality is obsolete now. You can simply access such a sub layer via the generic sub-layer access mechanism. I.e. instead of:

"sub_layer": {"class": "get_rec_accumulated", "from": "rec_layer", "sub_layer": "hidden"}

You can do:

"sub_layer": {"class": "copy", "from": "rec_layer/hidden"}
Parameters:

sub_layer (str) – layer of subnet in RecLayer source, which has ‘is_output_layer’: True

layer_class: Optional[str] = 'get_rec_accumulated'[source]
classmethod get_out_data_from_opts(name, sources, sub_layer, **kwargs)[source]
Parameters:
  • name (str)

  • sources (list[LayerBase])

  • sub_layer (str)

Return type:

Data

kwargs: Optional[Dict[str]][source]
output_before_activation: Optional[OutputWithActivation][source]
output_loss: Optional[tf.Tensor][source]
rec_vars_outputs: Dict[str, tf.Tensor][source]
search_choices: Optional[SearchChoices][source]
params: Dict[str, tf.Variable][source]
saveable_param_replace: Dict[tf.Variable, Union['tensorflow.python.training.saver.BaseSaverBuilder.SaveableObject', None]][source]
stats: Dict[str, tf.Tensor][source]

Get Last Hidden State Layer

class returnn.tf.layers.rec.GetLastHiddenStateLayer(out_dim=None, n_out=None, combine='concat', key='*', **kwargs)[source]

Will combine (e.g. concatenate or add) the last hidden states from all sources.
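
Example usage (a minimal sketch; the source layer names lstm0_fwd/lstm0_bwd and the dimension are placeholders):

"encoder_state": {"class": "get_last_hidden_state", "from": ["lstm0_fwd", "lstm0_bwd"],
                  "combine": "concat", "n_out": 1024}

The result could then be used e.g. as the initial_state of a decoder RnnCellLayer.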

Parameters:
  • out_dim (Dim|None)

  • n_out (int|None) – dimension. output will be of shape (batch, n_out)

  • combine (str) – “concat” or “add”

  • key (str|int|None) – for the state, which could be a namedtuple. see RnnCellLayer.get_state_by_key()

layer_class: Optional[str] = 'get_last_hidden_state'[source]
get_last_hidden_state(key)[source]
Parameters:

key (str|None)

Return type:

tf.Tensor

classmethod get_out_data_from_opts(name, sources, out_dim=None, n_out=None, **kwargs)[source]
Parameters:
  • name (str)

  • sources (list[LayerBase])

  • out_dim (Dim|None)

  • n_out (int|None) – dimension. output will be of shape (batch, n_out)

Return type:

Data

kwargs: Optional[Dict[str]][source]
output_before_activation: Optional[OutputWithActivation][source]
output_loss: Optional[tf.Tensor][source]
rec_vars_outputs: Dict[str, tf.Tensor][source]
search_choices: Optional[SearchChoices][source]
params: Dict[str, tf.Variable][source]
saveable_param_replace: Dict[tf.Variable, Union['tensorflow.python.training.saver.BaseSaverBuilder.SaveableObject', None]][source]
stats: Dict[str, tf.Tensor][source]

Masked Computation Layer

class returnn.tf.layers.rec.MaskedComputationLayer(mask, unit, masked_from, _layer_class, _layer_desc, in_spatial_dim=None, out_spatial_dim=None, _queried_sub_layers=None, _parent_layer_cache=None, **kwargs)[source]

Given some input [B,T,D] and some mask [B,T] (True or False), we want to perform a computation only on the masked frames. I.e. let T’ be the max seq len of the masked seq, then the masked input would be [B,T’,D]. (This masked input sequence could be calculated via tf.boolean_mask or tf.gather_nd.) The output is [B,T’,D’], i.e. we do not undo the masking. You are supposed to use UnmaskLayer to undo the masking.

The computation also works within a rec layer, i.e. the input is just [B,D] and the mask is just [B]. In that case, if the mask is True, it will perform the computation as normal, and if it is False, it will just copy the previous output and hidden state.
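
Example usage (a minimal sketch; the layer names "input" and "mask" are placeholders, and it is assumed that inside the unit, the masked input is available as "data"):

"masked": {"class": "masked_computation", "from": "input", "mask": "mask",
           "unit": {"class": "linear", "activation": "relu", "n_out": 128, "from": "data"}},
"unmasked": {"class": "unmask", "from": "masked", "mask": "mask"},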

Parameters:
  • mask (LayerBase|None)

  • unit (dict[str])

  • masked_from (LayerBase|None)

  • in_spatial_dim (Dim|None)

  • out_spatial_dim (Dim|None) – the masked dim

  • _layer_class (type[LayerBase])

  • _layer_desc (dict[str])

  • _queried_sub_layers (dict[str,(Data,type,dict[str])]|None)

  • _parent_layer_cache (dict[str,LayerBase]|None)

layer_class: Optional[str] = 'masked_computation'[source]
recurrent = True[source]
rec_vars_outputs: Dict[str, tf.Tensor][source]
params: Dict[str, tf.Variable][source]
get_dep_layers()[source]
Return type:

list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace

  • network (returnn.tf.network.TFNetwork)

  • get_layer (((str) -> LayerBase)) – function to get or construct another layer

classmethod get_out_data_from_opts(network, masked_from=None, in_spatial_dim=None, out_spatial_dim=None, **kwargs)[source]
Parameters:
Return type:

Data

classmethod get_sub_layer_out_data_from_opts(layer_name, parent_layer_kwargs)[source]
Parameters:
  • layer_name (str) – name of the sub_layer (right part of ‘/’ separated path)

  • parent_layer_kwargs (dict[str]) – kwargs for the parent layer (as kwargs in cls.get_out_data_from_opts())

Returns:

Data template, class type of sub-layer, layer opts (transformed)

Return type:

(Data, type, dict[str])|None

get_sub_layer(layer_name)[source]
Parameters:

layer_name (str) – name of the sub_layer (right part of ‘/’ separated path)

Returns:

the sub_layer addressed in layer_name or None if no sub_layer exists

Return type:

LayerBase|None

classmethod get_available_sub_layer_names(parent_layer_kwargs)[source]
Parameters:

parent_layer_kwargs (dict[str])

Return type:

list[str]

get_constraints_value()[source]
Return type:

tf.Tensor|None

classmethod get_losses(name, network, output, loss=None, reduce_func=None, layer=None, **kwargs)[source]
Parameters:
  • name (str) – layer name

  • network (returnn.tf.network.TFNetwork)

  • loss (Loss|None) – argument just as for __init__

  • output (Data) – the output (template) for the layer

  • layer (LayerBase|None)

  • reduce_func (((tf.Tensor)->tf.Tensor)|None)

  • kwargs – other layer kwargs

Return type:

list[returnn.tf.network.LossHolder]

classmethod get_rec_initial_output(initial_output=None, **kwargs)[source]
Parameters:

initial_output

Return type:

tf.Tensor

classmethod get_rec_initial_extra_outputs(batch_dim, rec_layer, **kwargs)[source]
Parameters:
Return type:

dict[str,tf.Tensor]

classmethod get_rec_initial_extra_outputs_shape_invariants(rec_layer, **kwargs)[source]
Parameters:

rec_layer (returnn.tf.layers.rec.RecLayer)

Returns:

optional shapes for the tensors by get_rec_initial_extra_outputs

Return type:

dict[str,tf.TensorShape]

kwargs: Optional[Dict[str]][source]
output_before_activation: Optional[OutputWithActivation][source]
output_loss: Optional[tf.Tensor][source]
search_choices: Optional[SearchChoices][source]
saveable_param_replace: Dict[tf.Variable, Union['tensorflow.python.training.saver.BaseSaverBuilder.SaveableObject', None]][source]
stats: Dict[str, tf.Tensor][source]

Unmasking Layer

class returnn.tf.layers.rec.UnmaskLayer(mask, **kwargs)[source]

This is meant to be used together with MaskedComputationLayer, which operates on input [B,T,D] and, given a mask, returns [B,T’,D’]. This layer (UnmaskLayer) undoes the masking, i.e. it recovers the original time dimension: given [B,T’,D’], we output [B,T,D’]. This is done by repeating, for each non-masked frame, the output of the last preceding masked frame.

If this layer is inside a recurrent loop, i.e. we get [B,D’] as input, this is a no-op, and we just return the input as is. In that case, the repetition logic is handled via MaskedComputationLayer.
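
Example usage (a minimal sketch; "masked" is assumed to be a MaskedComputationLayer, and "mask" is the same mask layer that was used there):

"unmasked": {"class": "unmask", "from": "masked", "mask": "mask"}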

Parameters:

mask (LayerBase) – the same as used for MaskedComputationLayer. Outside the loop: [B,T] or [T,B], original T. Inside the loop: just [B].

layer_class: Optional[str] = 'unmask'[source]
recurrent = True[source]
get_dep_layers()[source]
Return type:

list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace

  • network (returnn.tf.network.TFNetwork)

  • get_layer (((str) -> LayerBase)) – function to get or construct another layer

classmethod get_out_data_from_opts(name, network, sources, mask, **kwargs)[source]
Parameters:
Return type:

Data

classmethod get_rec_initial_extra_outputs(batch_dim, rec_layer, sources, **kwargs)[source]
Parameters:
Return type:

dict[str,tf.Tensor]

kwargs: Optional[Dict[str]][source]
output_before_activation: Optional[OutputWithActivation][source]
output_loss: Optional[tf.Tensor][source]
rec_vars_outputs: Dict[str, tf.Tensor][source]
search_choices: Optional[SearchChoices][source]
params: Dict[str, tf.Variable][source]
saveable_param_replace: Dict[tf.Variable, Union['tensorflow.python.training.saver.BaseSaverBuilder.SaveableObject', None]][source]
stats: Dict[str, tf.Tensor][source]

Recurrent Layer

class returnn.tf.layers.rec.RecLayer(unit='lstm', unit_opts=None, direction=None, input_projection=True, initial_state=None, max_seq_len=None, max_seq_len_via=None, forward_weights_init=None, recurrent_weights_init=None, bias_init=None, optimize_move_layers_out=None, cheating=False, unroll=False, back_prop=None, use_global_rec_step_offset=False, include_eos=False, debug=None, axis=None, in_dim=None, out_dim=None, **kwargs)[source]

Recurrent layer, has support for several implementations of LSTMs (via the unit argument), see TensorFlow LSTM Benchmark (https://returnn.readthedocs.io/en/latest/tf_lstm_benchmark.html), and also GRU, or simple RNN. Via the unit parameter, you specify the operation/model performed in the recurrence. It can be a string specifying an RNN cell, where all TF cells can be used; the “Cell” suffix can be omitted, and case is ignored. Some possible LSTM implementations are (in all cases for both CPU and GPU):

  • BasicLSTM (the cell), via official TF, pure TF implementation

  • LSTMBlock (the cell), via tf.contrib.rnn.

  • LSTMBlockFused, via tf.contrib.rnn. should be much faster than BasicLSTM

  • CudnnLSTM, via tf.contrib.cudnn_rnn. This is experimental yet.

  • NativeLSTM, our own native LSTM. should be faster than LSTMBlockFused.

  • NativeLstm2, improved own native LSTM, should be the fastest and most powerful.

We default to the currently tested fastest one, i.e. NativeLSTM. Note that they are currently not compatible with each other, i.e. in the way the parameters are represented.
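
Example for a plain bidirectional LSTM via the unit string (a minimal sketch; the layer names, dimension and unit choice are placeholders):

"lstm0_fwd": {"class": "rec", "unit": "nativelstm2", "direction": 1, "n_out": 512, "from": "data"},
"lstm0_bwd": {"class": "rec", "unit": "nativelstm2", "direction": -1, "n_out": 512, "from": "data"},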

A subnetwork can also be given which will be evaluated step-by-step, which can use attention over some separate input, which can be used to implement a decoder in a sequence-to-sequence scenario. The subnetwork will get the extern data from the parent net as templates, and if there is input to the RecLayer, then it will be available as the “source” data key in the subnetwork. The subnetwork is specified as a dict for the unit parameter. In the subnetwork, you can access outputs from layers from the previous time step when they are referred to with the “prev:” prefix.

Example:

{
    "class": "rec",
    "from": "input",
    "unit": {
      # Recurrent subnet here, operate on a single time-step:
      "output": {
        "class": "linear",
        "from": ["prev:output", "data:source"],
        "activation": "relu",
        "n_out": n_out},
    },
    "n_out": n_out},
}

More examples can be seen in test_TFNetworkRecLayer and test_TFEngine.

The subnetwork can automatically optimize the inner recurrent loop by moving layers out of the loop if possible. It will try to do that greedily. This can be disabled via the option optimize_move_layers_out. It assumes that those layers behave the same whether they are applied over the whole time dimension at once or per step without a time dimension. Examples for such layers are LinearLayer, RnnCellLayer or SelfAttentionLayer with option attention_left_only.

This layer can also be inside another RecLayer. In that case, it behaves similar to RnnCellLayer. (This support is still somewhat incomplete. It should work for the native units such as NativeLstm.)

Also see Recurrency.

Parameters:
  • unit (str|_SubnetworkRecCell) – the RNNCell/etc name, e.g. “nativelstm”. see comment below. alternatively a whole subnetwork, which will be executed step by step, and which can include “prev” in addition to “from” to refer to previous steps. The subnetwork is specified as a net dict in the config.

  • unit_opts (None|dict[str]) – passed to RNNCell creation

  • direction (int|None) – None|1 -> forward, -1 -> backward

  • input_projection (bool) – True -> input is multiplied with matrix. False only works if same input dim

  • initial_state (LayerBase|str|float|int|tuple|None)

  • max_seq_len (int|tf.Tensor|None) – if unit is a subnetwork. str will be evaluated. see code

  • max_seq_len_via (LayerBase|None) – like max_seq_len but via another layer

  • forward_weights_init (str) – see returnn.tf.util.basic.get_initializer()

  • recurrent_weights_init (str) – see returnn.tf.util.basic.get_initializer()

  • bias_init (str) – see returnn.tf.util.basic.get_initializer()

  • optimize_move_layers_out (bool|None) – will automatically move layers out of the loop when possible

  • cheating (bool) – Unused, is now part of ChoiceLayer

  • unroll (bool) – if possible, unroll the loop (implementation detail)

  • back_prop (bool|None) – for tf.while_loop. the default will use self.network.train_flag

  • use_global_rec_step_offset (bool)

  • include_eos (bool) – for search, whether we should include the frame where “end” is True

  • debug (bool|None)

  • axis (Dim|str) – specify the axis to iterate over. It can also be the special marker single_step_dim, or an outer recurrent time dim.

  • in_dim (Dim|None)

  • out_dim (Dim|None)

layer_class: Optional[str] = 'rec'[source]
recurrent = True[source]
SubnetworkRecCell[source]

alias of _SubnetworkRecCell

get_dep_layers()[source]
Return type:

list[LayerBase]

classmethod transform_source_and_axis(network, source_data=None, have_dyn_seq_len_end=False, axis=None, opts=None)[source]
Parameters:
Return type:

(Data|None, Dim)

classmethod transform_config_dict(d, network, get_layer)[source]

This method transforms the templates in the config dictionary into references of the layer instances (and creates them in the process).

Parameters:
  • d (dict[str]) – will modify inplace

  • network (returnn.tf.network.TFNetwork)

  • get_layer (((str) -> LayerBase)) – function to get or construct another layer

classmethod get_out_data_from_opts(name, network, sources, unit, axis=None, in_dim=None, out_dim=None, initial_state=None, **kwargs)[source]
Parameters:
Return type:

Data

get_absolute_name_scope_prefix()[source]
Return type:

str

classmethod get_rec_initial_extra_outputs(**kwargs)[source]
Return type:

dict[str,tf.Tensor|tuple[tf.Tensor]]

classmethod get_rec_initial_output(**kwargs)[source]
Return type:

tf.Tensor

classmethod get_rnn_cell_class(name, cell_only=False)[source]
Parameters:
  • name (str|type) – cell name, minus the “Cell” at the end

  • cell_only (bool) – i.e. for single-step execution

Return type:

type[rnn_cell.RNNCell]|type[returnn.tf.native_op.RecSeqCellOp]

classmethod get_losses(name, network, output, loss=None, reduce_func=None, layer=None, **kwargs)[source]
Parameters:
  • name (str) – layer name

  • network (returnn.tf.network.TFNetwork)

  • loss (Loss|None) – argument just as for __init__

  • output (Data) – the output (template) for the layer

  • reduce_func (((tf.Tensor)->tf.Tensor)|None)

  • layer (LayerBase|None)

  • kwargs – other layer kwargs

Return type:

list[returnn.tf.network.LossHolder]

get_constraints_value()[source]
Return type:

tf.Tensor

static convert_cudnn_canonical_to_lstm_block(reader, prefix, target='lstm_block_wrapper/')[source]

This assumes CudnnLSTM currently, with num_layers=1, input_mode=”linear_input”, direction=’unidirectional’!

Parameters:
  • reader (tf.train.CheckpointReader)

  • prefix (str) – e.g. “layer2/rec/”

  • target (str) – e.g. “lstm_block_wrapper/” or “rnn/lstm_cell/”

Returns:

dict key -> value, {”…/kernel”: …, “…/bias”: …} with prefix

Return type:

dict[str,numpy.ndarray]

get_last_hidden_state(key)[source]
Parameters:

key (str|int|None)

Return type:

tf.Tensor

classmethod is_prev_step_layer(layer)[source]
Parameters:

layer (LayerBase)

Return type:

bool

get_sub_layer(layer_name)[source]
Parameters:

layer_name (str) – name of the sub_layer (right part of ‘/’ separated path)

Returns:

the sub_layer addressed in layer_name or None if no sub_layer exists

Return type:

LayerBase|None

classmethod get_available_sub_layer_names(parent_layer_kwargs)[source]
Parameters:

parent_layer_kwargs (dict[str])

Return type:

list[str]

get_sub_networks()[source]
Return type:

list[returnn.tf.network.TFNetwork]

get_sub_layers()[source]
Return type:

list[LayerBase]

input_data: Optional[Data][source]
kwargs: Optional[Dict[str]][source]
output_before_activation: Optional[OutputWithActivation][source]
output_loss: Optional[tf.Tensor][source]
rec_vars_outputs: Dict[str, tf.Tensor][source]
search_choices: Optional[SearchChoices][source]
params: Dict[str, tf.Variable][source]
saveable_param_replace: Dict[tf.Variable, Union['tensorflow.python.training.saver.BaseSaverBuilder.SaveableObject', None]][source]
stats: Dict[str, tf.Tensor][source]

RNN Cell Layer

class returnn.tf.layers.rec.RnnCellLayer(n_out, unit, unit_opts=None, initial_state=None, initial_output=None, weights_init='xavier', **kwargs)[source]

Wrapper around tf.contrib.rnn.RNNCell. This will operate a single step, i.e. there is no time dimension, i.e. we expect a (batch,n_in) input, and our output is (batch,n_out). This is expected to be used inside a RecLayer. (But it can also handle the case of being optimized out of the rec loop, i.e. outside a RecLayer, with a time dimension.)
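
Example usage inside a rec unit (a minimal sketch; the layer names and dimension are placeholders):

"s": {"class": "rnn_cell", "unit": "LSTMBlock", "from": ["prev:target_embed", "prev:att"], "n_out": 512}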

Parameters:
  • n_out (int) – so far, only output shape (batch,n_out) supported

  • unit (str|tf.contrib.rnn.RNNCell) – e.g. “BasicLSTM” or “LSTMBlock”

  • unit_opts (dict[str]|None) – passed to the cell.__init__

  • initial_state (str|float|LayerBase|tuple[LayerBase]|dict[LayerBase]) – see self.get_rec_initial_state(). This will be set via transform_config_dict(). To get the state from another recurrent layer, use the GetLastHiddenStateLayer (get_last_hidden_state).

  • initial_output (None) – the initial output is defined implicitly via initial state, thus don’t set this

layer_class: Optional[str] = 'rnn_cell'[source]
recurrent = True[source]
classmethod get_out_data_from_opts(n_out, name, sources=(), **kwargs)[source]
Parameters:
  • n_out (int)

  • name (str) – layer name

  • sources (list[LayerBase])

Return type:

Data

get_absolute_name_scope_prefix()[source]
Return type:

str

get_dep_layers()[source]
Return type:

list[tf.Tensor]

classmethod get_hidden_state_size(n_out, unit, unit_opts=None, **kwargs)[source]
Parameters:
  • n_out (int)

  • unit (str)

  • unit_opts (dict[str]|None)

Returns:

size or tuple of sizes

Return type:

int|tuple[int]

classmethod get_output_from_state(state, unit)[source]
Parameters:
  • state (tuple[tf.Tensor]|tf.Tensor)

  • unit (str)

Return type:

tf.Tensor

get_hidden_state()[source]
Returns:

state as defined by the cell

Return type:

tuple[tf.Tensor]|tf.Tensor

classmethod get_state_by_key(state, key, shape=None)[source]
Parameters:
  • state (tf.Tensor|tuple[tf.Tensor]|namedtuple)

  • key (int|str|None)

  • shape (tuple[int|None]) – Shape of the state.

Return type:

tf.Tensor

get_last_hidden_state(key)[source]
Parameters:

key (int|str|None)

Return type:

tf.Tensor

classmethod get_rec_initial_state(batch_dim, name, unit, sources, n_out=None, in_dim=None, out_dim=None, initial_state=None, unit_opts=None, rec_layer=None, axis=None, **kwargs)[source]

Very similar to get_rec_initial_output(). Initial hidden state when used inside a recurrent layer for the frame t=-1, if it is needed. As arguments, we get the usual layer arguments. batch_dim is added because it might be special because of beam search. Also see transform_config_dict() for initial_state.

Note: This could maybe share code with get_rec_initial_output(), although it is a bit more generic here because the state can also be a namedtuple or any kind of nested structure.

Parameters:
  • batch_dim (tf.Tensor) – including beam size in beam search

  • name (str) – layer name

  • n_out (int|None) – out dim

  • in_dim (Dim|None)

  • out_dim (Dim|None) – out dim

  • unit (str) – cell name

  • sources (list[LayerBase])

  • unit_opts (dict[str]|None)

  • initial_state (LayerBase|str|int|float|None|list|tuple|namedtuple) – see code

  • rec_layer (RecLayer|LayerBase|None) – for the scope

  • axis (Dim|None)

Return type:

tf.Tensor|tuple[tf.Tensor]|namedtuple

classmethod get_rec_initial_state_inner(initial_shape, name, state_key=None, key=None, initial_state=None, shape_invariant=None, rec_layer=None)[source]

Generate initial hidden state. Primarily used as an inner function for RnnCellLayer.get_rec_initial_state().

Parameters:
  • initial_shape (tuple) – shape of the initial state.

  • name (str) – layer name.

  • state_key (str|None) – key to be used to get the state from final_rec_vars. “state” by default.

  • key (str|int|None) – key/attribute of the state if state is a dictionary/namedtuple (like ‘c’ and ‘h’ for LSTM states).

  • initial_state (LayerBase|str|int|float|None|list|tuple|namedtuple) – see code

  • shape_invariant (tuple) – If provided, directly used. Otherwise, guessed from initial_shape (see code below).

  • rec_layer (RecLayer|LayerBase|None) – For the scope.

Return type:

tf.Tensor

classmethod get_rec_initial_extra_outputs(**kwargs)[source]
Return type:

dict[str,tf.Tensor|tuple[tf.Tensor]]

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace

  • network (returnn.tf.network.TFNetwork)

  • get_layer (((str) -> LayerBase)) – function to get or construct another layer

static transform_initial_state(initial_state, network, get_layer)[source]
Parameters:
  • initial_state (str|float|int|list[str|float|int]|dict[str]|None)

  • network (returnn.tf.network.TFNetwork)

  • get_layer (((str) -> LayerBase)) – function to get or construct another layer

classmethod get_rec_initial_output(unit, initial_output=None, initial_state=None, **kwargs)[source]
Parameters:
  • unit (str)

  • initial_output (None)

  • initial_state (LayerBase|str|int|float|None|list|tuple|namedtuple)

Return type:

tf.Tensor

input_data: Optional[Data][source]
kwargs: Optional[Dict[str]][source]
output_before_activation: Optional[OutputWithActivation][source]
output_loss: Optional[tf.Tensor][source]
rec_vars_outputs: Dict[str, tf.Tensor][source]
search_choices: Optional[SearchChoices][source]
params: Dict[str, tf.Variable][source]
saveable_param_replace: Dict[tf.Variable, Union['tensorflow.python.training.saver.BaseSaverBuilder.SaveableObject', None]][source]
stats: Dict[str, tf.Tensor][source]

Self-Attention Layer

class returnn.tf.layers.rec.SelfAttentionLayer(num_heads, total_key_dim, key_shift=None, forward_weights_init='glorot_uniform', attention_dropout=0.0, attention_left_only=False, initial_state=None, restrict_state_to_last_seq=False, state_var_lengths=None, **kwargs)[source]

Applies self-attention on the input. I.e., with input x, it will basically calculate

att(Q x, K x, V x),

where att is multi-head dot-attention for now, and Q, K, V are matrices. The attention will be over the time-dimension. If there is no time-dimension, we expect to be inside a RecLayer; also, this is only valid with attention_left_only=True (attention to the past only).

See also dot_product_attention here:

https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/layers/common_attention.py
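
Example usage for masked (left-only) self-attention as in a Transformer decoder block (a minimal sketch; the source layer name and dimensions are placeholders):

"self_att": {"class": "self_attention", "from": "ln_in", "num_heads": 8,
             "total_key_dim": 512, "n_out": 512,
             "attention_left_only": True, "attention_dropout": 0.1}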

Parameters:
  • num_heads (int)

  • total_key_dim (int) – i.e. key_dim == total_key_dim // num_heads

  • key_shift (LayerBase|None) – additive term to the key. can be used for relative positional encoding. Should be of shape (num_queries,num_keys,key_dim), currently without batch-dimension. I.e. that should be shape (1,t,key_dim) inside rec-layer or (T,T,key_dim) outside.

  • forward_weights_init (str) – see returnn.tf.util.basic.get_initializer()

  • attention_dropout (float)

  • attention_left_only (bool) – will mask out the future. see Attention is all you need.

  • initial_state (str|float|int|None) – see RnnCellLayer.get_rec_initial_state_inner().

  • restrict_state_to_last_seq (bool) – see code comment below

  • state_var_lengths (None|tf.Tensor|()->tf.Tensor) – if passed, a Tensor containing the number of keys in the state_var for each batch-entry, used for decoding in RASR.

layer_class: Optional[str] = 'self_attention'[source]
recurrent = True[source]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace

  • network (returnn.tf.network.TFNetwork)

  • get_layer (((str) -> LayerBase)) – function to get or construct another layer

classmethod get_out_data_from_opts(name, sources, n_out=<class 'returnn.util.basic.NotSpecified'>, out_dim=<class 'returnn.util.basic.NotSpecified'>, **kwargs)[source]
Parameters:
Return type:

Data

classmethod get_rec_initial_extra_outputs(batch_dim, rec_layer, network, num_heads, total_key_dim, name, out_dim=<class 'returnn.util.basic.NotSpecified'>, n_out=<class 'returnn.util.basic.NotSpecified'>, initial_state=None, sources=(), **kwargs)[source]
Parameters:
Return type:

dict[str, tf.Tensor]

classmethod get_rec_initial_extra_outputs_shape_invariants(rec_layer, sources, network, num_heads, total_key_dim, out_dim=<class 'returnn.util.basic.NotSpecified'>, n_out=<class 'returnn.util.basic.NotSpecified'>, **kwargs)[source]
Parameters:
Return type:

dict[str, tf.TensorShape]

post_process_final_rec_vars_outputs(rec_vars_outputs, seq_len)[source]
Parameters:
  • rec_vars_outputs (dict[str,tf.Tensor])

  • seq_len (tf.Tensor) – shape (batch,)

Return type:

dict[str,tf.Tensor]

input_data: Optional[Data][source]
kwargs: Optional[Dict[str]][source]
output_before_activation: Optional[OutputWithActivation][source]
output_loss: Optional[tf.Tensor][source]
rec_vars_outputs: Dict[str, tf.Tensor][source]
search_choices: Optional[SearchChoices][source]
params: Dict[str, tf.Variable][source]
saveable_param_replace: Dict[tf.Variable, Union['tensorflow.python.training.saver.BaseSaverBuilder.SaveableObject', None]][source]
stats: Dict[str, tf.Tensor][source]