returnn.tf.layers.rec

Defines multiple recurrent layers, most importantly RecLayer.

class returnn.tf.layers.rec.RecLayer(unit='lstm', unit_opts=None, direction=None, input_projection=True, initial_state=None, max_seq_len=None, max_seq_len_via=None, forward_weights_init=None, recurrent_weights_init=None, bias_init=None, optimize_move_layers_out=None, cheating=False, unroll=False, back_prop=None, use_global_rec_step_offset=False, include_eos=False, debug=None, axis=None, in_dim=None, out_dim=None, **kwargs)[source]

Recurrent layer with support for several LSTM implementations (selected via the unit argument; see TensorFlow LSTM Benchmark (https://returnn.readthedocs.io/en/latest/tf_lstm_benchmark.html)), as well as GRU and simple RNN. The unit parameter specifies the operation/model performed in the recurrence. It can be a string naming an RNN cell: all TF cells can be used, the “Cell” suffix can be omitted, and case is ignored. Some possible LSTM implementations are (in all cases for both CPU and GPU):

  • BasicLSTM (the cell), via official TF, pure TF implementation

  • LSTMBlock (the cell), via tf.contrib.rnn.

  • LSTMBlockFused, via tf.contrib.rnn. Should be much faster than BasicLSTM.

  • CudnnLSTM, via tf.contrib.cudnn_rnn. This is still experimental.

  • NativeLSTM, our own native LSTM. Should be faster than LSTMBlockFused.

  • NativeLstm2, an improved version of our native LSTM. Should be the fastest and most powerful.

We default to the fastest implementation tested so far, i.e. NativeLSTM. Note that these implementations are currently not compatible with each other, i.e. they differ in how the parameters are represented.
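As a minimal sketch of selecting an implementation by name, the following network dict entry picks NativeLstm2 via the unit string. The layer name lstm0, the input name data, and the hidden size are assumptions chosen only for illustration.

```python
# Hypothetical network dict fragment; layer names and sizes are illustrative only.
n_hidden = 512  # assumed hidden size

network = {
    "lstm0": {
        "class": "rec",
        "unit": "nativelstm2",  # case is ignored, so "NativeLstm2" names the same unit
        "n_out": n_hidden,
        "from": "data",
    },
}
```

Since the parameter representations differ between implementations, switching the unit string of an already trained layer should not be expected to work with the existing parameters.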

A subnetwork can also be given, which will be evaluated step by step. It can use attention over some separate input, so it can be used to implement a decoder in a sequence-to-sequence scenario. The subnetwork will get the extern data from the parent net as templates, and if there is input to the RecLayer, then it will be available as the “source” data key in the subnetwork. The subnetwork is specified as a dict for the unit parameter. In the subnetwork, you can access outputs from layers from the previous time step when they are referred to with the “prev:” prefix.

Example:

{
    "class": "rec",
    "from": "input",
    "unit": {
      # Recurrent subnet here, operates on a single time step:
      "output": {
        "class": "linear",
        "from": ["prev:output", "data:source"],
        "activation": "relu",
        "n_out": n_out},
    },
    "n_out": n_out,
}

More examples can be seen in test_TFNetworkRecLayer and test_TFEngine.
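To make the “prev:” semantics of the example concrete, here is a plain NumPy sketch of the computation that such a subnetwork describes. It is an illustration only, not RETURNN code: the function name run_rec_subnet is hypothetical, and a zero initial output for “prev:output” in the first step is an assumption.

```python
import numpy as np


def run_rec_subnet(source, weights, bias):
    """
    Plain NumPy illustration of the example subnetwork:
    output[t] = relu(concat(output[t-1], source[t]) @ weights + bias).

    source: (time, batch, n_in), weights: (n_out + n_in, n_out), bias: (n_out,)
    """
    n_time, n_batch, _ = source.shape
    n_out = bias.shape[0]
    prev_output = np.zeros((n_batch, n_out))  # assumed zero initial value for "prev:output"
    outputs = []
    for t in range(n_time):
        # Corresponds to "from": ["prev:output", "data:source"]
        step_in = np.concatenate([prev_output, source[t]], axis=-1)
        prev_output = np.maximum(step_in @ weights + bias, 0.0)  # linear + relu
        outputs.append(prev_output)
    return np.stack(outputs)  # (time, batch, n_out)


rng = np.random.default_rng(0)
source = rng.normal(size=(5, 2, 3))
weights = rng.normal(size=(4 + 3, 4))
outputs = run_rec_subnet(source, weights, bias=np.zeros(4))
```

Each step only sees the current frame of the input and the previous step's output, which is why the layer definitions inside the unit dict are written without a time dimension.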

The subnetwork can automatically optimize the inner recurrent loop by moving layers out of the loop if possible. It tries to do that greedily. This can be disabled via the option optimize_move_layers_out. It assumes that those layers behave the same whether they are applied to the whole sequence with a time dimension or per step without one. Examples of such layers are LinearLayer, RnnCellLayer, or SelfAttentionLayer with the option attention_left_only.

This layer can also be inside another RecLayer. In that case, it behaves similarly to RnnCellLayer. (This support is still somewhat incomplete. It should work for the native units such as NativeLstm.)

Also see Recurrency.

Parameters:
  • unit (str|_SubnetworkRecCell) – the RNNCell/etc. name, e.g. “nativelstm”; see the description above. Alternatively, a whole subnetwork, which will be executed step by step, and which can use “prev:” in “from” to refer to previous steps. The subnetwork is specified as a net dict in the config.

  • unit_opts (None|dict[str]) – passed to RNNCell creation

  • direction (int|None) – None|1 -> forward, -1 -> backward

  • input_projection (bool) – True -> input is multiplied with matrix. False only works if same input dim

  • initial_state (LayerBase|str|float|int|tuple|None)

  • max_seq_len (int|tf.Tensor|None) – if unit is a subnetwork. str will be evaluated. see code

  • max_seq_len_via (LayerBase|None) – like max_seq_len but via another layer

  • forward_weights_init (str) – see returnn.tf.util.basic.get_initializer()

  • recurrent_weights_init (str) – see returnn.tf.util.basic.get_initializer()

  • bias_init (str) – see returnn.tf.util.basic.get_initializer()

  • optimize_move_layers_out (bool|None) – will automatically move layers out of the loop when possible

  • cheating (bool) – Unused, is now part of ChoiceLayer

  • unroll (bool) – if possible, unroll the loop (implementation detail)

  • back_prop (bool|None) – for tf.while_loop. the default will use self.network.train_flag

  • use_global_rec_step_offset (bool)

  • include_eos (bool) – for search, whether we should include the frame where “end” is True

  • debug (bool|None)

  • axis (Dim|str) – specify the axis to iterate over. It can also be the special marker single_step_dim, or an outer recurrent time dim.

  • in_dim (Dim|None)

  • out_dim (Dim|None)
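As a sketch of how the direction parameter might be used, the following hypothetical network dict builds a bidirectional LSTM from two RecLayers and feeds both into a following layer. All layer names and sizes are assumptions for illustration.

```python
# Hypothetical network dict fragment; layer names and sizes are illustrative only.
network = {
    # direction=1 (or None) iterates forward in time.
    "lstm_fw": {"class": "rec", "unit": "nativelstm2", "n_out": 256, "direction": 1, "from": "data"},
    # direction=-1 iterates backward in time.
    "lstm_bw": {"class": "rec", "unit": "nativelstm2", "n_out": 256, "direction": -1, "from": "data"},
    # Both directions are used as input to the next layer.
    "proj": {"class": "linear", "activation": None, "n_out": 128, "from": ["lstm_fw", "lstm_bw"]},
}
```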

layer_class: Optional[str] = 'rec'[source]
recurrent = True[source]
SubnetworkRecCell[source]

alias of _SubnetworkRecCell

get_dep_layers()[source]
Return type:

list[LayerBase]

classmethod transform_source_and_axis(network, source_data=None, have_dyn_seq_len_end=False, axis=None, opts=None)[source]
Return type:

(Data|None, Dim)

classmethod transform_config_dict(d, network, get_layer)[source]

This method transforms the templates in the config dictionary into references of the layer instances (and creates them in the process).

Parameters:
  • d (dict[str]) – will modify inplace

  • network (returnn.tf.network.TFNetwork)

  • get_layer (((str) -> LayerBase)) – function to get or construct another layer

classmethod get_out_data_from_opts(name, network, sources, unit, axis=None, in_dim=None, out_dim=None, initial_state=None, **kwargs)[source]
Return type:

Data

get_absolute_name_scope_prefix()[source]
Return type:

str

classmethod get_rec_initial_extra_outputs(**kwargs)[source]
Return type:

dict[str,tf.Tensor|tuple[tf.Tensor]]

classmethod get_rec_initial_output(**kwargs)[source]
Return type:

tf.Tensor

classmethod get_rnn_cell_class(name, cell_only=False)[source]
Parameters:
  • name (str|type) – cell name, minus the “Cell” at the end

  • cell_only (bool) – i.e. for single-step execution

Return type:

type[rnn_cell.RNNCell]|type[returnn.tf.native_op.RecSeqCellOp]

classmethod get_losses(name, network, output, loss=None, reduce_func=None, layer=None, **kwargs)[source]
Parameters:
  • name (str) – layer name

  • network (returnn.tf.network.TFNetwork)

  • loss (Loss|None) – argument just as for __init__

  • output (Data) – the output (template) for the layer

  • reduce_func (((tf.Tensor)->tf.Tensor)|None)

  • layer (LayerBase|None)

  • kwargs – other layer kwargs

Return type:

list[returnn.tf.network.LossHolder]

get_constraints_value()[source]
Return type:

tf.Tensor

static convert_cudnn_canonical_to_lstm_block(reader, prefix, target='lstm_block_wrapper/')[source]

This currently assumes CudnnLSTM with num_layers=1, input_mode="linear_input", direction="unidirectional"!

Parameters:
  • reader (tf.train.CheckpointReader)

  • prefix (str) – e.g. “layer2/rec/”

  • target (str) – e.g. “lstm_block_wrapper/” or “rnn/lstm_cell/”

Returns:

dict key -> value, {"…/kernel": …, "…/bias": …} with prefix

Return type:

dict[str,numpy.ndarray]

get_last_hidden_state(key)[source]
Parameters:

key (str|int|None)

Return type:

tf.Tensor

classmethod is_prev_step_layer(layer)[source]
Parameters:

layer (LayerBase)

Return type:

bool

get_sub_layer(layer_name)[source]
Parameters:

layer_name (str) – name of the sub_layer (right part of ‘/’ separated path)

Returns:

the sub_layer addressed in layer_name or None if no sub_layer exists

Return type:

LayerBase|None

classmethod get_available_sub_layer_names(parent_layer_kwargs)[source]
Parameters:

parent_layer_kwargs (dict[str])

Return type:

list[str]

get_sub_networks()[source]
Return type:

list[returnn.tf.network.TFNetwork]

get_sub_layers()[source]
Return type:

list[LayerBase]

class returnn.tf.layers.rec.RecStepInfoLayer(i=None, prev_end_flag=None, prev_end_layer=None, seq_lens=None, **kwargs)[source]

Used by _SubnetworkRecCell. Represents the current step number. Usually via TFNetwork.set_rec_step_info().

Parameters:
  • i (tf.Tensor|None) – scalar, int32, current step (time)

  • prev_end_flag (tf.Tensor|None) – (batch,), bool, says that the current sequence has ended. Can be with beam. In that case, end_flag_source should be “prev:end”, and define the search choices.

  • prev_end_layer (LayerBase|None) – corresponds to the “prev:end” layer if available

  • seq_lens (tf.Tensor|None) – (batch,) int32, seq lens

layer_class: Optional[str] = ':i'[source]
get_prev_end_flag(target_search_choices)[source]
Parameters:

target_search_choices (SearchChoices|None)

Returns:

(batch,) of type bool. batch might include beam size. This returns the end flag corresponding to the last frame. I.e. if the “end” layer exists and is used, this is “prev:end”.

Return type:

tf.Tensor

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace

  • network (returnn.tf.network.TFNetwork)

  • get_layer (((str) -> LayerBase)) – function to get or construct another layer

classmethod get_out_data_from_opts(network, **kwargs)[source]
Parameters:

network (returnn.tf.network.TFNetwork)

Return type:

Data

class returnn.tf.layers.rec.RecLastOutputLayer(rec_layer, sub_layer_name, **kwargs)[source]

Gets the last output from some sub layer inside a RecLayer. You should explicitly set need_last on the specific layer such that this information is available.

Parameters:
  • rec_layer (RecLayer)

  • sub_layer_name (str)
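A minimal sketch of how this might look in a network dict is given below. The layer names are hypothetical, and passing the rec layer by its name in rec_layer is an assumption (the layer reference would then be resolved via transform_config_dict()).

```python
# Hypothetical network dict fragment; layer names and sizes are illustrative only.
network = {
    "loop": {
        "class": "rec",
        "from": "data",
        "unit": {
            # need_last makes the last frame of this sub layer available afterwards.
            "h": {"class": "linear", "activation": "tanh", "n_out": 64,
                  "from": ["prev:h", "data:source"], "need_last": True},
            "output": {"class": "copy", "from": "h"},
        },
    },
    # Assumed usage: rec_layer given as the name of the RecLayer defined above.
    "final_h": {"class": "rec_last_output", "rec_layer": "loop", "sub_layer_name": "h"},
}
```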

layer_class: Optional[str] = 'rec_last_output'[source]
classmethod transform_config_dict(d, network, get_layer)[source]
classmethod get_out_data_from_opts(rec_layer, sub_layer_name, name, **kwargs)[source]
Parameters:
  • rec_layer (RecLayer)

  • sub_layer_name (str)

  • name (str)

Return type:

Data

class returnn.tf.layers.rec.RnnCellLayer(n_out, unit, unit_opts=None, initial_state=None, initial_output=None, weights_init='xavier', **kwargs)[source]

Wrapper around tf.contrib.rnn.RNNCell. This will operate a single step, i.e. there is no time dimension, i.e. we expect a (batch,n_in) input, and our output is (batch,n_out). This is expected to be used inside a RecLayer. (But it can also handle the case of being optimized out of the rec loop, i.e. outside a RecLayer, with a time dimension.)

Parameters:
  • n_out (int) – so far, only output shape (batch,n_out) supported

  • unit (str|tf.contrib.rnn.RNNCell) – e.g. “BasicLSTM” or “LSTMBlock”

  • unit_opts (dict[str]|None) – passed to the cell.__init__

  • initial_state (str|float|LayerBase|tuple[LayerBase]|dict[LayerBase]) – see self.get_rec_initial_state(). This will be set via transform_config_dict(). To get the state from another recurrent layer, use the GetLastHiddenStateLayer (get_last_hidden_state).

  • initial_output (None) – the initial output is defined implicitly via initial state, thus don’t set this
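The following hypothetical network dict sketches the intended usage, with an rnn_cell layer inside the unit of a RecLayer. Layer names and sizes are assumptions for illustration.

```python
# Hypothetical network dict fragment; layer names and sizes are illustrative only.
network = {
    "rec0": {
        "class": "rec",
        "from": "data",
        "unit": {
            # One LSTM step per frame: input (batch, n_in) -> output (batch, n_out).
            "cell": {"class": "rnn_cell", "unit": "LSTMBlock", "n_out": 128, "from": "data:source"},
            "output": {"class": "copy", "from": "cell"},
        },
    },
}
```

Note that initial_output is left unset here, since the initial output is defined implicitly via the initial state.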

layer_class: Optional[str] = 'rnn_cell'[source]
recurrent = True[source]
classmethod get_out_data_from_opts(n_out, name, sources=(), **kwargs)[source]
Parameters:
  • n_out (int)

  • name (str) – layer name

  • sources (list[LayerBase])

Return type:

Data

get_absolute_name_scope_prefix()[source]
Return type:

str

get_dep_layers()[source]
Return type:

list[LayerBase]

classmethod get_hidden_state_size(n_out, unit, unit_opts=None, **kwargs)[source]
Parameters:
  • n_out (int)

  • unit (str)

  • unit_opts (dict[str]|None)

Returns:

size or tuple of sizes

Return type:

int|tuple[int]

classmethod get_output_from_state(state, unit)[source]
Parameters:
  • state (tuple[tf.Tensor]|tf.Tensor)

  • unit (str)

Return type:

tf.Tensor

get_hidden_state()[source]
Returns:

state as defined by the cell

Return type:

tuple[tf.Tensor]|tf.Tensor

classmethod get_state_by_key(state, key, shape=None)[source]
Parameters:
  • state (tf.Tensor|tuple[tf.Tensor]|namedtuple)

  • key (int|str|None)

  • shape (tuple[int|None]) – Shape of the state.

Return type:

tf.Tensor

get_last_hidden_state(key)[source]
Parameters:

key (int|str|None)

Return type:

tf.Tensor

classmethod get_rec_initial_state(batch_dim, name, unit, sources, n_out=None, in_dim=None, out_dim=None, initial_state=None, unit_opts=None, rec_layer=None, axis=None, **kwargs)[source]

Very similar to get_rec_initial_output(). Initial hidden state when used inside a recurrent layer for the frame t=-1, if it is needed. As arguments, we get the usual layer arguments. batch_dim is added because it might be special because of beam search. Also see transform_config_dict() for initial_state.

Note: This could maybe share code with get_rec_initial_output(), although it is a bit more generic here because the state can also be a namedtuple or any kind of nested structure.

Parameters:
  • batch_dim (tf.Tensor) – including beam size in beam search

  • name (str) – layer name

  • n_out (int|None) – out dim

  • in_dim (Dim|None)

  • out_dim (Dim|None) – out dim

  • unit (str) – cell name

  • sources (list[LayerBase])

  • unit_opts (dict[str]|None)

  • initial_state (LayerBase|str|int|float|None|list|tuple|namedtuple) – see code

  • rec_layer (RecLayer|LayerBase|None) – for the scope

  • axis (Dim|None)

Return type:

tf.Tensor|tuple[tf.Tensor]|namedtuple

classmethod get_rec_initial_state_inner(initial_shape, name, state_key=None, key=None, initial_state=None, shape_invariant=None, rec_layer=None)[source]

Generate initial hidden state. Primarily used as an inner function for RnnCellLayer.get_rec_initial_state.

Parameters:
  • initial_shape (tuple) – shape of the initial state.

  • name (str) – layer name.

  • state_key (str|None) – key to be used to get the state from final_rec_vars. “state” by default.

  • key (str|int|None) – key/attribute of the state if state is a dictionary/namedtuple (like ‘c’ and ‘h’ for LSTM states).

  • initial_state (LayerBase|str|int|float|None|list|tuple|namedtuple) – see code

  • shape_invariant (tuple) – If provided, directly used. Otherwise, guessed from initial_shape (see code below).

  • rec_layer (RecLayer|LayerBase|None) – For the scope.

Return type:

tf.Tensor

classmethod get_rec_initial_extra_outputs(**kwargs)[source]
Return type:

dict[str,tf.Tensor|tuple[tf.Tensor]]

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace

  • network (returnn.tf.network.TFNetwork)

  • get_layer (((str) -> LayerBase)) – function to get or construct another layer

static transform_initial_state(initial_state, network, get_layer)[source]
Parameters:
  • initial_state (str|float|int|list[str|float|int]|dict[str]|None)

  • network (returnn.tf.network.TFNetwork)

  • get_layer (((str) -> LayerBase)) – function to get or construct another layer

classmethod get_rec_initial_output(unit, initial_output=None, initial_state=None, **kwargs)[source]
Parameters:
  • unit (str)

  • initial_output (None)

  • initial_state (LayerBase|str|int|float|None|list|tuple|namedtuple)

Return type:

tf.Tensor

class returnn.tf.layers.rec.GetLastHiddenStateLayer(out_dim=None, n_out=None, combine='concat', key='*', **kwargs)[source]

Will combine (concat or add or so) all the last hidden states from all sources.

Parameters:
  • out_dim (Dim|None)

  • n_out (int|None) – dimension. output will be of shape (batch, n_out)

  • combine (str) – “concat” or “add”

  • key (str|int|None) – for the state, which could be a namedtuple. see RnnCellLayer.get_state_by_key()
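As a hedged sketch, the last cell states of two recurrent encoder layers could be combined like this. The layer names are hypothetical, and it is an assumption that the LSTM state of these units exposes a “c” entry that can be selected via key.

```python
# Hypothetical network dict fragment; layer names and sizes are illustrative only.
enc_dim = 256  # assumed size of each encoder direction

network = {
    "enc_fw": {"class": "rec", "unit": "nativelstm2", "n_out": enc_dim, "direction": 1, "from": "data"},
    "enc_bw": {"class": "rec", "unit": "nativelstm2", "n_out": enc_dim, "direction": -1, "from": "data"},
    "enc_state": {
        "class": "get_last_hidden_state",
        "from": ["enc_fw", "enc_bw"],
        "combine": "concat",  # concatenating two states of size enc_dim
        "key": "c",           # assumed: selects the cell state of the LSTM state
        "n_out": 2 * enc_dim,
    },
}
```

With combine set to “add”, n_out would stay at enc_dim in this sketch, since the states are summed instead of concatenated.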

layer_class: Optional[str] = 'get_last_hidden_state'[source]
get_last_hidden_state(key)[source]
Parameters:

key (str|None)

Return type:

tf.Tensor

classmethod get_out_data_from_opts(name, sources, out_dim=None, n_out=None, **kwargs)[source]
Parameters:
  • name (str)

  • sources (list[LayerBase])

  • out_dim (Dim|None)

  • n_out (int|None) – dimension. output will be of shape (batch, n_out)

Return type:

Data

class returnn.tf.layers.rec.GetRecAccumulatedOutputLayer(sub_layer, **kwargs)[source]

For RecLayer with a subnet. If some layer is explicitly marked as an additional output layer (via ‘is_output_layer’: True), you can get that subnet layer output via this accessor. Retrieves the accumulated output.

Note that this functionality is now obsolete. You can simply access such a sub layer via the generic sub layer access mechanism, i.e. instead of:

"sub_layer": {"class": "get_rec_accumulated", "from": "rec_layer", "sub_layer": "hidden"}

You can do:

"sub_layer": {"class": "copy", "from": "rec_layer/hidden"}
Parameters:

sub_layer (str) – layer of subnet in RecLayer source, which has ‘is_output_layer’: True

layer_class: Optional[str] = 'get_rec_accumulated'[source]
classmethod get_out_data_from_opts(name, sources, sub_layer, **kwargs)[source]
Parameters:
  • name (str)

  • sources (list[LayerBase])

  • sub_layer (str)

Return type:

Data

class returnn.tf.layers.rec.RecUnstackLayer(axis=None, declare_rec_time=False, **kwargs)[source]

This is supposed to be used inside a RecLayer. The input is supposed to be outside the rec layer (i.e. via base:). Uses tf.TensorArray and then unstack on the inputs to make it available per-frame. This is an alternative to making some input to the rec layer, such that the rec layer can have multiple inputs (as long as they have the same time dim).

Note that due to automatic optimization, this layer will be optimized out of the rec loop anyway, and then the tf.TensorArray logic happens internally in RecLayer, thus we do not need to care about this here. (See get_input_moved_out for some internal handling.)

Effectively, this layer is very similar to CopyLayer, with the only special behavior that it checks (or even assigns) the loop dimension of RecLayer.

Due to automatic optimization, not much happens here. The real logic happens in get_out_data_from_opts().

Note that it is allowed to leave both axis and declare_rec_time unset, in case you assign axis to the rec layer, and the source here has the same axis (dim tag).

Parameters:
  • axis (str|Dim|None)

  • declare_rec_time (bool)
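A possible usage is sketched below: the rec layer has no direct input, and a rec_unstack layer inside the unit makes an outer layer available per frame. The layer names are hypothetical, and the axis value "T" (meant as the time axis of the source) is an assumption.

```python
# Hypothetical network dict fragment; layer names and sizes are illustrative only.
network = {
    "encoder": {"class": "linear", "activation": "tanh", "n_out": 64, "from": "data"},
    "loop": {
        "class": "rec",
        "from": [],  # no direct input; the loop dimension is declared via rec_unstack below
        "unit": {
            # "base:" refers to a layer outside of the rec layer.
            "x": {"class": "rec_unstack", "from": "base:encoder", "axis": "T", "declare_rec_time": True},
            "output": {"class": "linear", "activation": "relu", "n_out": 64, "from": ["prev:output", "x"]},
        },
    },
}
```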

layer_class: Optional[str] = 'rec_unstack'[source]
classmethod get_out_data_from_opts(name, sources, network, axis=None, declare_rec_time=False, **kwargs)[source]
Return type:

Data

class returnn.tf.layers.rec.BaseChoiceLayer(beam_size, search=<class 'returnn.util.basic.NotSpecified'>, add_to_beam_scores=<class 'returnn.util.basic.NotSpecified'>, **kwargs)[source]

This is a base-class for any layer which defines a new search choice, i.e. which defines self.search_choices.

Parameters:
  • beam_size (int|None) – the outgoing beam size. i.e. our output will be (batch * beam_size, …)

  • search (NotSpecified|bool) – whether to perform search, or use the ground truth (target option). If not specified, it will depend on network.search_flag.

  • add_to_beam_scores (NotSpecified|bool) – whether to add the scores to the beam scores. This will be done with search obviously (not supported to not do it). Without search, we can still add the scores of the ground-truth labels to the beam. By default, this is derived from search or network.search_flag. So with enabled net search flag, even when search is disabled here, it will add the scores.

classmethod cls_get_search_beam_size(network, beam_size, search=<class 'returnn.util.basic.NotSpecified'>, add_to_beam_scores=<class 'returnn.util.basic.NotSpecified'>, sources=(), _src_common_search_choices=None, **kwargs)[source]
Returns:

when this layer provides an own choice (search_choices attrib is set), then the corresponding beam size

Return type:

int|None

classmethod get_rec_initial_extra_outputs(network, beam_size, search=<class 'returnn.util.basic.NotSpecified'>, add_to_beam_scores=<class 'returnn.util.basic.NotSpecified'>, **kwargs)[source]
Return type:

dict[str,tf.Tensor]

classmethod get_rec_initial_extra_outputs_shape_invariants(**kwargs)[source]
Return type:

dict[str,tf.TensorShape]

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace

  • network (returnn.tf.network.TFNetwork)

  • get_layer (((str) -> LayerBase)) – function to get or construct another layer

class returnn.tf.layers.rec.ChoiceLayer(beam_size, keep_beams=False, search=<class 'returnn.util.basic.NotSpecified'>, add_to_beam_scores=<class 'returnn.util.basic.NotSpecified'>, input_type='prob', prob_scale=1.0, base_beam_score_scale=1.0, random_sample_scale=0.0, length_normalization=True, length_normalization_exponent=1.0, custom_score_combine=None, source_beam_sizes=None, scheduled_sampling=False, cheating=False, explicit_search_sources=None, **kwargs)[source]

This layer represents a choice to be made in search during inference, such as choosing the top-k outputs from a log-softmax for beam search. During training, this layer can return the true label. This is supposed to be used inside the rec layer. This can be extended in various ways.

We present the scores in +log space, and we will add them up along the path. Assume that we get input (batch,dim) from a (log-)softmax. Assume that each batch is already a choice via search. In search with a beam size of N, we would output sparse (batch=N,) and scores for each.
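The score accumulation can be illustrated with a small NumPy sketch of one search step for a single source. This is not the RETURNN implementation; the function name beam_step and the tensor layout (batch, beam_in, dim) are assumptions made for readability.

```python
import numpy as np


def beam_step(base_scores, log_probs, beam_size):
    """
    One search step: add the per-label log scores to the accumulated beam scores
    and keep the best `beam_size` (beam, label) combinations per batch entry.

    base_scores: (batch, beam_in), log_probs: (batch, beam_in, dim)
    Returns (scores, src_beams, labels), each of shape (batch, beam_size).
    """
    n_batch, beam_in, dim = log_probs.shape
    combined = base_scores[:, :, None] + log_probs  # scores add up along the path
    flat = combined.reshape(n_batch, beam_in * dim)
    top = np.argsort(-flat, axis=1)[:, :beam_size]  # indices of the highest scores
    scores = np.take_along_axis(flat, top, axis=1)
    return scores, top // dim, top % dim


log_probs = np.log(np.array([[[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]]))  # (batch=1, beam_in=2, dim=3)
base_scores = np.log(np.array([[0.6, 0.4]]))
scores, src_beams, labels = beam_step(base_scores, log_probs, beam_size=2)
```

In this toy input, the two surviving hypotheses are label 0 continued from beam 0 (probability 0.42) and label 2 continued from beam 1 (probability 0.32).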

In case of multiple sources, this layer computes the top-k combinations of choices. The score of such a combination is determined by adding up the (log-space) scores of the choices for the individual sources. In this case, the ‘target’ parameter of the layer has to be set to a list of targets corresponding to the sources respectively. Because computing all possible combinations of source scores is costly, the sources are pruned beforehand using the beam sizes set by the ‘source_beam_sizes’ parameter. The choices made for the different sources can be accessed via the sublayers ‘<choice layer name>/out_0’, ‘<choice layer name>/out_1’ and so on. Note that the way scores are combined assumes the sources to be independent. If you want to model a dependency, use separate ChoiceLayers and let the input of one depend on the output of the other.

Parameters:
  • beam_size (int) – the outgoing beam size. i.e. our output will be (batch * beam_size, …)

  • keep_beams (bool) – specifies that we keep the beam_in entries, i.e. we just expand, i.e. we just search on the dim. beam_size must be a multiple of beam_in.

  • search (NotSpecified|bool) – whether to perform search, or use the ground truth (target option). If not specified, it will depend on network.search_flag.

  • add_to_beam_scores (NotSpecified|bool) – whether to add the scores to the beam scores. This will be done with search obviously (not supported to not do it). Without search, we can still add the scores of the ground-truth labels to the beam. By default, this is derived from search or network.search_flag. So with enabled net search flag, even when search is disabled here, it will add the scores.

  • input_type (str) – “prob”, “log_prob” or “logits”, whether the input is in probability space, log-space, etc. or “regression”, if it is a prediction of the data as-is. If there are several inputs, same format for all is assumed.

  • prob_scale (float) – factor for prob (score in +log space from source)

  • base_beam_score_scale (float) – factor for beam base score (i.e. prev prob scores)

  • random_sample_scale (float) – if >0, will add Gumbel scores. you might want to set base_beam_score_scale=0

  • length_normalization (bool) – evaluates score_t/len in search

  • source_beam_sizes (list[int]|None) – If there are several sources, they are pruned with these beam sizes before combination. If None, ‘beam_size’ is used for all sources. Has to have same length as number of sources.

  • scheduled_sampling (dict|None)

  • cheating (bool|str) – if True, will always add the true target in the beam. if “exclusive”, enables cheating_exclusive. see returnn.tf.util.basic.beam_search().

  • explicit_search_sources (list[LayerBase]|None) – will mark it as an additional dependency. You might use these also in custom_score_combine.

  • custom_score_combine (callable|None)
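A hedged sketch of a small decoder using a choice layer inside a rec layer follows. The layer names, sizes, the target name "classes", the use of label 0 as end-of-sequence, and the "softmax" and "compare" layer classes are assumptions for illustration and are not described in this section.

```python
# Hypothetical network dict fragment; names, sizes and the EOS label are illustrative only.
beam_size = 12  # assumed beam size

network = {
    "output": {
        "class": "rec",
        "from": [],
        "target": "classes",
        "max_seq_len": 100,
        "unit": {
            "embed": {"class": "linear", "activation": None, "n_out": 128, "from": "prev:output"},
            "s": {"class": "rnn_cell", "unit": "LSTMBlock", "n_out": 256, "from": "embed"},
            "output_prob": {"class": "softmax", "from": "s", "target": "classes", "loss": "ce"},
            # In search, picks the top-k labels; in training, returns the true label.
            "output": {"class": "choice", "from": "output_prob", "target": "classes",
                       "beam_size": beam_size, "input_type": "prob", "initial_output": 0},
            # Assumed end condition: the sequence ends once label 0 is chosen.
            "end": {"class": "compare", "from": "output", "value": 0},
        },
    },
    # Selects the best beam after search.
    "decision": {"class": "decide", "from": "output"},
}
```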

layer_class: Optional[str] = 'choice'[source]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace

  • network (returnn.tf.network.TFNetwork)

  • get_layer (((str) -> LayerBase)) – function to get or construct another layer

classmethod get_out_data_from_opts(name, sources, target, network, beam_size, search=<class 'returnn.util.basic.NotSpecified'>, scheduled_sampling=False, cheating=False, **kwargs)[source]
Return type:

Data

get_sub_layer(layer_name)[source]

Used to get outputs in case of multiple targets. For all targets we create a sub-layer that can be referred to as “self.name + ‘/out_’ + index” (e.g. output/out_0). These sub-layers can then be used as input to other layers, e.g. "output_0": {"class": "copy", "from": ["output/out_0"]}.

Parameters:

layer_name (str) – name of the sub_layer (e.g. ‘out_0’)

Returns:

internal layer that outputs labels for the target corresponding to layer_name

Return type:

InternalLayer|None

classmethod get_available_sub_layer_names(parent_layer_kwargs)[source]
Parameters:

parent_layer_kwargs (dict[str])

Return type:

list[str]

classmethod get_sub_layer_out_data_from_opts(layer_name, parent_layer_kwargs)[source]
Parameters:
  • layer_name (str) – name of the sub_layer (e.g. ‘out_0’), see self.get_sub_layer()

  • parent_layer_kwargs (dict[str]) – kwargs for the parent layer

Returns:

Data template, class type of sub-layer, layer opts (transformed)

Return type:

(Data, type, dict[str])|None

get_dep_layers()[source]
Return type:

list[LayerBase]

classmethod get_rec_initial_output(batch_dim, name, output, rec_layer, initial_output=None, **kwargs)[source]
Parameters:
  • batch_dim (tf.Tensor) – including beam size in beam search

  • name (str) – layer name

  • output (Data) – template

  • rec_layer (returnn.tf.layers.rec.RecLayer)

  • initial_output (str|float|int|tf.Tensor|None)

Return type:

tf.Tensor

post_process_final_rec_vars_outputs(rec_vars_outputs, seq_len)[source]
Parameters:
  • rec_vars_outputs (dict[str,tf.Tensor])

  • seq_len (tf.Tensor) – shape (batch,)

Return type:

dict[str,tf.Tensor]

class returnn.tf.layers.rec.DecideLayer(length_normalization=False, **kwargs)[source]

This is the counterpart to the choice layer. This only has an effect in search mode. E.g. assume that the input is of shape (batch * beam, time, dim) and has search_sources set. Then this will output (batch, time, dim) where the beam with the highest score is selected. Thus, this makes a decision based on the scores. It will convert the data to batch-major mode.

Parameters:

length_normalization (bool) – performed on the beam scores
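The selection itself can be sketched in plain NumPy as follows. This is an illustration only, not the RETURNN implementation: the function name decide_best_beam is hypothetical, it assumes that the beams of one batch entry are adjacent in the merged (batch * beam) axis, and it leaves out length normalization.

```python
import numpy as np


def decide_best_beam(values, beam_scores):
    """
    values: (batch * beam, time, dim), beam_scores: (batch, beam).
    Returns (batch, time, dim), keeping the highest-scoring beam per batch entry.
    """
    n_batch, beam = beam_scores.shape
    per_beam = values.reshape(n_batch, beam, *values.shape[1:])  # (batch, beam, time, dim)
    best = np.argmax(beam_scores, axis=1)  # (batch,)
    return per_beam[np.arange(n_batch), best]


values = np.arange(2 * 3 * 4 * 1).reshape(2 * 3, 4, 1)  # batch=2, beam=3, time=4, dim=1
beam_scores = np.array([[-1.0, -0.5, -2.0], [-0.1, -3.0, -0.2]])
decided = decide_best_beam(values, beam_scores)
```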

layer_class: Optional[str] = 'decide'[source]
classmethod cls_get_search_beam_size(sources, **kwargs)[source]
Parameters:

sources (list[LayerBase])

Return type:

int|None

classmethod decide(src, output=None, owner=None, name=None, length_normalization=False)[source]
Parameters:
  • src (LayerBase) – with search_choices set. e.g. input of shape (batch * beam, time, dim)

  • output (Data|None)

  • owner (LayerBase|None)

  • name (str|None)

  • length_normalization (bool) – performed on the beam scores

Returns:

best beam selected from input, e.g. shape (batch, time, dim)

Return type:

(Data, SearchChoices|None)

classmethod get_out_data_from_opts(name, sources, network, **kwargs)[source]
Return type:

Data

class returnn.tf.layers.rec.DecideKeepBeamLayer(sources, **kwargs)[source]

This just marks the search choices as decided, but does not change them (in contrast to DecideLayer). You can use this to get out some values as-is, without having them resolved to the final choices.

For internal usage only.

Parameters:

sources (list[LayerBase])

layer_class: Optional[str] = 'decide_keep_beam'[source]
classmethod cls_get_search_beam_size(sources, network, **kwargs)[source]
Return type:

int|None

classmethod get_rec_initial_extra_outputs(sources, **kwargs)[source]
Parameters:

sources (list[LayerBase])

Return type:

dict[str,tf.Tensor]

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace

  • network (returnn.tf.network.TFNetwork)

  • get_layer (((str) -> LayerBase)) – function to get or construct another layer

classmethod get_out_data_from_opts(name, sources, network, **kwargs)[source]
Return type:

Data

class returnn.tf.layers.rec.ChoiceGetBeamScoresLayer(**kwargs)[source]

Gets beam scores from SearchChoices. This requires that the source has search choices.

Note

This layer might be deprecated in the future.

Usually the arguments, when specified in the network dict, go through transform_config_dict() before they are passed here. See TFNetwork.construct_from_dict().

Parameters:
  • name (str)

  • network (returnn.tf.network.TFNetwork)

  • output (Data) – Set a specific output instead of using get_out_data_from_opts()

  • n_out (NotSpecified|None|int) – output dim

  • out_dim (returnn.tensor.Dim|None) – output feature dim tag

  • out_type (dict[str]) – kwargs for Data class. more explicit than n_out.

  • out_shape (set[returnn.tensor.Dim|returnn.tf.util.data._MarkedDim]|tuple|list|None) – verifies the output shape (dim tags). See Data.verify_out_shape().

  • sources (list[LayerBase]) – via self.transform_config_dict()

  • in_dim (returnn.tensor.Dim|None) – input feature dim tag

  • target (str|list[str]|None) – if some loss is set, this is the target data-key, i.e. network.extern_data.get_data(target). alternatively, this also can be a layer name.

  • _target_layers (dict[str,LayerBase]|None) – if target.startswith(“layer:”), then this is target -> layer

  • size_target (str|None) – like target but this is only used to set our output size in case of training

  • loss (Loss|None) – via transform_config_dict(). Every layer can have one loss (of type Loss), or no loss. In the net dict, it is specified as a string. In TFNetwork, all losses from all layers will be collected. That is what TFUpdater.Updater will use for training.

  • reuse_params (ReuseParams|None) – if given, will optionally reuse the params. see self.var_creation_scope(). See also the name_scope option as an alternative.

  • name_scope (str|None) – If set, uses this custom (relative) name scope. If it starts with a “/”, it will be the absolute name scope. It should not end with a “/”. It can be empty, in which case it will not consume a new name scope. This can also be used for parameter sharing. The default is the layer name in most cases, but this logic is in get_absolute_name_scope_prefix() and TFNetwork.layer_creation_scope().

  • param_device (str|None) – e.g. “CPU”, etc. any valid name for tf.device. see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/util/device_name_utils.h

  • L2 (float|None) – for constraints

  • darc1 (float|None) – for constraints. see Generalization in Deep Learning, https://arxiv.org/abs/1710.05468

  • spatial_smoothing (float|None) – see returnn.tf.util.basic.spatial_smoothing_energy()

  • param_variational_noise (float|None) – adds variational noise to the params during training

  • param_dropout (float|None) – dropout on params (weight dropout) during training

  • param_dropout_min_ndim (int|None) – if param dropout is enabled, only use it for params whose ndim >= this. E.g. it might make sense to disable it for bias params or scalars, so set param_dropout_min_ndim=2.

  • updater_opts (dict[str]|None) – accepts similar opts as TFUpdater, e.g. “optimizer”, “learning_rate”, …

  • is_output_layer (bool|None) – triggers the construction of this layer in the root net. Inside a RecLayer, it triggers the explicit accumulation of all frames. Also see the need_last option.

  • only_on_eval (bool) – if True, this layer will only be calculated in eval

  • only_on_search (bool) – if True, this layer will only be calculated when search is done

  • copy_output_loss_from_source_idx (int|None) – if set, will copy output_loss from this source

  • batch_norm (bool|dict) – see self.batch_norm()

  • initial_output (str|float) – used for recurrent layer, see self.get_rec_initial_output()

  • state – explicitly defines the rec state. initial_state would define the initial state (in the first frame)

  • need_last (bool) – Inside RecLayer, make sure that we can access the last frame. Similar to is_output_layer, but this is specifically about the last frame, i.e. it does not trigger accumulation.

  • rec_previous_layer (LayerBase|None) – via the recurrent layer, layer (template) which represents the past of us. You would not explicitly set this in a config. This is set automatically and internally via RecLayer.

  • encapsulate (bool) – mostly relevant for SubnetworkLayer and similar: If True, all sub layers will be created, and covered in functions like get_rec_initial_extra_outputs(), and the logic in cls_get_sub_network() will not be used. If False, the logic in cls_get_sub_network() will be used.

  • collocate_with (list[str]|None) – in the rec layer, collocate with the specified other layers

  • trainable (bool) – whether the parameters of this layer will be trained. Default is True. However, if this is inside a subnetwork, all the parent layers must be set to trainable, otherwise the parameters will not be trainable.

  • custom_param_importer (str|callable|None) – used by set_param_values_by_dict()

  • register_as_extern_data (str|None) – registers output in network.extern_data

  • control_dependencies_on_output (None|((LayerBase)->list[tf.Operation])) – This is mostly to perform some checks after the layer output has been computed, before the layer output is used anywhere else. There is also the IdentityLayer with the option control_dependencies.

  • debug_print_layer_output (None|bool|dict[str]) – same as global config option but per layer

  • _name (str) – just for internal construction, should be the same as name

  • _network (returnn.tf.network.TFNetwork) – just for internal construction, should be the same as network

  • _src_common_search_choices (None|SearchChoices) – set via SearchChoices.translate_to_common_search_beam()

layer_class: Optional[str] = 'choice_get_beam_scores'[source]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace

  • network (returnn.tf.network.TFNetwork)

  • get_layer (((str) -> LayerBase)) – function to get or construct another layer

classmethod get_out_data_from_opts(name, sources, **kwargs)[source]
Parameters:
Return type:

Data

class returnn.tf.layers.rec.ChoiceGetSrcBeamsLayer(**kwargs)[source]

Gets source beam indices from SearchChoices. This requires that the source has search choices.

Usually the arguments, when specified in the network dict, are going through transform_config_dict(), before they are passed to here. See TFNetwork.construct_from_dict().

Parameters:
  • name (str)

  • network (returnn.tf.network.TFNetwork)

  • output (Data) – Set a specific output instead of using get_out_data_from_opts()

  • n_out (NotSpecified|None|int) – output dim

  • out_dim (returnn.tensor.Dim|None) – output feature dim tag

  • out_type (dict[str]) – kwargs for Data class. more explicit than n_out.

  • out_shape (set[returnn.tensor.Dim|returnn.tf.util.data._MarkedDim]|tuple|list|None) – verifies the output shape (dim tags). See Data.verify_out_shape().

  • sources (list[LayerBase]) – via self.transform_config_dict()

  • in_dim (returnn.tensor.Dim|None) – input feature dim tag

  • target (str|list[str]|None) – if some loss is set, this is the target data-key, i.e. network.extern_data.get_data(target). alternatively, this also can be a layer name.

  • _target_layers (dict[str,LayerBase]|None) – if target.startswith(“layer:”), then this is target -> layer

  • size_target (str|None) – like target but this is only used to set our output size in case of training

  • loss (Loss|None) – via transform_config_dict(). Every layer can have one loss (of type Loss), or no loss. In the net dict, it is specified as a string. In TFNetwork, all losses from all layers will be collected. That is what TFUpdater.Updater will use for training.

  • reuse_params (ReuseParams|None) – if given, will optionally reuse the params. See self.var_creation_scope(). See also the name_scope option as an alternative.

  • name_scope (str|None) – If set, uses this custom (relative) name scope. If it starts with a “/”, it will be the absolute name scope. It should not end with a “/”. It can be empty, in which case it will not consume a new name scope. This can also be used for parameter sharing. The default is the layer name in most cases, but this logic is in get_absolute_name_scope_prefix() and TFNetwork.layer_creation_scope().

  • param_device (str|None) – e.g. “CPU”, etc. any valid name for tf.device. see https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/util/device_name_utils.h

  • L2 (float|None) – for constraints

  • darc1 (float|None) – for constraints. see Generalization in Deep Learning, https://arxiv.org/abs/1710.05468

  • spatial_smoothing (float|None) – see returnn.tf.util.basic.spatial_smoothing_energy()

  • param_variational_noise (float|None) – adds variational noise to the params during training

  • param_dropout (float|None) – dropout on params (weight dropout) during training

  • param_dropout_min_ndim (int|None) – if param dropout is enabled, only use it for params whose ndim >= this. E.g. it might make sense to disable it for bias params or scalars, so set param_dropout_min_ndim=2.

  • updater_opts (dict[str]|None) – accepts similar opts as TFUpdater, e.g. “optimizer”, “learning_rate”, …

  • is_output_layer (bool|None) – triggers the construction of this layer in the root net. Inside a RecLayer, it triggers the explicit accumulation of all frames. Also see the need_last option.

  • only_on_eval (bool) – if True, this layer will only be calculated in eval

  • only_on_search (bool) – if True, this layer will only be calculated when search is done

  • copy_output_loss_from_source_idx (int|None) – if set, will copy output_loss from this source

  • batch_norm (bool|dict) – see self.batch_norm()

  • initial_output (str|float) – used for recurrent layer, see self.get_rec_initial_output()

  • state – explicitly defines the rec state. initial_state would define the initial state (in the first frame)

  • need_last (bool) – Inside RecLayer, make sure that we can access the last frame. Similar to is_output_layer, but this is specifically about the last frame, i.e. it does not trigger accumulation.

  • rec_previous_layer (LayerBase|None) – via the recurrent layer, layer (template) which represents the past of us. You would not explicitly set this in a config. This is set automatically and internally via RecLayer.

  • encapsulate (bool) – mostly relevant for SubnetworkLayer and similar: If True, all sub layers will be created, and covered in functions like get_rec_initial_extra_outputs(), and the logic in cls_get_sub_network() will not be used. If False, the logic in cls_get_sub_network() will be used.

  • collocate_with (list[str]|None) – in the rec layer, collocate with the specified other layers

  • trainable (bool) – whether the parameters of this layer will be trained. Default is True. However, if this is inside a subnetwork, all the parent layers must be set to trainable, otherwise the parameters will not be trainable.

  • custom_param_importer (str|callable|None) – used by set_param_values_by_dict()

  • register_as_extern_data (str|None) – registers output in network.extern_data

  • control_dependencies_on_output (None|((LayerBase)->list[tf.Operation])) – This is mostly to perform some checks after the layer output has been computed, before the layer output is used anywhere else. There is also the IdentityLayer with the option control_dependencies.

  • debug_print_layer_output (None|bool|dict[str]) – same as global config option but per layer

  • _name (str) – just for internal construction, should be the same as name

  • _network (returnn.tf.network.TFNetwork) – just for internal construction, should be the same as network

  • _src_common_search_choices (None|SearchChoices) – set via SearchChoices.translate_to_common_search_beam()

layer_class: Optional[str] = 'choice_get_src_beams'[source]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace

  • network (returnn.tf.network.TFNetwork)

  • get_layer (((str) -> LayerBase)) – function to get or construct another layer

classmethod get_out_data_from_opts(name, sources, **kwargs)[source]
Parameters:
Return type:

Data

class returnn.tf.layers.rec.SplitBatchBeamLayer(beam_dim=None, **kwargs)[source]

Splits the batch dimension of the input, which includes a beam, into (batch,beam).

Like DecideLayer, this removes the beam.

Parameters:

beam_dim (Dim|None)
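
As a rough illustration of the reshaping idea, the following pure-Python sketch splits a merged batch-with-beam axis into (batch, beam). The helper name split_batch_beam and the batch-major layout (flat index = b * beam_size + k) are assumptions made for this example; they are not taken from the layer implementation.

```python
def split_batch_beam(flat, beam_size):
    """Reshape a flat list of length batch*beam into [batch][beam].

    Assumption: entries are stored batch-major, i.e. flat index = b * beam_size + k.
    """
    if len(flat) % beam_size != 0:
        raise ValueError("length must be a multiple of beam_size")
    return [flat[i:i + beam_size] for i in range(0, len(flat), beam_size)]


# batch=2 and beam=3, merged into one axis of size 6
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
per_seq = split_batch_beam(scores, beam_size=3)
```

After the split, each entry of per_seq holds the beam hypotheses of one sequence, so the batch axis itself no longer carries a beam.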

layer_class: Optional[str] = 'split_batch_beam'[source]
classmethod cls_get_search_beam_size(**kwargs)[source]
Return type:

int|None

classmethod get_out_data_from_opts(name, network, sources, beam_dim=None, **kwargs)[source]
Parameters:
Return type:

Data

class returnn.tf.layers.rec.AttentionBaseLayer(base, **kwargs)[source]

This is the base class for attention. This layer would get constructed in the context of one single decoder step. We get the whole encoder output over all encoder frames (the base), e.g. (batch,enc_time,enc_dim), and some current decoder context, e.g. (batch,dec_att_dim), and we are supposed to return the attention output, e.g. (batch,att_dim).

Some sources:

  • Bahdanau, Bengio, Montreal, Neural Machine Translation by Jointly Learning to Align and Translate, 2015

Parameters:

base (LayerBase) – encoder output to attend on

get_dep_layers()[source]
Return type:

list[LayerBase]

get_base_weights()[source]

We can formulate most attentions as some weighted sum over the base time-axis.

Returns:

the weighting of shape (batch, base_time), in case it is defined

Return type:

tf.Tensor|None
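
The weighted-sum formulation can be sketched for a single batch entry in plain Python. The function name attend and the nested-list layout are assumptions for illustration only; the layer itself works on tensors with a batch axis.

```python
def attend(base, weights):
    """Weighted sum over the base time axis for one batch entry.

    base: [enc_time][enc_dim], weights: [enc_time] -> result: [enc_dim]
    """
    if len(base) != len(weights):
        raise ValueError("base and weights must have the same time length")
    dim = len(base[0])
    return [sum(w * frame[d] for w, frame in zip(weights, base)) for d in range(dim)]


base = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0]]
weights = [0.5, 0.5, 0.0]  # the third encoder frame is ignored
context = attend(base, weights)
```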

get_base_weight_last_frame()[source]

From the base weights (see self.get_base_weights(), must return not None) takes the weighting of the last frame in the time-axis (according to sequence lengths).

Returns:

shape (batch,) -> float (number 0..1)

Return type:

tf.Tensor
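
A minimal sketch of what "last frame according to sequence lengths" means, using nested lists instead of tensors. The helper name is hypothetical.

```python
def base_weight_last_frame(weights, seq_lens):
    """Pick the weight at index seq_len - 1 for every batch entry.

    weights: [batch][base_time], seq_lens: [batch] -> result: [batch]
    """
    return [row[n - 1] for row, n in zip(weights, seq_lens)]


weights = [[0.1, 0.2, 0.7], [0.4, 0.6, 0.0]]  # the second sequence is padded
seq_lens = [3, 2]
last = base_weight_last_frame(weights, seq_lens)
```

Note that for the padded sequence the last valid frame is at index 1, not at the end of the row.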

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
classmethod get_out_data_from_opts(name, base, n_out=<class 'returnn.util.basic.NotSpecified'>, sources=(), **kwargs)[source]
Parameters:
Return type:

Data

class returnn.tf.layers.rec.GlobalAttentionContextBaseLayer(base_ctx, **kwargs)[source]

Base class for other attention types, which use a global context.

Parameters:
  • base (LayerBase) – encoder output to attend on

  • base_ctx (LayerBase) – encoder output used to calculate the attention weights

get_dep_layers()[source]
Return type:

list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
class returnn.tf.layers.rec.GenericAttentionLayer(weights, auto_squeeze=True, **kwargs)[source]

The weighting for the base is specified explicitly here. This can e.g. be used together with SoftmaxOverSpatialLayer. Note that we do not do any masking here. E.g. SoftmaxOverSpatialLayer does that.

Note that DotLayer is similar, just using a different terminology. Reduce axis: weights: time-axis; base: time-axis.

Note that if the last layer was SoftmaxOverSpatialLayer, we should use the same time-axis. Also we should do a check whether these time axes really match.

Common axes (should match): batch-axis, all from base excluding base feature axis and excluding time axis. Keep axes: base: feature axis; weights: all remaining, e.g. extra time.

Parameters:
  • base (LayerBase) – encoder output to attend on. (B, enc-time)|(enc-time, B) + (…) + (n_out,)

  • weights (LayerBase) – attention weights. ((B, enc-time)|(enc-time, B)) + (1,)|()

  • auto_squeeze (bool) – auto-squeeze any weight-axes with dim=1 away
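
A possible net dict fragment that combines SoftmaxOverSpatialLayer with this layer might look as follows. The layer names "energy", "att_weights", "att" and "encoder" are placeholders chosen for this sketch, and "base:encoder" assumes the fragment lives inside a RecLayer subnetwork with an "encoder" layer in the parent net.

```python
# Fragment of a RecLayer subnetwork dict (sketch, not a complete network).
# "energy" is a hypothetical layer producing unnormalized attention energies.
att_fragment = {
    # normalizes (and masks) the energies over the encoder time axis
    "att_weights": {"class": "softmax_over_spatial", "from": "energy"},
    # weighted sum of the encoder frames using those weights
    "att": {
        "class": "generic_attention",
        "weights": "att_weights",
        "base": "base:encoder",
        "auto_squeeze": True,
    },
}
```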

layer_class: Optional[str] = 'generic_attention'[source]
recurrent = True[source]
get_dep_layers()[source]
Return type:

list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
classmethod get_out_data_from_opts(base, weights, auto_squeeze=True, sources=(), **kwargs)[source]
Parameters:
  • base (LayerBase)

  • weights (LayerBase)

  • auto_squeeze (bool)

  • sources (list[LayerBase]) – ignored, should be empty (checked in __init__)

Return type:

Data

class returnn.tf.layers.rec.DotAttentionLayer(energy_factor=None, **kwargs)[source]

Classic global attention: Dot-product as similarity measure between base_ctx and source.

Parameters:
  • base (LayerBase) – encoder output to attend on. defines output-dim

  • base_ctx (LayerBase) – encoder output used to calculate the attention weights, combined with input-data. dim must be equal to input-data

  • energy_factor (float|None) – the energy will be scaled by this factor. This is like a temperature for the softmax. In Attention-is-all-you-need, this is set to 1/sqrt(base_ctx.dim).
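
The computation can be sketched for one batch entry in plain Python. This is an illustration under simplifying assumptions (no batch axis, no masking of padded frames), and dot_attention is a hypothetical helper name, not the layer's code.

```python
import math


def dot_attention(base, base_ctx, source, energy_factor=None):
    """base: [enc_time][out_dim], base_ctx: [enc_time][dim], source: [dim]."""
    # dot product between each context frame and the decoder source
    energies = [sum(c * s for c, s in zip(ctx, source)) for ctx in base_ctx]
    if energy_factor is not None:
        energies = [e * energy_factor for e in energies]
    # numerically stable softmax over the encoder time axis
    m = max(energies)
    exps = [math.exp(e - m) for e in energies]
    z = sum(exps)
    weights = [x / z for x in exps]
    # weighted sum over the base frames
    dim = len(base[0])
    out = [sum(w * frame[d] for w, frame in zip(weights, base)) for d in range(dim)]
    return out, weights


base_ctx = [[1.0, 0.0], [0.0, 1.0]]
base = [[10.0], [20.0]]
source = [2.0, 2.0]
# scaling as in Attention-is-all-you-need: 1/sqrt(base_ctx.dim)
out, weights = dot_attention(base, base_ctx, source, energy_factor=1.0 / math.sqrt(2))
```

Both encoder frames get the same energy here, so the weights are uniform and the output is the mean of the base frames.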

layer_class: Optional[str] = 'dot_attention'[source]
class returnn.tf.layers.rec.ConcatAttentionLayer(**kwargs)[source]

Additive attention / tanh-concat attention as similarity measure between base_ctx and source. This is used by Montreal, whereas Stanford compared this to the dot-attention. The concat-attention is maybe more standard for machine translation at the moment.

Parameters:
  • base (LayerBase) – encoder output to attend on

  • base_ctx (LayerBase) – encoder output used to calculate the attention weights

layer_class: Optional[str] = 'concat_attention'[source]
class returnn.tf.layers.rec.GaussWindowAttentionLayer(window_size, std=1.0, inner_size=None, inner_size_step=0.5, **kwargs)[source]

Interprets the incoming source as the location (float32, shape (batch,)) and returns a gauss-window-weighting of the base around the location. The window size is fixed (TODO: but the variance can optionally be dynamic).

Parameters:
  • window_size (int) – the window size where the Gaussian window will be applied on the base

  • std (float) – standard deviation for Gauss

  • inner_size (int|None) – if given, the output will have an additional dimension of this size, where t is shifted by +/- inner_size_step around. e.g. [t-1,t-0.5,t,t+0.5,t+1] would be the locations with inner_size=5 and inner_size_step=0.5.

  • inner_size_step (float) – see inner_size above
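
The inner_size locations from the example above, together with a Gaussian weighting, can be sketched like this. The helper names and the unnormalized Gaussian form exp(-0.5 * ((p - location) / std)^2) are assumptions for illustration; the layer may normalize or clip differently.

```python
import math


def inner_locations(t, inner_size, inner_size_step):
    """Locations shifted symmetrically around t, e.g. [t-1, t-0.5, t, t+0.5, t+1]."""
    half = (inner_size - 1) / 2.0
    return [t + (i - half) * inner_size_step for i in range(inner_size)]


def gauss_weights(location, positions, std=1.0):
    """Unnormalized Gaussian weight of each base position around the location."""
    return [math.exp(-0.5 * ((p - location) / std) ** 2) for p in positions]


locs = inner_locations(t=3.0, inner_size=5, inner_size_step=0.5)
window = gauss_weights(location=3.0, positions=[2, 3, 4], std=1.0)
```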

layer_class: Optional[str] = 'gauss_window_attention'[source]
classmethod get_out_data_from_opts(inner_size=None, **kwargs)[source]
Parameters:

inner_size (int|None)

Return type:

Data

class returnn.tf.layers.rec.SelfAttentionLayer(num_heads, total_key_dim, key_shift=None, forward_weights_init='glorot_uniform', attention_dropout=0.0, attention_left_only=False, initial_state=None, restrict_state_to_last_seq=False, state_var_lengths=None, **kwargs)[source]

Applies self-attention on the input. I.e., with input x, it will basically calculate

att(Q x, K x, V x),

where att is multi-head dot-attention for now, Q, K, V are matrices. The attention will be over the time-dimension. If there is no time-dimension, we expect to be inside a RecLayer; also, this is only valid with attention_left_only=True.

See also dot_product_attention here:

https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/layers/common_attention.py

Parameters:
  • num_heads (int)

  • total_key_dim (int) – i.e. key_dim == total_key_dim // num_heads

  • key_shift (LayerBase|None) – additive term to the key. can be used for relative positional encoding. Should be of shape (num_queries,num_keys,key_dim), currently without batch-dimension. I.e. that should be shape (1,t,key_dim) inside rec-layer or (T,T,key_dim) outside.

  • forward_weights_init (str) – see returnn.tf.util.basic.get_initializer()

  • attention_dropout (float)

  • attention_left_only (bool) – will mask out the future. see Attention is all you need.

  • initial_state (str|float|int|None) – see RnnCellLayer.get_rec_initial_state_inner().

  • restrict_state_to_last_seq (bool) – see code comment below

  • state_var_lengths (None|tf.Tensor|()->tf.Tensor) – if passed, a Tensor containing the number of keys in the state_var for each batch-entry, used for decoding in RASR.
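
A net dict entry for this layer could look roughly like the sketch below. The concrete sizes and the source name "data" are placeholders; only option names from the signature above are used.

```python
num_heads = 8
total_key_dim = 512

# Sketch of one layer entry in a net dict (values are illustrative).
self_att = {
    "class": "self_attention",
    "from": "data",
    "num_heads": num_heads,
    "total_key_dim": total_key_dim,
    "n_out": 512,
    "attention_left_only": True,  # mask out the future
    "attention_dropout": 0.1,
}

# per-head key dimension, as stated for total_key_dim
key_dim = total_key_dim // num_heads
```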

layer_class: Optional[str] = 'self_attention'[source]
recurrent = True[source]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
classmethod get_out_data_from_opts(name, sources, n_out=<class 'returnn.util.basic.NotSpecified'>, out_dim=<class 'returnn.util.basic.NotSpecified'>, **kwargs)[source]
Parameters:
Return type:

Data

classmethod get_rec_initial_extra_outputs(batch_dim, rec_layer, network, num_heads, total_key_dim, name, out_dim=<class 'returnn.util.basic.NotSpecified'>, n_out=<class 'returnn.util.basic.NotSpecified'>, initial_state=None, sources=(), **kwargs)[source]
Parameters:
Return type:

dict[str, tf.Tensor]

classmethod get_rec_initial_extra_outputs_shape_invariants(rec_layer, sources, network, num_heads, total_key_dim, out_dim=<class 'returnn.util.basic.NotSpecified'>, n_out=<class 'returnn.util.basic.NotSpecified'>, **kwargs)[source]
Parameters:
Return type:

dict[str, tf.TensorShape]

post_process_final_rec_vars_outputs(rec_vars_outputs, seq_len)[source]
Parameters:
  • rec_vars_outputs (dict[str,tf.Tensor])

  • seq_len (tf.Tensor) – shape (batch,)

Return type:

dict[str,tf.Tensor]

class returnn.tf.layers.rec.PositionalEncodingLayer(axis=<class 'returnn.util.basic.NotSpecified'>, add_to_input=False, constant=-1, offset=None, **kwargs)[source]

Provides positional encoding in the form of (batch, time, n_out) or (time, batch, n_out) where n_out is the number of channels, if it is run outside a RecLayer, and (batch, n_out) or (n_out, batch) if run inside a RecLayer, where it will depend on the current time frame.

Assumes one source input with a time dimension if outside a RecLayer. With add_to_input, it will calculate x + input, and the output shape is the same as the input.

The positional encoding is the same as in Tensor2Tensor. See returnn.tf.util.basic.get_positional_encoding().

Parameters:
  • axis (Dim|str|NotSpecified) – if not specified, check for time_dim_axis, otherwise assume rec step

  • add_to_input (bool) – will add the signal to the input

  • constant (int) – if positive, always output the corresponding positional encoding.

  • offset (None|LayerBase) – Specify the offset to be added to positions. Expect shape (batch, time) or (batch,).
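
For orientation, here is a pure-Python sketch of a Tensor2Tensor-style sinusoidal timing signal for a single position. It assumes an even n_out and the common timescale range 1 to 10000; returnn.tf.util.basic.get_positional_encoding() may differ in details, so treat this as an approximation rather than the library's code.

```python
import math


def positional_encoding(position, n_out, min_timescale=1.0, max_timescale=1.0e4):
    """Return [sin(...) for each timescale] + [cos(...) for each timescale]."""
    num_timescales = n_out // 2
    log_inc = math.log(max_timescale / min_timescale) / max(num_timescales - 1, 1)
    inv_timescales = [min_timescale * math.exp(-i * log_inc) for i in range(num_timescales)]
    scaled = [position * s for s in inv_timescales]
    return [math.sin(x) for x in scaled] + [math.cos(x) for x in scaled]


pe0 = positional_encoding(0, 4)
pe5 = positional_encoding(5, 4)
```

Inside a RecLayer the position would be the current time frame (plus offset, if given), while a positive constant would fix the position to that value.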

layer_class: Optional[str] = 'positional_encoding'[source]
recurrent = True[source]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
classmethod get_out_data_from_opts(name, network, add_to_input=False, sources=(), **kwargs)[source]
Parameters:
Return type:

Data

classmethod get_rec_initial_extra_outputs(batch_dim, rec_layer, network, **kwargs)[source]
Parameters:
Return type:

dict[str,tf.Tensor]

classmethod get_rec_initial_extra_outputs_shape_invariants(rec_layer, network, **kwargs)[source]
Parameters:
Returns:

optional shapes for the tensors by get_rec_initial_extra_outputs

Return type:

dict[str,tf.TensorShape]

class returnn.tf.layers.rec.KenLmStateLayer(lm_file, vocab_file=None, vocab_unknown_label='UNK', bpe_merge_symbol=None, axis=<class 'returnn.util.basic.NotSpecified'>, input_step_offset=0, dense_output=False, debug=False, **kwargs)[source]

Gets the next word (or subword) in each frame, accumulates the string, keeps the state of the string seen so far, and returns the score (+log space, natural base e) of the sequence, using KenLM (https://kheafield.com/code/kenlm/) (see TFKenLM). The EOS (</s>) token must be used explicitly.

Parameters:
  • lm_file (str|()->str) – ARPA file or so. whatever KenLM supports

  • vocab_file (str|None) – if the inputs are symbols, this must be provided. see Vocabulary

  • vocab_unknown_label (str) – for the vocabulary

  • bpe_merge_symbol (str|None) – e.g. “@@” if you want to apply BPE merging

  • axis (Dim|str|NotSpecified)

  • input_step_offset (int) – if provided, will consider the input only from this step onwards

  • dense_output (bool) – whether we output the score for all possible succeeding tokens

  • debug (bool) – prints debug info
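
To illustrate what BPE merging with a merge symbol such as "@@" does to the accumulated string, here is a small sketch. The helper merge_bpe is hypothetical and independent of KenLM; the scoring itself is not shown.

```python
def merge_bpe(tokens, bpe_merge_symbol="@@"):
    """Join subword tokens into words by removing the merge symbol and the following space."""
    text = " ".join(tokens)
    return text.replace(bpe_merge_symbol + " ", "")


# the EOS token is passed explicitly, like any other token
words = merge_bpe(["he@@", "llo", "wor@@", "ld", "</s>"])
```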

layer_class: Optional[str] = 'kenlm'[source]
recurrent = True[source]
classmethod get_out_data_from_opts(name, sources, vocab_file=None, vocab_unknown_label='UNK', dense_output=False, **kwargs)[source]
Parameters:
  • name (str)

  • sources (list[LayerBase])

  • vocab_file (str|None)

  • vocab_unknown_label (str)

  • dense_output (bool)

Return type:

Data

classmethod get_rec_initial_extra_outputs(batch_dim, rec_layer, sources=(), **kwargs)[source]
Parameters:
Return type:

dict[str,tf.Tensor]

class returnn.tf.layers.rec.EditDistanceLayer(a, b, a_spatial_dim=None, b_spatial_dim=None, **kwargs)[source]

Edit distance, also known as Levenshtein distance, or in case of words, word error rate (WER), or in case of characters, character error rate (CER).

This will not normalize the result, i.e. return the absolute minimal number of edits (add, delete, replace) to transform the first string into the second string. For WER/CER, it is common to normalize by the length of the target string, but accumulated per epoch.
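
The unnormalized distance and the accumulated normalization can be sketched in plain Python. This is a textbook dynamic-programming version for illustration, not the layer's TensorFlow implementation; the example sentences are made up.

```python
def edit_distance(a, b):
    """Minimal number of insertions, deletions and substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute (or match)
        prev = cur
    return prev[-1]


# (hypothesis, reference) pairs, here as word lists
pairs = [("the cat sat".split(), "the cat sat down".split()),
         ("a b c".split(), "a x c".split())]

# accumulate edits and reference lengths first, then normalize once
total_edits = sum(edit_distance(hyp, ref) for hyp, ref in pairs)
total_ref_len = sum(len(ref) for _, ref in pairs)
wer = total_edits / total_ref_len
```

Accumulating before dividing gives a length-weighted error rate, which generally differs from the mean of per-sequence rates.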

Parameters:
layer_class: Optional[str] = 'edit_distance'[source]
recurrent = True[source]
get_dep_layers()[source]
Return type:

list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
classmethod get_out_data_from_opts(name, a, b, a_spatial_dim=None, b_spatial_dim=None, **kwargs)[source]
Parameters:
Return type:

Data

class returnn.tf.layers.rec.EditDistanceTableLayer(axis=<class 'returnn.util.basic.NotSpecified'>, debug=False, blank_idx=None, out_dim=None, **kwargs)[source]

Given a source and a target, calculates the edit distance table between them. Source can be inside a recurrent loop. It uses TFNativeOp.next_edit_distance_row().

Usually, if you are inside a rec layer, and “output” is the ChoiceLayer, you would use “from”: “output” and “target”: “layer:base:data:target” (make sure it has the time dimension).

See also OptimalCompletionsLayer.

Parameters:
  • axis (Dim|str|NotSpecified)

  • debug (bool)

  • blank_idx (int|None) – if given, will keep the same row for this source label

  • out_dim (Dim|None)
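
Inside a recurrent loop, the table is built one row per source label. The following sketch shows such a row update in plain Python; next_row_sketch is a hypothetical stand-in and does not reproduce the signature of TFNativeOp.next_edit_distance_row().

```python
def next_row_sketch(prev_row, source_label, target, blank_idx=None):
    """Compute the next edit distance row after consuming one source label.

    prev_row[j] is the distance between the source so far and target[:j].
    """
    if blank_idx is not None and source_label == blank_idx:
        return list(prev_row)  # blank: keep the same row
    row = [prev_row[0] + 1]
    for j, t in enumerate(target, start=1):
        row.append(min(prev_row[j] + 1,
                       row[j - 1] + 1,
                       prev_row[j - 1] + (source_label != t)))
    return row


target = [5, 7, 9]
row0 = list(range(len(target) + 1))  # empty source vs. all target prefixes
row1 = next_row_sketch(row0, 5, target)
row1_blank = next_row_sketch(row0, 0, target, blank_idx=0)
```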

layer_class: Optional[str] = 'edit_distance_table'[source]
recurrent = True[source]
classmethod get_rec_initial_extra_outputs(batch_dim, rec_layer, sources, name, target, network, **kwargs)[source]
Parameters:
Return type:

dict[str,tf.Tensor]

classmethod get_rec_initial_output(**kwargs)[source]
Return type:

tf.Tensor

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
classmethod get_out_data_from_opts(name, sources, target, network, _target_layers=None, blank_idx=None, out_dim=None, **kwargs)[source]
Parameters:
Return type:

Data

class returnn.tf.layers.rec.OptimalCompletionsLayer(debug=False, blank_idx=None, **kwargs)[source]

We expect to get the inputs from EditDistanceTableLayer, esp from the prev frame, like this: “opt_completions”: {“class”: “optimal_completions”, “from”: “prev:edit_dist_table”}.

You can also then define this further layer: “opt_completion_soft_targets”: {“class”: “eval”, “eval”: “tf.nn.softmax(tf.cast(source(0), tf.float32))”, “from”: “opt_completions”, “out_type”: {“dtype”: “float32”}},

and use that as the CrossEntropyLoss soft targets for the input of the “output” ChoiceLayer, e.g. “output_prob”. This makes most sense when you enable beam search (even, or esp, during training). Note that you probably want to have this all before the last choice, where you still have more beams open.

Parameters:
  • debug (bool)

  • blank_idx (int|None)
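
Put together as a Python dict, the layers described above might be written as in this sketch. The "target" entries are an assumption based on the target argument in the signatures and on the EditDistanceTableLayer description; the layer names follow the text above.

```python
# Fragment of a RecLayer subnetwork dict (sketch, not a complete network).
rec_subnet_fragment = {
    "edit_dist_table": {
        "class": "edit_distance_table",
        "from": "output",  # the ChoiceLayer
        "target": "layer:base:data:target",
    },
    "opt_completions": {
        "class": "optimal_completions",
        "from": "prev:edit_dist_table",
        "target": "layer:base:data:target",  # assumption, see lead-in
    },
    "opt_completion_soft_targets": {
        "class": "eval",
        "eval": "tf.nn.softmax(tf.cast(source(0), tf.float32))",
        "from": "opt_completions",
        "out_type": {"dtype": "float32"},
    },
}
```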

layer_class: Optional[str] = 'optimal_completions'[source]
recurrent = True[source]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
classmethod get_out_data_from_opts(name, sources, target, network, _target_layers=None, blank_idx=None, **kwargs)[source]
Parameters:
Return type:

Data

class returnn.tf.layers.rec.MaskedComputationLayer(mask, unit, masked_from, _layer_class, _layer_desc, in_spatial_dim=None, out_spatial_dim=None, _queried_sub_layers=None, _parent_layer_cache=None, **kwargs)[source]

Given some input [B,T,D] and some mask [B,T] (True or False), we want to perform a computation only on the masked frames. I.e. let T’ be the max seq len of the masked seq, then the masked input would be [B,T’,D]. (This masked input sequence could be calculated via tf.boolean_mask or tf.gather_nd.) The output is [B,T’,D’], i.e. we do not undo the masking. You are supposed to use UnmaskLayer to undo the masking.

The computation also works within a rec layer, i.e. the input is just [B,D] and the mask is just [B]. In that case, if the mask is True, it will perform the computation as normal, and if it is False, it will just copy the prev output, and also hidden state.
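
The in-loop behaviour can be sketched for a single sequence in plain Python. The helper names and the running-sum computation are made up for this example; they only illustrate "compute if the mask is True, otherwise copy the previous output and state".

```python
def masked_step(mask, x, prev_output, prev_state, compute):
    """One loop step: run compute on masked frames, otherwise copy the previous values."""
    if mask:
        return compute(x, prev_state)
    return prev_output, prev_state


def running_sum(x, state):
    """Toy stateful computation: output and state are the sum of masked inputs so far."""
    new_state = state + x
    return new_state, new_state


inputs = [1, 2, 3, 4]
mask = [True, False, True, False]
output, state = 0, 0
outputs = []
for m, x in zip(mask, inputs):
    output, state = masked_step(m, x, output, state, running_sum)
    outputs.append(output)
```

Only the frames with values 1 and 3 take part in the computation; the other frames repeat the previous output.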

Parameters:
  • mask (LayerBase|None)

  • unit (dict[str])

  • masked_from (LayerBase|None)

  • in_spatial_dim (Dim|None)

  • out_spatial_dim (Dim|None) – the masked dim

  • _layer_class (type[LayerBase])

  • _layer_desc (dict[str])

  • _queried_sub_layers (dict[str,(Data,type,dict[str])]|None)

  • _parent_layer_cache (dict[str,LayerBase]|None)

layer_class: Optional[str] = 'masked_computation'[source]
recurrent = True[source]
get_dep_layers()[source]
Return type:

list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace

  • network (returnn.tf.network.TFNetwork)

  • get_layer (((str) -> LayerBase)) – function to get or construct another layer

classmethod get_out_data_from_opts(network, masked_from=None, in_spatial_dim=None, out_spatial_dim=None, **kwargs)[source]
Parameters:
Return type:

Data

classmethod get_sub_layer_out_data_from_opts(layer_name, parent_layer_kwargs)[source]
Parameters:
  • layer_name (str) – name of the sub_layer (right part of ‘/’ separated path)

  • parent_layer_kwargs (dict[str]) – kwargs for the parent layer (as kwargs in cls.get_out_data_from_opts())

Returns:

Data template, class type of sub-layer, layer opts (transformed)

Return type:

(Data, type, dict[str])|None

get_sub_layer(layer_name)[source]
Parameters:

layer_name (str) – name of the sub_layer (right part of ‘/’ separated path)

Returns:

the sub_layer addressed in layer_name or None if no sub_layer exists

Return type:

LayerBase|None

classmethod get_available_sub_layer_names(parent_layer_kwargs)[source]
Parameters:

parent_layer_kwargs (dict[str])

Return type:

list[str]

get_constraints_value()[source]
Return type:

tf.Tensor|None

classmethod get_losses(name, network, output, loss=None, reduce_func=None, layer=None, **kwargs)[source]
Parameters:
  • name (str) – layer name

  • network (returnn.tf.network.TFNetwork)

  • loss (Loss|None) – argument just as for __init__

  • output (Data) – the output (template) for the layer

  • layer (LayerBase|None)

  • reduce_func (((tf.Tensor)->tf.Tensor)|None)

  • kwargs – other layer kwargs

Return type:

list[returnn.tf.network.LossHolder]

classmethod get_rec_initial_output(initial_output=None, **kwargs)[source]
Parameters:

initial_output

Return type:

tf.Tensor

classmethod get_rec_initial_extra_outputs(batch_dim, rec_layer, **kwargs)[source]
Parameters:
Return type:

dict[str,tf.Tensor]

classmethod get_rec_initial_extra_outputs_shape_invariants(rec_layer, **kwargs)[source]
Parameters:

rec_layer (returnn.tf.layers.rec.RecLayer)

Returns:

optional shapes for the tensors by get_rec_initial_extra_outputs

Return type:

dict[str,tf.TensorShape]

class returnn.tf.layers.rec.UnmaskLayer(mask, **kwargs)[source]

This is meant to be used together with MaskedComputationLayer, which operates on input [B,T,D], and given a mask, returns [B,T’,D’]. This layer UnmaskLayer is supposed to undo the masking, i.e. to recover the original time dimension, i.e. given [B,T’,D’], we output [B,T,D’]. This is done by repeating the output for the non-masked frames, via the last masked frame.

If this layer is inside a recurrent loop, i.e. we get [B,D’] as input, this is a no-op, and we just return the input as is. In that case, the repetition logic is handled via MaskedComputationLayer.

Parameters:

mask (LayerBase) – the same as used for MaskedComputationLayer. Outside loop: [B,T] or [T,B], original T. Inside loop, just [B].
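
Outside a loop, the repetition logic can be sketched for one sequence as follows. The helper unmask and the initial filler used before the first masked frame are assumptions for illustration; the layer may handle that case differently.

```python
def unmask(masked_frames, mask, initial=None):
    """Expand [T'] frames back to [T] by repeating the last masked frame.

    Frames before the first masked position get the value of initial (assumption).
    """
    out = []
    last = initial
    it = iter(masked_frames)
    for m in mask:
        if m:
            last = next(it)  # advance to the next masked frame
        out.append(last)
    return out


mask = [False, True, False, False, True]
restored = unmask(["a", "b"], mask, initial="init")
```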

layer_class: Optional[str] = 'unmask'[source]
recurrent = True[source]
get_dep_layers()[source]
Return type:

list[LayerBase]

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace

  • network (returnn.tf.network.TFNetwork)

  • get_layer (((str) -> LayerBase)) – function to get or construct another layer

classmethod get_out_data_from_opts(name, network, sources, mask, **kwargs)[source]
Parameters:
Return type:

Data

classmethod get_rec_initial_extra_outputs(batch_dim, rec_layer, sources, **kwargs)[source]
Parameters:
Return type:

dict[str,tf.Tensor]

class returnn.tf.layers.rec.BaseRNNCell(*args, **kwargs)[source]

Extends rnn_cell.RNNCell by having explicit static attributes describing some properties.

get_input_transformed(x, batch_dim=None)[source]

Usually the cell itself does the transformation on the input. However, it would be faster to do it outside the recurrent loop. This function will get called outside the loop.

Parameters:
  • x (tf.Tensor) – (time, batch, dim), or (batch, dim)

  • batch_dim (tf.Tensor|None)

Returns:

like x, maybe other feature-dim

Return type:

tf.Tensor|tuple[tf.Tensor]

class returnn.tf.layers.rec.VanillaLSTMCell(*args, **kwargs)[source]

Just a vanilla LSTM cell, which is compatible with our NativeLSTM (v1 and v2).

Parameters:

num_units (int)

property output_size[source]
Return type:

int

property state_size[source]
Return type:

rnn_cell.LSTMStateTuple

get_input_transformed(x, batch_dim=None)[source]
Parameters:
  • x (tf.Tensor) – (time, batch, dim), or (batch, dim)

  • batch_dim (tf.Tensor|None)

Returns:

like x, maybe other feature-dim

Return type:

tf.Tensor|tuple[tf.Tensor]

class returnn.tf.layers.rec.RHNCell(*args, **kwargs)[source]

Recurrent Highway Layer. With optional dropout for recurrent state (fixed over all frames - some call this variational).

References:

https://github.com/julian121266/RecurrentHighwayNetworks/ https://arxiv.org/abs/1607.03474

Parameters:
  • num_units (int)

  • is_training (bool|tf.Tensor|None)

  • depth (int)

  • dropout (float)

  • dropout_seed (int)

  • transform_bias (float|None)

  • batch_size (int|tf.Tensor|None)

property output_size[source]
Return type:

int

property state_size[source]
Return type:

int

get_input_transformed(x, batch_dim=None)[source]
Parameters:
  • x (tf.Tensor) – (time, batch, dim)

  • batch_dim (tf.Tensor|None)

Returns:

(time, batch, num_units * 2)

Return type:

tf.Tensor

call(inputs, state)[source]
Parameters:
  • inputs (tf.Tensor)

  • state (tf.Tensor)

Returns:

(output, state)

Return type:

(tf.Tensor, tf.Tensor)

class returnn.tf.layers.rec.BlocksparseLSTMCell(*args, **kwargs)[source]

Standard LSTM but uses OpenAI blocksparse kernels to support bigger matrices.

Refs:

It uses our own wrapper, see TFNativeOp.init_blocksparse().

Parameters:

num_units (int)

call(*args, **kwargs)[source]
Parameters:
  • args – passed to super

  • kwargs – passed to super

Return type:

tf.Tensor|tuple[tf.Tensor]

load_params_from_native_lstm(values_dict, session)[source]
Parameters:
  • session (tf.compat.v1.Session)

  • values_dict (dict[str,numpy.ndarray])

class returnn.tf.layers.rec.BlocksparseMultiplicativeMultistepLSTMCell(*args, **kwargs)[source]

Multiplicative LSTM with multiple steps, as in the OpenAI blocksparse paper. Uses OpenAI blocksparse kernels to support bigger matrices.

Refs:

Parameters:

num_units (int)

call(*args, **kwargs)[source]
Return type:

tf.Tensor

class returnn.tf.layers.rec.LayerNormVariantsLSTMCell(*args, **kwargs)[source]

LSTM unit with layer normalization and recurrent dropout

This LSTM cell can apply different variants of layer normalization:

1. Layer normalization as in the original paper. Ref: https://arxiv.org/abs/1607.06450
   This can be applied by having all default params (global_norm=True, cell_norm=True, cell_norm_in_output=True).

2. Layer normalization for RNMT+. Ref: https://arxiv.org/abs/1804.09849
   This can be applied by having all default params except:
     - global_norm = False
     - per_gate_norm = True
     - cell_norm_in_output = False

3. TF official LayerNormBasicLSTMCell. Ref: https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/LayerNormBasicLSTMCell
   This can be reproduced by having all default params except:
     - global_norm = False
     - per_gate_norm = True

4. Sockeye LSTM layer normalization implementations. Ref: https://github.com/awslabs/sockeye/blob/master/sockeye/rnn.py
   LayerNormLSTMCell can be reproduced by having all default params except:
     - with_concat = False (just efficiency, no difference in the model)
   LayerNormPerGateLSTMCell can be reproduced by having all default params except:
     - (with_concat = False)
     - global_norm = False
     - per_gate_norm = True

Recurrent dropout is based on:

https://arxiv.org/abs/1603.05118

Prohibited LN combinations:
  - global_norm and global_norm_joined both enabled
  - per_gate_norm with global_norm or global_norm_joined
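The variants and the prohibited combinations above can be collected into a small lookup table with a validity check. This is only a sketch: the names `LN_VARIANT_OPTS`, `ASSUMED_DEFAULTS`, `is_allowed_ln_combination` and `resolve` are hypothetical, and the defaults for global_norm_joined and per_gate_norm are inferred (the documented default global_norm=True would otherwise already be a prohibited combination).

```python
# Options per variant, as listed above. Omitted options keep the cell's defaults.
LN_VARIANT_OPTS = {
    "original_paper": {},
    "rnmt_plus": {"global_norm": False, "per_gate_norm": True, "cell_norm_in_output": False},
    "tf_layer_norm_basic_lstm": {"global_norm": False, "per_gate_norm": True},
    "sockeye_layer_norm_lstm": {"with_concat": False},
    "sockeye_layer_norm_per_gate_lstm": {
        "with_concat": False, "global_norm": False, "per_gate_norm": True},
}

# global_norm=True is documented; the other two defaults are assumptions.
ASSUMED_DEFAULTS = {"global_norm": True, "global_norm_joined": False, "per_gate_norm": False}


def is_allowed_ln_combination(global_norm, global_norm_joined, per_gate_norm):
    """Return False for the combinations listed as prohibited."""
    if global_norm and global_norm_joined:
        return False
    if per_gate_norm and (global_norm or global_norm_joined):
        return False
    return True


def resolve(variant):
    """Merge the variant's overrides of the three norm flags into the assumed defaults."""
    opts = dict(ASSUMED_DEFAULTS)
    opts.update({k: v for k, v in LN_VARIANT_OPTS[variant].items() if k in ASSUMED_DEFAULTS})
    return opts


all_variants_allowed = all(
    is_allowed_ln_combination(**resolve(name)) for name in LN_VARIANT_OPTS)
```

Under these assumptions, every listed variant passes the check, while for example per_gate_norm=True together with the default global_norm=True does not.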

Parameters:
  • num_units (int) – number of lstm units

  • norm_gain (float) – layer normalization gain value

  • norm_shift (float) – layer normalization shift (bias) value

  • forget_bias (float) – the bias added to forget gates

  • activation – Activation function to be applied in the lstm cell

  • is_training (bool) – if True then we are in the training phase

  • dropout (float) – dropout rate, applied on cell-in (j)

  • dropout_h (float) – dropout rate, applied on hidden state (h) when it enters the LSTM (variational dropout)

  • dropout_seed (int) – used to create random seeds

  • with_concat (bool) – if True then the input and prev hidden state are concatenated for the computation. This only affects computational performance, not the model.

  • global_norm (bool) – if True then layer normalization is applied for the forward and recurrent outputs (separately).

  • global_norm_joined (bool) – if True, then layer norm is applied on LSTM in (forward and recurrent output together)

  • per_gate_norm (bool) – if True then layer normalization is applied per lstm gate

  • cell_norm (bool) – if True then layer normalization is applied to the LSTM new cell output

  • cell_norm_in_output (bool) – if True, the normalized cell is also used in the output

  • hidden_norm (bool) – if True then layer normalization is applied to the LSTM new hidden state output

property output_size[source]
Return type:

int

property state_size[source]
Return type:

rnn_cell.LSTMStateTuple

get_input_transformed(inputs, batch_dim=None)[source]
Parameters:
  • inputs (tf.Tensor)

  • batch_dim (tf.Tensor|None)

Return type:

tf.Tensor

class returnn.tf.layers.rec.TwoDLSTMLayer(pooling='last', unit_opts=None, forward_weights_init=None, recurrent_weights_init=None, bias_init=None, **kwargs)[source]

2D LSTM.

Currently only from left-to-right in the time axis. Can be inside a recurrent loop, or outside.

Parameters:
layer_class: Optional[str] = 'twod_lstm'[source]
recurrent = True[source]
classmethod get_out_data_from_opts(sources, n_out, name, **kwargs)[source]
Parameters:
  • sources (list[LayerBase])

  • n_out (int)

  • name (str)

Return type:

Data

get_constraints_value()[source]
Return type:

tf.Tensor

classmethod helper_extra_outputs(batch_dim, src_length, features)[source]
Parameters:
  • batch_dim (tf.Tensor)

  • src_length (tf.Tensor)

  • features (tf.Tensor|int)

Return type:

dict[str,tf.Tensor]

classmethod get_rec_initial_extra_outputs(batch_dim, n_out, sources, **kwargs)[source]
Parameters:
  • batch_dim (tf.Tensor)

  • n_out (int)

  • sources (list[LayerBase])

Return type:

dict[str,tf.Tensor]

classmethod get_rec_initial_extra_outputs_shape_invariants(rec_layer, n_out, sources, **kwargs)[source]
Parameters:
Returns:

optional shapes for the tensors by get_rec_initial_extra_outputs

Return type:

dict[str,tf.TensorShape]

class returnn.tf.layers.rec.ZoneoutLSTMCell(*args, **kwargs)[source]

Wrapper for tf LSTM to create Zoneout LSTM Cell. This code is an adapted version of Rayhane Mamah's version of Tacotron-2.

Refs:

Initializer with possibility to set different zoneout values for cell/hidden states.

Parameters:
  • num_units – number of hidden units

  • zoneout_factor_cell – cell zoneout factor

  • zoneout_factor_output – output zoneout factor

  • use_zoneout_output – If False, return the direct output of the underlying LSTM, without applying zoneout. So the output is different from h. This is different from the original Zoneout LSTM paper. If True, h is the same as output, and it is the same as the original Zoneout LSTM paper. This was False in our earlier implementation, and up to behavior version 16. Since behavior version 17, the default is True.
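For orientation, the general zoneout rule from the Zoneout paper can be sketched in plain Python as below. This is not taken from the RETURNN implementation, which may compute the same thing differently. The function name `zoneout` and its arguments are hypothetical.

```python
import random


def zoneout(prev, new, factor, training, rng=random):
    """Mix previous and new state values with the given zoneout factor.

    Training: each unit keeps its previous value with probability `factor`.
    Inference: each unit uses the expected value of that random choice.
    """
    if training:
        return [p if rng.random() < factor else n for p, n in zip(prev, new)]
    return [factor * p + (1.0 - factor) * n for p, n in zip(prev, new)]


prev_c, new_c = [1.0, 1.0], [0.0, 0.0]
inference_c = zoneout(prev_c, new_c, factor=0.25, training=False)
```

In this picture, zoneout_factor_cell would play the role of `factor` for the cell state and zoneout_factor_output the role of `factor` for the hidden state h. With use_zoneout_output=True, the zoned-out h is also what the cell returns as output.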

property state_size[source]
Return type:

int

property output_size[source]
Return type:

int

class returnn.tf.layers.rec.RelativePositionalEncodingLayer(out_dim=None, n_out=None, forward_weights_init='glorot_uniform', clipping=16, fixed=False, query_spatial_dim=<class 'returnn.util.basic.NotSpecified'>, key_value_spatial_dim=<class 'returnn.util.basic.NotSpecified'>, query_offset=<class 'returnn.util.basic.NotSpecified'>, **kwargs)[source]

Relative positioning term as introduced by Shaw et al., 2018

Usually added to Self-Attention using key_shift. Parts of the code are adapted from Tensor2Tensor (https://github.com/tensorflow/tensor2tensor).

In general, the output is [length|1 (query-len), length (key|value-len), n_out], intended for self-attention.

If inside a recurrent loop, will take the current rec step as the position, and relates it to self or previous positions, so the output is [1 (query-len), t+1 (key|value-len), n_out]. Otherwise, it assumes that the input has a time dimension, and relates each position to all others, so the output is [length (query-len), length (key|value-len), n_out].

The input defines the query/key/value length by default.
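The two output shapes described above can be illustrated with the clipped relative-position indexing scheme of Shaw et al. The sketch is plain Python and not RETURNN code. The function name `relative_position_indices` is hypothetical, and the sign convention (key position minus query position) is an assumption.

```python
def relative_position_indices(query_positions, num_keys, clipping):
    """Map each (query, key) pair to an index into a table of 2 * clipping + 1 encodings."""
    rows = []
    for q in query_positions:
        row = []
        for k in range(num_keys):
            distance = k - q
            # Distances beyond the clipping range share the last encoding.
            clipped = max(-clipping, min(clipping, distance))
            row.append(clipped + clipping)
        rows.append(row)
    return rows


# Outside a rec loop: every position relates to all others, shape [length, length].
outside = relative_position_indices(range(4), num_keys=4, clipping=2)
# Inside a rec loop at step t = 3: one query row over keys 0..t, shape [1, t + 1].
inside = relative_position_indices([3], num_keys=4, clipping=2)
```

Each index would then select one encoding vector of size n_out, which gives the additional n_out axis of the output.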

Example usage:

d[output + '_rel_pos'] = {"class": "relative_positional_encoding",
                          "from": [output + '_self_att_laynorm'],
                          "n_out": self.EncKeyTotalDim // self.AttNumHeads,
                          "forward_weights_init": self.ff_init}
d[output + '_self_att_att'] = {"class": "self_attention",
                               "num_heads": self.AttNumHeads,
                               "total_key_dim": self.EncKeyTotalDim,
                               "n_out": self.EncValueTotalDim, "from": [output + '_self_att_laynorm'],
                               "attention_left_only": False, "attention_dropout": self.attention_dropout,
                               "forward_weights_init": self.ff_init,
                               "key_shift": output + '_rel_pos'}
Parameters:
  • out_dim (Dim|None) – Feature dimension of encoding.

  • n_out (int|None) – Feature dimension of encoding.

  • clipping (int) – Distance beyond which to fall back to the last encoding

  • fixed (bool) – Uses sinusoid positional encoding instead of learned parameters

  • forward_weights_init (str) – see returnn.tf.util.basic.get_initializer()

  • query_spatial_dim (Dim|str|None|NotSpecified) – spatial dimension of query

  • key_value_spatial_dim (Dim|str|NotSpecified) – spatial dimension of key/value

  • query_offset (int|NotSpecified) – offset for query position. The default behavior: if key_value_spatial_dim is not specified and the input has no time dim, we assume that we are inside a rec loop and use the current step.
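The example usage above refers to attributes of a surrounding config class (self.EncKeyTotalDim and so on). A self-contained variant could look like the sketch below. All concrete values and the layer name prefix are hypothetical placeholders, and the referenced "_self_att_laynorm" layer is assumed to be defined elsewhere in the network dict.

```python
# Hypothetical stand-ins for the attributes used in the example usage.
enc_key_total_dim = 512
enc_value_total_dim = 512
att_num_heads = 8
attention_dropout = 0.1
ff_init = "glorot_uniform"  # the default of forward_weights_init in the signature
output = "enc_01"  # hypothetical layer name prefix

d = {}
d[output + "_rel_pos"] = {
    "class": "relative_positional_encoding",
    "from": [output + "_self_att_laynorm"],
    # One encoding per head-sized key vector.
    "n_out": enc_key_total_dim // att_num_heads,
    "forward_weights_init": ff_init,
}
d[output + "_self_att_att"] = {
    "class": "self_attention",
    "num_heads": att_num_heads,
    "total_key_dim": enc_key_total_dim,
    "n_out": enc_value_total_dim,
    "from": [output + "_self_att_laynorm"],
    "attention_left_only": False,
    "attention_dropout": attention_dropout,
    "forward_weights_init": ff_init,
    # The relative positional encoding is added via key_shift.
    "key_shift": output + "_rel_pos",
}
```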

layer_class: Optional[str] = 'relative_positional_encoding'[source]
recurrent = True[source]
classmethod get_out_data_from_opts(name, sources, network, out_dim=None, n_out=None, query_spatial_dim=<class 'returnn.util.basic.NotSpecified'>, key_value_spatial_dim=<class 'returnn.util.basic.NotSpecified'>, **kwargs)[source]
Parameters:
Return type:

Data

class returnn.tf.layers.rec.CumConcatLayer(out_spatial_dim, axis=None, **kwargs)[source]

Concatenates all previous frames of a time-axis. Just as CumsumLayer uses sum, this layer uses concat.

This layer can be used as a base for auto-regressive self-attention.

This layer expects to be inside a RecLayer.

Inside a rec loop (not optimized out), this will concatenate the current input to the previous accumulated inputs. For an input of shape input_shape, it will output a tensor of shape [new_dim] + input_shape. new_dim (out_spatial_dim) is a special dimension, usually of length i, where i is the current loop frame, i.e. the length increases in every loop frame. new_dim is specified by a separate own dim tag. For example, in the first frame, this will be of shape [1] + input_shape, in the second frame shape [2] + input_shape, and so on, and in the last frame shape [T] + input_shape.
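The in-loop behavior can be simulated in plain Python. This is an illustration of the described semantics only, not RETURNN code. The function name `cum_concat_loop` is hypothetical, and each frame is represented by a placeholder string instead of a tensor.

```python
def cum_concat_loop(frames):
    """Return, for every loop frame, the concatenation of all inputs seen so far."""
    accumulated = []
    outputs = []
    for frame in frames:
        # Build a new list so that the outputs of earlier frames stay unchanged.
        accumulated = accumulated + [frame]
        outputs.append(accumulated)
    return outputs


frames = ["x1", "x2", "x3"]  # T = 3 loop frames
per_frame = cum_concat_loop(frames)
new_dim_lengths = [len(out) for out in per_frame]
```

Here new_dim_lengths is [1, 2, 3], i.e. the length of new_dim grows by one in every loop frame and reaches T in the last one.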

Outside the rec loop (optimized out), this layer expects an input with the time dim of the rec layer, and returns the input as-is, but replacing the time dim tag with the dim tag new_dim converted as outside the loop.

Normally the optimization should not matter for the user, i.e. for the user, the logical behavior is always as if the layer were inside the rec loop. Outside the loop, the output represents a tensor of shape [T, new_dim] + input_shape, although we actually have another new_dim outside the loop, and T is not actually there. We still have all the information, because the last frame contains all of it. This new_dim outside the loop stores all the dynamic seq lengths per frame of the loop, i.e. the dynamic seq lengths are extended to shape [B,T] or [T] (unlike the usual [B]). This way, following layers use different seq lengths of new_dim for different loop frames, just as if the T dim actually existed.
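A small sketch of the [T]-shaped case: the per-frame seq lengths of new_dim, and the mask that a following layer would effectively see when it respects them. This is plain Python for illustration, not RETURNN code. The helper names are hypothetical, and the batch-dependent [B,T] case is left out.

```python
def per_frame_lengths(num_frames):
    """Seq length of new_dim for each loop frame: 1, 2, ..., T."""
    return [t + 1 for t in range(num_frames)]


def mask_from_lengths(lengths, max_len):
    """For each loop frame, mark which positions of new_dim are valid."""
    return [[pos < n for pos in range(max_len)] for n in lengths]


T = 3
lengths = per_frame_lengths(T)
mask = mask_from_lengths(lengths, T)
```

The resulting mask is lower triangular, which is the causal pattern needed for auto-regressive self-attention.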

See https://github.com/rwth-i6/returnn/issues/391 for the initial discussion on how to generalize the SelfAttentionLayer which led to this design.

Parameters:
  • out_spatial_dim (Dim)

  • axis (Dim|None) – to operate over. only single_step_dim supported currently, assumes to be inside rec layer

layer_class: Optional[str] = 'cum_concat'[source]
recurrent = True[source]
classmethod get_out_data_from_opts(name, network, sources, out_spatial_dim, **kwargs)[source]
Parameters:
Return type:

Data

classmethod get_rec_initial_extra_outputs(network, batch_dim, rec_layer, sources, output, out_spatial_dim, **kwargs)[source]
Parameters:
Return type:

dict[str,tf.Tensor]

classmethod get_rec_initial_extra_outputs_shape_invariants(rec_layer, network, sources, output, **kwargs)[source]
Parameters:
Return type:

dict[str, tf.TensorShape]