Recurrent Layers

Recurrent Layer

class TFNetworkRecLayer.RecLayer(unit='lstm', unit_opts=None, direction=None, input_projection=True, initial_state=None, max_seq_len=None, forward_weights_init=None, recurrent_weights_init=None, bias_init=None, optimize_move_layers_out=None, cheating=False, unroll=False, back_prop=None, use_global_rec_step_offset=False, include_eos=False, debug=None, **kwargs)[source]

Recurrent layer with support for several LSTM implementations (selected via the unit argument; see the TensorFlow LSTM benchmark, http://returnn.readthedocs.io/en/latest/tf_lstm_benchmark.html), as well as GRU and simple RNN. The unit parameter specifies the operation or model performed in the recurrence. It can be a string naming an RNN cell: all TF cells can be used, the “Cell” suffix can be omitted, and case is ignored. Some possible LSTM implementations are (all available for both CPU and GPU):

  • BasicLSTM (the cell), via official TF, pure TF implementation.
  • LSTMBlock (the cell), via tf.contrib.rnn.
  • LSTMBlockFused, via tf.contrib.rnn. Should be much faster than BasicLSTM.
  • CudnnLSTM, via tf.contrib.cudnn_rnn. This is still experimental.
  • NativeLSTM, our own native LSTM. Should be faster than LSTMBlockFused.
  • NativeLstm2, our improved native LSTM. Should be the fastest and most powerful.

We default to the fastest implementation tested so far, i.e. NativeLSTM. Note that the implementations are currently not compatible with each other, i.e. they differ in how the parameters are represented.
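As a minimal sketch of selecting an implementation by name, the network fragment below defines one forward and one backward LSTM layer. The layer names (lstm_fwd, lstm_bwd), the size 128 and the "data" source are placeholders chosen for illustration; only the unit and direction options are taken from this page.

```python
# Hypothetical network config fragment; names and sizes are illustrative only.
n_out = 128

network = {
    # Forward LSTM. The unit name is given in lower case, since case is ignored.
    "lstm_fwd": {"class": "rec", "unit": "nativelstm2", "direction": 1,
                 "n_out": n_out, "from": ["data"]},
    # Backward LSTM over the same input.
    "lstm_bwd": {"class": "rec", "unit": "nativelstm2", "direction": -1,
                 "n_out": n_out, "from": ["data"]},
}
```

Since the implementations store their parameters differently, changing the unit name of an already trained layer should not be expected to load the old parameters as-is.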

A subnetwork can also be given, which is evaluated step by step. It can use attention over some separate input, which makes it possible to implement a decoder in a sequence-to-sequence scenario. The subnetwork gets the extern data from the parent net as templates, and if there is input to the RecLayer, it is available as the “source” data key in the subnetwork. The subnetwork is specified as a dict for the unit parameter. In the subnetwork, you can access the outputs of layers from the previous time step by referring to them with the “prev:” prefix.

Example:

{
    "class": "rec",
    "from": ["input"],
    "unit": {
      # Recurrent subnet here, operate on a single time-step:
      "output": {
        "class": "linear",
        "from": ["prev:output", "data:source"],
        "activation": "relu",
        "n_out": n_out},
    },
    "n_out": n_out,
}

More examples can be seen in test_TFNetworkRecLayer and test_TFEngine.

The subnetwork can automatically optimize the inner recurrent loop by moving layers out of the loop where possible. It tries to do that greedily. This can be disabled via the option optimize_move_layers_out. The optimization assumes that those layers behave the same whether they are applied with a time dimension or per step without one. Examples of such layers are LinearLayer, RnnCellLayer, or SelfAttentionLayer with the option attention_left_only.
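The following fragment is a sketch of how the optimization could be switched off for a single layer. The sub-layer name input_proj and the size 64 are hypothetical; which layers actually get moved is decided by the RecLayer itself, so the comments only mark likely candidates.

```python
n_out = 64  # illustrative size

rec_layer = {
    "class": "rec",
    "from": ["input"],
    # Set to False to keep every sub-layer inside the loop.
    "optimize_move_layers_out": False,
    "unit": {
        # Depends only on the current input frame, so it would be a candidate
        # for being moved out of the loop if the optimization were enabled.
        "input_proj": {"class": "linear", "from": ["data:source"],
                       "activation": None, "n_out": n_out},
        # Depends on its own previous output, so it has to stay in the loop.
        "output": {"class": "linear", "from": ["prev:output", "input_proj"],
                   "activation": "relu", "n_out": n_out},
    },
    "n_out": n_out,
}
```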

This layer can also be used inside another RecLayer. In that case, it behaves similarly to RnnCellLayer. (This support is still somewhat incomplete. It should work for the native units such as NativeLstm.)

Parameters:
  • unit (str|dict[str,dict[str]]) – the RNNCell/etc name, e.g. “nativelstm”. see comment below. alternatively a whole subnetwork, which will be executed step by step, and which can include “prev” in addition to “from” to refer to previous steps.
  • unit_opts (None|dict[str]) – passed to RNNCell creation
  • direction (int|None) – None|1 -> forward, -1 -> backward
  • input_projection (bool) – True -> input is multiplied with matrix. False only works if same input dim
  • initial_state (LayerBase|str|float|int|tuple|None) –
  • max_seq_len (int|tf.Tensor|None) – only used if unit is a subnetwork. A str will be evaluated; see the code
  • forward_weights_init (str) – see TFUtil.get_initializer()
  • recurrent_weights_init (str) – see TFUtil.get_initializer()
  • bias_init (str) – see TFUtil.get_initializer()
  • optimize_move_layers_out (bool|None) – will automatically move layers out of the loop when possible
  • cheating (bool) – make targets available, and determine length by them
  • unroll (bool) – if possible, unroll the loop (implementation detail)
  • back_prop (bool|None) – for tf.while_loop. the default will use self.network.train_flag
  • use_global_rec_step_offset (bool) –
  • include_eos (bool) – for search, whether we should include the frame where “end” is True
  • debug (bool|None) –
layer_class = 'rec'[source]
recurrent = True[source]
get_dep_layers(self)[source]
Return type:list[LayerBase]
classmethod transform_config_dict(d, network, get_layer)[source]

This method transforms the templates in the config dictionary into references of the layer instances (and creates them in the process).

Parameters:
  • d (dict[str]) – will modify inplace
  • network (TFNetwork.TFNetwork) –
  • get_layer ((str) -> LayerBase) – function to get or construct another layer

classmethod get_out_data_from_opts(unit, sources=(), initial_state=None, **kwargs)[source]
Parameters:
  • unit (str|dict[str]) –
  • sources (list[LayerBase]) –
  • initial_state (str|LayerBase|list[str|LayerBase]) –
Return type:

Data

get_absolute_name_scope_prefix(self)[source]
Return type:str
classmethod get_rec_initial_extra_outputs(**kwargs)[source]
Return type:dict[str,tf.Tensor|tuple[tf.Tensor]]
classmethod get_rec_initial_output(**kwargs)[source]
Return type:tf.Tensor
classmethod get_rnn_cell_class(name)[source]
Parameters:name (str) – cell name, minus the “Cell” at the end
Return type:() -> rnn_cell.RNNCell|TFNativeOp.RecSeqCellOp
classmethod get_losses(name, network, output, loss=None, reduce_func=None, layer=None, **kwargs)[source]
Parameters:
  • name (str) – layer name
  • network (TFNetwork.TFNetwork) –
  • loss (Loss|None) – argument just as for __init__
  • output (Data) – the output (template) for the layer
  • reduce_func (((tf.Tensor)->tf.Tensor)|None) –
  • layer (LayerBase|None) –
  • kwargs – other layer kwargs
Return type:

list[TFNetwork.LossHolder]

get_constraints_value(self)[source]
Return type:tf.Tensor
static convert_cudnn_canonical_to_lstm_block(reader, prefix, target='lstm_block_wrapper/')[source]

This currently assumes CudnnLSTM with num_layers=1, input_mode="linear_input", direction="unidirectional"!

Parameters:
  • reader (tf.train.CheckpointReader) –
  • prefix (str) – e.g. “layer2/rec/”
  • target (str) – e.g. “lstm_block_wrapper/” or “rnn/lstm_cell/”
Returns:

dict key -> value, {“…/kernel”: …, “…/bias”: …} with prefix

Return type:

dict[str,numpy.ndarray]

get_last_hidden_state(self, key)[source]
Parameters:key (str|int|None) –
Return type:tf.Tensor
classmethod is_prev_step_layer(layer)[source]
Parameters:layer (LayerBase) –
Return type:bool
get_sub_layer(self, layer_name)[source]
Parameters:layer_name (str) – name of the sub_layer (right part of ‘/’ separated path)
Returns:the sub_layer addressed in layer_name or None if no sub_layer exists
Return type:LayerBase|None

RNN Cell Layer

class TFNetworkRecLayer.RnnCellLayer(n_out, unit, unit_opts=None, initial_state=None, initial_output=None, weights_init='xavier', **kwargs)[source]

Wrapper around tf.contrib.rnn.RNNCell. This operates on a single step, i.e. there is no time dimension: we expect a (batch,n_in) input, and our output is (batch,n_out). This is expected to be used inside a RecLayer. (But it can also handle the case of being optimized out of the rec loop, i.e. outside a RecLayer, with a time dimension.)
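A minimal sketch of the intended usage, assuming a RecLayer with a subnetwork that contains a single rnn_cell layer. The size 128 and the "input" source are placeholders.

```python
n_out = 128  # illustrative size

rec_layer = {
    "class": "rec",
    "from": ["input"],
    "unit": {
        # Single-step LSTM cell; the surrounding RecLayer supplies the loop over time.
        # initial_output is deliberately not set, it follows from the initial state.
        "output": {"class": "rnn_cell", "unit": "LSTMBlock",
                   "from": ["data:source"], "n_out": n_out},
    },
    "n_out": n_out,
}
```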
Parameters:
  • n_out (int) – so far, only output shape (batch,n_out) supported
  • unit (str|tf.contrib.rnn.RNNCell) – e.g. “BasicLSTM” or “LSTMBlock”
  • unit_opts (dict[str]|None) – passed to the cell.__init__
  • initial_state (str|float|LayerBase|tuple[LayerBase]|dict[LayerBase]) – see self.get_rec_initial_state(). This will be set via transform_config_dict(). To get the state from another recurrent layer, use the GetLastHiddenStateLayer (get_last_hidden_state).
  • initial_output (None) – the initial output is defined implicitly via initial state, thus don’t set this
layer_class = 'rnn_cell'[source]
recurrent = True[source]
classmethod get_out_data_from_opts(n_out, name, sources=(), **kwargs)[source]
Parameters:
  • n_out (int) –
  • name (str) – layer name
  • sources (list[LayerBase]) –
Return type:

Data

get_absolute_name_scope_prefix(self)[source]
Return type:str
get_dep_layers(self)[source]
Return type:list[tf.Tensor]
classmethod get_hidden_state_size(n_out, unit, unit_opts=None, **kwargs)[source]
Parameters:
  • n_out (int) –
  • unit (str) –
  • unit_opts (dict[str]|None) –
Returns:

size or tuple of sizes

Return type:

int|tuple[int]

classmethod get_output_from_state(state, unit)[source]
Parameters:
  • state (tuple[tf.Tensor]|tf.Tensor) –
  • unit (str) –
Return type:

tf.Tensor

get_hidden_state(self)[source]
Returns:state as defined by the cell
Return type:tuple[tf.Tensor]|tf.Tensor
classmethod get_state_by_key(state, key, shape=None)[source]
Parameters:
  • state (tf.Tensor|tuple[tf.Tensor]|namedtuple) –
  • key (int|str|None) –
  • shape (tuple[int|None]) – Shape of the state.
Return type:

tf.Tensor

get_last_hidden_state(self, key)[source]
Parameters:key (int|str|None) –
Return type:tf.Tensor
classmethod get_rec_initial_state(batch_dim, name, n_out, unit, initial_state=None, unit_opts=None, rec_layer=None, **kwargs)[source]

Very similar to get_rec_initial_output(). Initial hidden state when used inside a recurrent layer for the frame t=-1, if it is needed. As arguments, we get the usual layer arguments. batch_dim is added because it might be special because of beam search. Also see transform_config_dict() for initial_state.

Note: This could maybe share code with get_rec_initial_output(), although it is a bit more generic here because the state can also be a namedtuple or any kind of nested structure.

Parameters:
  • batch_dim (tf.Tensor) – including beam size in beam search
  • name (str) – layer name
  • n_out (int) – out dim
  • unit (str) – cell name
  • unit_opts (dict[str]|None) –
  • initial_state (LayerBase|str|int|float|None|list|tuple|namedtuple) – see code
  • rec_layer (RecLayer|LayerBase|None) – for the scope
Return type:

tf.Tensor|tuple[tf.Tensor]|namedtuple

classmethod get_rec_initial_state_inner(initial_shape, name, state_key='state', key=None, initial_state=None, shape_invariant=None, rec_layer=None)[source]

Generate initial hidden state. Primarily used as an inner function for RnnCellLayer.get_rec_initial_state.

Parameters:
  • initial_shape (tuple) – shape of the initial state.
  • name (str) – layer name.
  • state_key (str) – key to be used to get the state from final_rec_vars.
  • key (str|None) – key/attribute of the state if state is a dictionary/namedtuple (like ‘c’ and ‘h’ for LSTM states).
  • initial_state (LayerBase|str|int|float|None|list|tuple|namedtuple) – see code
  • shape_invariant (tuple) – If provided, directly used. Otherwise, guessed from initial_shape (see code below).
  • rec_layer (RecLayer|LayerBase|None) – For the scope.
Return type:

tf.Tensor

classmethod get_rec_initial_extra_outputs(**kwargs)[source]
Return type:dict[str,tf.Tensor|tuple[tf.Tensor]]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace
  • network (TFNetwork.TFNetwork) –
  • get_layer ((str) -> LayerBase) – function to get or construct another layer
static transform_initial_state(initial_state, network, get_layer)[source]
Parameters:
  • initial_state (str|float|int|list[str|float|int]|dict[str]|None) –
  • network (TFNetwork.TFNetwork) –
  • get_layer ((str) -> LayerBase) – function to get or construct another layer
classmethod get_rec_initial_output(unit, initial_output=None, initial_state=None, **kwargs)[source]
Parameters:
  • unit (str) –
  • initial_output (None) –
  • initial_state (LayerBase|str|int|float|None|list|tuple|namedtuple) –
Return type:

tf.Tensor

Get Last Hidden State Layer

class TFNetworkRecLayer.GetLastHiddenStateLayer(n_out, combine='concat', key='*', **kwargs)[source]

Will combine (concat or add or so) all the last hidden states from all sources.
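As a sketch, the fragment below collects the final cell states of a forward and a backward encoder. The layer names and sizes are hypothetical, the key "c" assumes that the chosen unit exposes its state as an LSTM-style tuple with a c entry, and with combine="concat" n_out is assumed to be the sum of the source dimensions.

```python
enc_dim = 256  # illustrative size per direction

network = {
    "enc_fwd": {"class": "rec", "unit": "LSTMBlock", "direction": 1,
                "n_out": enc_dim, "from": ["data"]},
    "enc_bwd": {"class": "rec", "unit": "LSTMBlock", "direction": -1,
                "n_out": enc_dim, "from": ["data"]},
    # Concatenate the last cell states of both directions: (batch, 2 * enc_dim).
    "enc_state": {"class": "get_last_hidden_state",
                  "from": ["enc_fwd", "enc_bwd"],
                  "combine": "concat", "key": "c", "n_out": 2 * enc_dim},
}
```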

Parameters:
  • n_out (int) – dimension. output will be of shape (batch, n_out)
  • combine (str) – “concat” or “add”
  • key (str|int|None) – for the state, which could be a namedtuple. see RnnCellLayer.get_state_by_key()
layer_class = 'get_last_hidden_state'[source]
get_last_hidden_state(self, key)[source]
Parameters:key (str|None) –
Return type:tf.Tensor
classmethod get_out_data_from_opts(n_out, **kwargs)[source]
Parameters:n_out (int) –
Return type:Data

Get Accumulated Output Layer

class TFNetworkRecLayer.GetRecAccumulatedOutputLayer(sub_layer, **kwargs)[source]

For RecLayer with a subnet. If some layer is explicitly marked as an additional output layer (via ‘is_output_layer’: True), you can get that subnet layer output via this accessor. Retrieves the accumulated output.

Note that this functionality is now obsolete. You can simply access such a sub-layer via the generic sub-layer access mechanism. I.e. instead of:

"sub_layer": {"class": "get_rec_accumulated", "from": "rec_layer", "sub_layer": "hidden"}

You can do:

"sub_layer": {"class": "copy", "from": "rec_layer/hidden"}
Parameters:sub_layer (str) – layer of subnet in RecLayer source, which has ‘is_output_layer’: True
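To put both access variants in context, here is a sketch of a complete fragment. The names rec_layer, hidden, hidden_old and hidden_new as well as the size 64 are placeholders; the two accessor layers mirror the snippets above.

```python
n_out = 64  # illustrative size

network = {
    "rec_layer": {
        "class": "rec",
        "from": ["data"],
        "unit": {
            # Marked as an additional output layer of the subnet.
            "hidden": {"class": "linear", "from": ["data:source"],
                       "activation": "tanh", "n_out": n_out,
                       "is_output_layer": True},
            "output": {"class": "linear", "from": ["prev:output", "hidden"],
                       "activation": "relu", "n_out": n_out},
        },
        "n_out": n_out,
    },
    # Obsolete accessor.
    "hidden_old": {"class": "get_rec_accumulated", "from": "rec_layer",
                   "sub_layer": "hidden"},
    # Generic sub-layer access, addressing the sub-layer by its path.
    "hidden_new": {"class": "copy", "from": "rec_layer/hidden"},
}
```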
layer_class = 'get_rec_accumulated'[source]
classmethod get_out_data_from_opts(name, sources, sub_layer, **kwargs)[source]
Parameters:
  • name (str) –
  • sources (list[LayerBase]) –
  • sub_layer (str) –
Return type:

Data

Positional Encoding Layer

class TFNetworkRecLayer.PositionalEncodingLayer(add_to_input=False, constant=-1, offset=None, **kwargs)[source]

Provides positional encoding in the form of (batch, time, n_out) or (time, batch, n_out) where n_out is the number of channels, if it is run outside a RecLayer, and (batch, n_out) or (n_out, batch) if run inside a RecLayer, where it will depend on the current time frame.

Assumes one source input with a time dimension if outside a RecLayer. With add_to_input, it will calculate x + input, where x is the positional encoding, and the output shape is the same as the input.
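A small config sketch of the add_to_input case. The embed layer, its size and the "data" source are hypothetical and only serve as something to add the signal to.

```python
network = {
    # Hypothetical embedding of the input; name and size are placeholders.
    "embed": {"class": "linear", "from": ["data"], "activation": None, "n_out": 512},
    # Adds the positional signal to the embedding; the output shape equals the input shape.
    "embed_pos": {"class": "positional_encoding", "from": ["embed"], "add_to_input": True},
}
```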

The positional encoding is the same as in Tensor2Tensor. See TFUtil.get_positional_encoding().
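For orientation, the Tensor2Tensor-style sinusoidal signal can be sketched in plain Python as follows. This is not the code of TFUtil.get_positional_encoding(); the function name timing_signal and the default timescales (1.0 and 1e4, the Tensor2Tensor defaults) are assumptions made for this illustration.

```python
import math


def timing_signal(length, channels, min_timescale=1.0, max_timescale=1.0e4):
    """Return a nested list of shape (length, channels) with the sinusoidal signal."""
    num_timescales = channels // 2
    # Timescales are spaced geometrically between min_timescale and max_timescale.
    log_increment = math.log(max_timescale / min_timescale) / max(num_timescales - 1, 1)
    inv_timescales = [min_timescale * math.exp(-i * log_increment)
                      for i in range(num_timescales)]
    signal = []
    for pos in range(length):
        scaled = [pos * inv for inv in inv_timescales]
        # First half of the channels holds sines, second half cosines.
        row = [math.sin(s) for s in scaled] + [math.cos(s) for s in scaled]
        row += [0.0] * (channels % 2)  # zero-pad if channels is odd
        signal.append(row)
    return signal
```

Inside a RecLayer, only the row for the current time frame would be needed, which corresponds to the (batch, n_out) case described above.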

Parameters:
  • add_to_input (bool) – will add the signal to the input
  • constant (int) – if positive, always output the corresponding positional encoding.
  • offset (None|LayerBase) – Specify the offset to be added to positions. Expect shape (batch, time) or (batch,).
layer_class = 'positional_encoding'[source]
recurrent = True[source]
classmethod transform_config_dict(d, network, get_layer)[source]
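Parameters:
  • d (dict[str]) – will modify inplace
  • network (TFNetwork.TFNetwork) –
  • get_layer ((str) -> LayerBase) – function to get or construct another layer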
classmethod get_out_data_from_opts(name, network, add_to_input=False, sources=(), **kwargs)[source]
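Parameters:
  • name (str) –
  • network (TFNetwork.TFNetwork) –
  • add_to_input (bool) –
  • sources (list[LayerBase]) –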
Return type:

Data

Choice Layer

class TFNetworkRecLayer.ChoiceLayer(beam_size, keep_beams=False, search=<class 'Util.NotSpecified'>, input_type='prob', prob_scale=1.0, base_beam_score_scale=1.0, random_sample_scale=0.0, length_normalization=True, custom_score_combine=None, source_beam_sizes=None, scheduled_sampling=False, cheating=False, explicit_search_sources=None, **kwargs)[source]

This layer represents a choice to be made in search during inference, such as choosing the top-k outputs from a log-softmax for beam search. During training, this layer can return the true label. This is supposed to be used inside the rec layer. This can be extended in various ways.

We present the scores in +log space, and we will add them up along the path. Assume that we get input (batch,dim) from a (log-)softmax. Assume that each batch is already a choice via search. In search with a beam size of N, we would output sparse (batch=N,) and scores for each.
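The score accumulation for one sequence can be sketched in plain Python as below. This is a simplified illustration, not the layer's implementation: the function name beam_step is hypothetical, the input is assumed to be in log space already, and keep_beams, length normalization and random sampling are left out.

```python
import heapq


def beam_step(prev_scores, log_probs, beam_size,
              prob_scale=1.0, base_beam_score_scale=1.0):
    """One search step for a single sequence.

    prev_scores: accumulated +log scores, one per incoming beam.
    log_probs: per incoming beam, a list of log-probs over the labels.
    Returns up to beam_size tuples (score, source_beam, label), best first.
    """
    candidates = [
        (base_beam_score_scale * prev + prob_scale * lp, beam, label)
        for beam, (prev, row) in enumerate(zip(prev_scores, log_probs))
        for label, lp in enumerate(row)
    ]
    # Keep the beam_size candidates with the highest accumulated score.
    return heapq.nlargest(beam_size, candidates, key=lambda c: c[0])


# Two incoming beams, two labels, outgoing beam size 2.
best = beam_step([0.0, -1.0], [[-0.1, -2.5], [-0.3, -1.5]], beam_size=2)
```

Each selected entry records which incoming beam it came from, which is the information needed later to trace back the chosen path.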

Parameters:
  • beam_size (int) – the outgoing beam size. i.e. our output will be (batch * beam_size, …)
  • keep_beams (bool) – specifies that we keep the beam_in entries, i.e. we just expand, i.e. we just search on the dim. beam_size must be a multiple of beam_in.
  • search (NotSpecified|bool) – whether to perform search, or use the ground truth (target option). If not specified, it will depend on network.search_flag.
  • input_type (str) – “prob” or “log_prob”, depending on whether the input is in probability space or log space, or “regression” if it is a prediction of the data as-is. If there are several inputs, the same format is assumed for all.
  • prob_scale (float) – factor for prob (score in +log space from source)
  • base_beam_score_scale (float) – factor for beam base score (i.e. prev prob scores)
  • random_sample_scale (float) – if >0, will add Gumbel scores. you might want to set base_beam_score_scale=0
  • length_normalization (bool) – evaluates score_t/len in search
  • source_beam_sizes (list[int]|None) – If there are several sources, they are pruned with these beam sizes before combination. If None, ‘beam_size’ is used for all sources. Has to have same length as number of sources.
  • scheduled_sampling (dict|None) –
  • cheating (bool) – if True, will always add the true target in the beam
  • explicit_search_sources (list[LayerBase]|None) – will mark it as an additional dependency. You might use these also in custom_score_combine.
  • custom_score_combine (callable|None) –
layer_class = 'choice'[source]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace
  • network (TFNetwork.TFNetwork) –
  • get_layer ((str) -> LayerBase) – function to get or construct another layer
classmethod get_out_data_from_opts(name, sources, target, network, beam_size, search=<class 'Util.NotSpecified'>, scheduled_sampling=False, cheating=False, **kwargs)[source]
Parameters:
  • name (str) –
  • sources (list[LayerBase]) –
  • target (str) –
  • network (TFNetwork.TFNetwork) –
  • beam_size (int) –
  • search (NotSpecified|bool) –
  • scheduled_sampling (dict|bool) –
  • cheating (bool) –
Return type:

Data

get_sub_layer(self, layer_name)[source]

Used to get outputs in case of multiple targets. For all targets we create a sub-layer that can be referred to as “self.name + ‘/out_’ + index” (e.g. output/out_0). These sub-layers can then be used as input to other layers, e.g. “output_0”: {“class”: “copy”, “from”: [“output/out_0”]}.

Parameters:layer_name (str) – name of the sub_layer (e.g. ‘out_0’)
Returns:internal layer that outputs labels for the target corresponding to layer_name
Return type:InternalLayer
classmethod get_sub_layer_out_data_from_opts(layer_name, parent_layer_kwargs)[source]
Parameters:
  • layer_name (str) – name of the sub_layer (e.g. ‘out_0’), see self.get_sub_layer()
  • parent_layer_kwargs (dict[str]) – kwargs for the parent layer
Returns:

Data template, network and the class type of the sub-layer

Return type:

(Data, TFNetwork, type)|None

get_dep_layers(self)[source]
Return type:list[LayerBase]

Decision Layer

class TFNetworkRecLayer.DecideLayer(length_normalization=False, **kwargs)[source]

This is kind of the counterpart to the choice layer. This only has an effect in search mode. E.g. assume that the input is of shape (batch * beam, time, dim) and has search_sources set. Then this will output (batch, time, dim) where the beam with the highest score is selected. Thus, this will make a decision based on the scores. It will convert the data to batch-major mode.
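The interplay of choice and decide can be sketched with the decoder fragment below. Everything except the choice and decide options is an assumption for illustration: the layer names, the sizes, the target name "classes", and the surrounding "softmax" and "compare" layers (with label index 0 taken as end-of-sequence) are not described on this page.

```python
beam_size = 12  # illustrative

network = {
    "decoder": {
        "class": "rec",
        "from": [],
        "target": "classes",
        "max_seq_len": 100,
        "unit": {
            # Feed back the label chosen in the previous step.
            "embed": {"class": "linear", "from": ["prev:output"],
                      "activation": None, "n_out": 128},
            "hidden": {"class": "rnn_cell", "unit": "LSTMBlock",
                       "from": ["embed"], "n_out": 256},
            "output_prob": {"class": "softmax", "from": ["hidden"],
                            "target": "classes"},
            # In search, picks the top beam_size labels; in training, the true label.
            "output": {"class": "choice", "from": ["output_prob"],
                       "target": "classes", "beam_size": beam_size,
                       "input_type": "prob"},
            # Marks the end of a sequence (assumed end-of-sequence index 0).
            "end": {"class": "compare", "from": ["output"], "value": 0},
        },
    },
    # Reduces (batch * beam, ...) to (batch, ...) by taking the best-scoring beam.
    "decision": {"class": "decide", "from": ["decoder"]},
}
```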

Parameters:length_normalization (bool) – performed on the beam scores
layer_class = 'decide'[source]
classmethod cls_get_search_beam_size(network=None, **kwargs)[source]
Parameters:network (TFNetwork.TFNetwork) –
Return type:int|None
classmethod decide(src, output=None, owner=None, name=None, length_normalization=False)[source]
Parameters:
  • src (LayerBase) – with search_choices set. e.g. input of shape (batch * beam, time, dim)
  • output (Data|None) –
  • owner (LayerBase|None) –
  • name (str|None) –
  • length_normalization (bool) – performed on the beam scores
Returns:

best beam selected from input, e.g. shape (batch, time, dim)

Return type:

(Data, SearchChoices|None)

classmethod get_out_data_from_opts(name, sources, network, **kwargs)[source]
Parameters:
  • name (str) –
  • sources (list[LayerBase]) –
  • network (TFNetwork.TFNetwork) –
Return type:

Data