TFNetworkRecLayer

Defines multiple recurrent layers, most importantly RecLayer.

class TFNetworkRecLayer.RecLayer(unit='lstm', unit_opts=None, direction=None, input_projection=True, initial_state=None, max_seq_len=None, forward_weights_init=None, recurrent_weights_init=None, bias_init=None, optimize_move_layers_out=None, cheating=False, unroll=False, use_global_rec_step_offset=False, **kwargs)[source]

Recurrent layer, with support for several implementations of LSTMs (via the unit argument); see the TensorFlow LSTM benchmark (http://returnn.readthedocs.io/en/latest/tf_lstm_benchmark.html), and also GRU, or simple RNN. Via the unit parameter, you specify the operation/model performed in the recurrence. It can be a string specifying an RNN cell, where all TF cells can be used; the "Cell" suffix can be omitted, and case is ignored. Some possible LSTM implementations are (in all cases for both CPU and GPU):

  • BasicLSTM (the cell), via official TF, pure TF implementation
  • LSTMBlock (the cell), via tf.contrib.rnn.
  • LSTMBlockFused, via tf.contrib.rnn. Should be much faster than BasicLSTM.
  • CudnnLSTM, via tf.contrib.cudnn_rnn. This is still experimental.
  • NativeLSTM, our own native LSTM. Should be faster than LSTMBlockFused.
  • NativeLstm2, our improved native LSTM; should be the fastest and most powerful.

We default to the fastest one from our current tests, i.e. NativeLSTM. Note that the implementations are currently not compatible with each other, i.e. the parameters are represented differently.
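
For illustration, a minimal sketch of selecting an implementation via the unit string (the layer name and dimensions are placeholders):

"lstm": {"class": "rec", "unit": "nativelstm2", "from": "data", "n_out": 512}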

A subnetwork can also be given, which will be evaluated step by step, and which can use attention over some separate input; this can be used to implement a decoder in a sequence-to-sequence scenario. The subnetwork will get the extern data from the parent net as templates, and if there is input to the RecLayer, it will be available as the "source" data key in the subnetwork. The subnetwork is specified as a dict for the unit parameter. In the subnetwork, you can access outputs from layers of the previous time step by referring to them with the "prev:" prefix.

Example:

{
    "class": "rec",
    "from": ["input"],
    "unit": {
      # Recurrent subnet here, operate on a single time-step:
      "output": {
        "class": "linear",
        "from": ["prev:output", "data:source"],
        "activation": "relu",
        "n_out": n_out},
    },
    "n_out": n_out},
}

More examples can be seen in test_TFNetworkRecLayer and test_TFEngine.

The subnetwork can automatically optimize the inner recurrent loop by moving layers out of the loop if possible. It will try to do that greedily. This can be disabled via the option optimize_move_layers_out. It assumes that those layers behave the same whether they are applied to the whole time dimension at once or per step inside the loop. Examples of such layers are LinearLayer, RnnCellLayer or SelfAttentionLayer with the option attention_left_only.
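
As a hedged sketch of this optimization (layer names and dimensions are placeholders): "embed" below does not depend on any "prev:" layer, so it can be moved out of the loop and applied to the whole sequence at once, while "output" must stay inside:

{
    "class": "rec",
    "from": ["input"],
    "unit": {
      "embed": {"class": "linear", "from": "data:source",
                "activation": None, "n_out": 128},  # no "prev:" dependency, movable out
      "output": {"class": "linear", "from": ["prev:output", "embed"],
                 "activation": "relu", "n_out": n_out},  # truly recurrent
    },
    "n_out": n_out,
}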

This layer can also be inside another RecLayer. In that case, it behaves similarly to RnnCellLayer. (This support is still somewhat incomplete. It should work for the native units such as NativeLstm.)

Parameters:
  • unit (str|dict[str,dict[str]]) – the RNNCell/etc name, e.g. "nativelstm". See the list of implementations above. Alternatively, a whole subnetwork, which will be executed step by step, and which can include "prev" in addition to "from" to refer to previous steps.
  • unit_opts (None|dict[str]) – passed to RNNCell creation
  • direction (int|None) – None|1 -> forward, -1 -> backward (see the bidirectional sketch after this parameter list)
  • input_projection (bool) – True -> input is multiplied with a matrix. False only works if the input dim matches the cell dim
  • initial_state (LayerBase|str|float|int|tuple|None) –
  • max_seq_len (int|tf.Tensor|None) – if unit is a subnetwork; a str will be evaluated (see code)
  • forward_weights_init (str) – see TFUtil.get_initializer()
  • recurrent_weights_init (str) – see TFUtil.get_initializer()
  • bias_init (str) – see TFUtil.get_initializer()
  • optimize_move_layers_out (bool|None) – will automatically move layers out of the loop when possible
  • cheating (bool) – make targets available, and determine length by them
  • unroll (bool) – if possible, unroll the loop (implementation detail)
  • use_global_rec_step_offset (bool) –
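
For illustration, a hedged sketch of a bidirectional LSTM built from two rec layers with opposite direction (layer names and dimensions are placeholders):

"lstm_fwd": {"class": "rec", "unit": "nativelstm2", "direction": 1,
             "from": "data", "n_out": 256},
"lstm_bwd": {"class": "rec", "unit": "nativelstm2", "direction": -1,
             "from": "data", "n_out": 256},
"encoder": {"class": "copy", "from": ["lstm_fwd", "lstm_bwd"]},  # concatenation of both
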
layer_class = 'rec'[source]
recurrent = True[source]
get_dep_layers(self)[source]
Return type:list[LayerBase]
classmethod transform_config_dict(d, network, get_layer)[source]

This method transforms the templates in the config dictionary into references of the layer instances (and creates them in the process).

Parameters:
  • d (dict[str]) – will modify inplace
  • network (TFNetwork.TFNetwork) –
  • get_layer ((str) -> LayerBase) – function to get or construct another layer

classmethod get_out_data_from_opts(unit, sources=(), initial_state=None, **kwargs)[source]
Parameters:
  • unit (str|dict[str]) –
  • sources (list[LayerBase]) –
  • initial_state (str|LayerBase|list[str|LayerBase]) –
Return type:

Data

get_absolute_name_scope_prefix(self)[source]
Return type:str
classmethod get_rec_initial_extra_outputs(**kwargs)[source]
Return type:dict[str,tf.Tensor|tuple[tf.Tensor]]
classmethod get_rec_initial_output(**kwargs)[source]
Return type:tf.Tensor
classmethod get_rnn_cell_class(name)[source]
Parameters:name (str) – cell name, minus the “Cell” at the end
Return type:() -> rnn_cell.RNNCell|TFNativeOp.RecSeqCellOp
classmethod get_losses(name, network, output, loss=None, reduce_func=None, layer=None, **kwargs)[source]
Parameters:
  • name (str) – layer name
  • network (TFNetwork.TFNetwork) –
  • loss (Loss|None) – argument just as for __init__
  • output (Data) – the output (template) for the layer
  • reduce_func (((tf.Tensor)->tf.Tensor)|None) –
  • layer (LayerBase|None) –
  • kwargs – other layer kwargs
Return type:

list[TFNetwork.LossHolder]

get_constraints_value(self)[source]
Return type:tf.Tensor
static convert_cudnn_canonical_to_lstm_block(reader, prefix, target='lstm_block_wrapper/')[source]

This assumes CudnnLSTM currently, with num_layers=1, input_mode=”linear_input”, direction=’unidirectional’!

Parameters:
  • reader (tf.train.CheckpointReader) –
  • prefix (str) – e.g. “layer2/rec/”
  • target (str) – e.g. “lstm_block_wrapper/” or “rnn/lstm_cell/”
Returns:

dict key -> value, {“…/kernel”: …, “…/bias”: …} with prefix

Return type:

dict[str,numpy.ndarray]
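
For illustration, a hedged usage sketch (the checkpoint path is hypothetical; tf.train.NewCheckpointReader is the TF 1.x API that returns a CheckpointReader):

import tensorflow as tf
from TFNetworkRecLayer import RecLayer

# read an existing checkpoint that contains CudnnLSTM parameters
reader = tf.train.NewCheckpointReader("/path/to/model.ckpt")
# convert the Cudnn canonical params of layer "layer2/rec/" to LSTMBlock format
values = RecLayer.convert_cudnn_canonical_to_lstm_block(reader, "layer2/rec/")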

get_last_hidden_state(self, key)[source]
Parameters:key (str|int|None) –
Return type:tf.Tensor
classmethod is_prev_step_layer(layer)[source]
Parameters:layer (LayerBase) –
Return type:bool
get_sub_layer(self, layer_name)[source]
Parameters:layer_name (str) – name of the sub_layer (right part of ‘/’ separated path)
Returns:the sub_layer addressed in layer_name or None if no sub_layer exists
Return type:LayerBase|None
class TFNetworkRecLayer.RecStepInfoLayer(i, end_flag=None, end_flag_source=None, seq_lens=None, **kwargs)[source]

Used by _SubnetworkRecCell. Represents the current step number. Usually via TFNetwork.set_rec_step_info().

Parameters:
  • i (tf.Tensor) – scalar, int32, current step (time)
  • end_flag (tf.Tensor|None) – (batch,), bool, says that the current sequence has ended. It can include a beam; in that case, end_flag_source should be "prev:end" and define the search choices.
  • end_flag_source (LayerBase|None) –
  • seq_lens (tf.Tensor|None) – (batch,) int32, seq lens
layer_class = ':i'[source]
get_end_flag(self, target_search_choices)[source]
Parameters:target_search_choices (SearchChoices|None) –
Returns:(batch,) of type bool. batch might include beam size
Return type:tf.Tensor
class TFNetworkRecLayer.RnnCellLayer(n_out, unit, unit_opts=None, initial_state=None, initial_output=None, weights_init='xavier', **kwargs)[source]

Wrapper around tf.contrib.rnn.RNNCell. This operates a single step, i.e. there is no time dimension: we expect a (batch,n_in) input, and our output is (batch,n_out). This is expected to be used inside a RecLayer. (But it can also handle the case of being optimized out of the rec loop, i.e. outside a RecLayer, with a time dimension.)
Parameters:
  • n_out (int) – so far, only output shape (batch,n_out) supported
  • unit (str|tf.contrib.rnn.RNNCell) – e.g. “BasicLSTM” or “LSTMBlock”
  • unit_opts (dict[str]|None) – passed to the cell.__init__
  • initial_state (str|float|LayerBase|tuple[LayerBase]|dict[LayerBase]) – see self.get_rec_initial_state(). This will be set via transform_config_dict(). To get the state from another recurrent layer, use the GetLastHiddenStateLayer (get_last_hidden_state).
  • initial_output (None) – the initial output is defined implicitly via initial state, thus don’t set this
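
For illustration, a hedged sketch of a single LSTM step inside a RecLayer subnetwork (layer names and dimensions are placeholders):

# inside the "unit" dict of a RecLayer:
"s": {"class": "rnn_cell", "unit": "LSTMBlock", "from": "data:source", "n_out": 256},
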
layer_class = 'rnn_cell'[source]
recurrent = True[source]
classmethod get_out_data_from_opts(n_out, name, sources=(), **kwargs)[source]
Parameters:
  • n_out (int) –
  • name (str) – layer name
  • sources (list[LayerBase]) –
Return type:

Data

get_absolute_name_scope_prefix(self)[source]
Return type:str
get_dep_layers(self)[source]
Return type:list[LayerBase]
classmethod get_hidden_state_size(n_out, unit, unit_opts=None, **kwargs)[source]
Parameters:
  • n_out (int) –
  • unit (str) –
  • unit_opts (dict[str]|None) –
Returns:

size or tuple of sizes

Return type:

int|tuple[int]

classmethod get_output_from_state(state, unit)[source]
Parameters:
  • state (tuple[tf.Tensor]|tf.Tensor) –
  • unit (str) –
Return type:

tf.Tensor

get_hidden_state(self)[source]
Returns:state as defined by the cell
Return type:tuple[tf.Tensor]|tf.Tensor
classmethod get_state_by_key(state, key, shape=None)[source]
Parameters:
  • state (tf.Tensor|tuple[tf.Tensor]|namedtuple) –
  • key (int|str|None) –
  • shape (tuple[int|None]) – Shape of the state.
Return type:

tf.Tensor

get_last_hidden_state(self, key)[source]
Parameters:key (int|str|None) –
Return type:tf.Tensor
classmethod get_rec_initial_state(batch_dim, name, n_out, unit, initial_state=None, unit_opts=None, rec_layer=None, **kwargs)[source]

Very similar to get_rec_initial_output(). Initial hidden state when used inside a recurrent layer for the frame t=-1, if it is needed. As arguments, we get the usual layer arguments. batch_dim is added because it might be special because of beam search. Also see transform_config_dict() for initial_state.

Note: This could maybe share code with get_rec_initial_output(), although it is a bit more generic here because the state can also be a namedtuple or any kind of nested structure.

Parameters:
  • batch_dim (tf.Tensor) – including beam size in beam search
  • name (str) – layer name
  • n_out (int) – out dim
  • unit (str) – cell name
  • unit_opts (dict[str]|None) –
  • initial_state (LayerBase|str|int|float|None|list|tuple|namedtuple) – see code
  • rec_layer (RecLayer|LayerBase|None) – for the scope
Return type:

tf.Tensor|tuple[tf.Tensor]|namedtuple

classmethod get_rec_initial_state_inner(initial_shape, name, state_key='state', key=None, initial_state=None, shape_invariant=None, rec_layer=None)[source]

Generate initial hidden state. Primarily used as an inner function for RnnCellLayer.get_rec_initial_state().

Parameters:
  • initial_shape (tuple) – shape of the initial state.
  • name (str) – layer name.
  • state_key (str) – key to be used to get the state from final_rec_vars.
  • key (str|None) – key/attribute of the state if state is a dictionary/namedtuple (like ‘c’ and ‘h’ for LSTM states).
  • initial_state (LayerBase|str|int|float|None|list|tuple|namedtuple) – see code
  • shape_invariant (tuple) – If provided, directly used. Otherwise, guessed from initial_shape (see code below).
  • rec_layer (RecLayer|LayerBase|None) – For the scope.
Return type:

tf.Tensor

classmethod get_rec_initial_extra_outputs(**kwargs)[source]
Return type:dict[str,tf.Tensor|tuple[tf.Tensor]]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace
  • network (TFNetwork.TFNetwork) –
  • get_layer ((str) -> LayerBase) – function to get or construct another layer
static transform_initial_state(initial_state, network, get_layer)[source]
Parameters:
  • initial_state (str|float|int|list[str|float|int]|dict[str]|None) –
  • network (TFNetwork.TFNetwork) –
  • get_layer ((str) -> LayerBase) – function to get or construct another layer
classmethod get_rec_initial_output(unit, initial_output=None, initial_state=None, **kwargs)[source]
Parameters:
  • unit (str) –
  • initial_output (None) –
  • initial_state (LayerBase|str|int|float|None|list|tuple|namedtuple) –
Return type:

tf.Tensor

class TFNetworkRecLayer.GetLastHiddenStateLayer(n_out, combine='concat', key='*', **kwargs)[source]

Will combine (e.g. concat or add) the last hidden states from all sources.

Parameters:
  • n_out (int) – dimension. output will be of shape (batch, n_out)
  • combine (str) – “concat” or “add”
  • key (str|int|None) – for the state, which could be a namedtuple. see RnnCellLayer.get_state_by_key()
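
For illustration, a hedged sketch that concatenates the final LSTM cell states ("c", see RnnCellLayer.get_state_by_key()) of two encoder layers (layer names and dimensions are placeholders):

"enc_state": {"class": "get_last_hidden_state", "from": ["lstm_fwd", "lstm_bwd"],
              "combine": "concat", "key": "c", "n_out": 512},
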
layer_class = 'get_last_hidden_state'[source]
get_last_hidden_state(self, key)[source]
Parameters:key (str|int|None) –
Return type:tf.Tensor
classmethod get_out_data_from_opts(n_out, **kwargs)[source]
Parameters:n_out (int) –
Return type:Data
class TFNetworkRecLayer.GetRecAccumulatedOutputLayer(sub_layer, **kwargs)[source]

For RecLayer with a subnet. If some layer is explicitly marked as an additional output layer (via ‘is_output_layer’: True), you can get that subnet layer output via this accessor. Retrieves the accumulated output.

Note that this functionality is obsolete now. You can simply access such a sub-layer via the generic sub-layer access mechanism. I.e. instead of:

"sub_layer": {"class": "get_rec_accumulated", "from": "rec_layer", "sub_layer": "hidden"}

You can do:

"sub_layer": {"class": "copy", "from": "rec_layer/hidden"}
Parameters:sub_layer (str) – layer of subnet in RecLayer source, which has ‘is_output_layer’: True
layer_class = 'get_rec_accumulated'[source]
classmethod get_out_data_from_opts(name, sources, sub_layer, **kwargs)[source]
Parameters:
  • name (str) –
  • sources (list[LayerBase]) –
  • sub_layer (str) –
Return type:

Data

class TFNetworkRecLayer.ChoiceLayer(beam_size, search=<class 'Util.NotSpecified'>, input_type='prob', prob_scale=1.0, base_beam_score_scale=1.0, random_sample_scale=0.0, length_normalization=True, custom_score_combine=None, source_beam_sizes=None, scheduled_sampling=False, cheating=False, explicit_search_sources=None, **kwargs)[source]

This layer represents a choice to be made in search during inference, such as choosing the top-k outputs from a log-softmax for beam search. During training, this layer can return the true label. This is supposed to be used inside the rec layer. This can be extended in various ways.

We present the scores in +log space, and we will add them up along the path. Assume that we get input (batch,dim) from a (log-)softmax. Assume that each batch entry is already a choice via search. In search with a beam size of N, we would output sparse (batch=N,) and scores for each.

Parameters:
  • beam_size (int) – the outgoing beam size. i.e. our output will be (batch * beam_size, …)
  • search (NotSpecified|bool) – whether to perform search, or use the ground truth (target option). If not specified, it will depend on network.search_flag.
  • input_type (str) – "prob" or "log_prob", whether the input is in probability space or log-space; or "regression", if it is a prediction of the data as-is. If there are several inputs, the same format is assumed for all.
  • prob_scale (float) – factor for prob (score in +log space from source)
  • base_beam_score_scale (float) – factor for beam base score (i.e. prev prob scores)
  • random_sample_scale (float) – if >0, will add Gumbel scores. you might want to set base_beam_score_scale=0
  • length_normalization (bool) – evaluates score_t/len in search
  • source_beam_sizes (list[int]|None) – If there are several sources, they are pruned with these beam sizes before combination. If None, ‘beam_size’ is used for all sources. Has to have same length as number of sources.
  • scheduled_sampling (dict|None) –
  • cheating (bool) – if True, will always add the true target in the beam
  • explicit_search_sources (list[LayerBase]|None) – will mark it as an additional dependency. You might use these also in custom_score_combine.
  • custom_score_combine (callable|None) –
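
For illustration, a hedged sketch of a choice layer inside a rec decoder (the target name and beam size are placeholders):

# inside the "unit" dict of a RecLayer:
"output": {"class": "choice", "from": "output_prob", "target": "classes",
           "beam_size": 12, "input_type": "prob", "initial_output": 0},
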
layer_class = 'choice'[source]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace
  • network (TFNetwork.TFNetwork) –
  • get_layer ((str) -> LayerBase) – function to get or construct another layer
classmethod get_out_data_from_opts(name, sources, target, network, beam_size, search=<class 'Util.NotSpecified'>, scheduled_sampling=False, cheating=False, **kwargs)[source]
Parameters:
  • name (str) –
  • sources (list[LayerBase]) –
  • target (str) –
  • network (TFNetwork.TFNetwork) –
  • beam_size (int) –
  • search (NotSpecified|bool) –
  • scheduled_sampling (dict|bool) –
  • cheating (bool) –
Return type:

Data

get_sub_layer(self, layer_name)[source]

Used to get outputs in case of multiple targets. For each target, we create a sub-layer that can be referred to by self.name + "/out_" + index (e.g. "output/out_0"). These sub-layers can then be used as input to other layers, e.g. "output_0": {"class": "copy", "from": ["output/out_0"]}.

Parameters:layer_name (str) – name of the sub_layer (e.g. ‘out_0’)
Returns:internal layer that outputs labels for the target corresponding to layer_name
Return type:InternalLayer
classmethod get_sub_layer_out_data_from_opts(layer_name, parent_layer_kwargs)[source]
Parameters:
  • layer_name (str) – name of the sub_layer (e.g. ‘out_0’), see self.get_sub_layer()
  • parent_layer_kwargs (dict[str]) – kwargs for the parent layer, here we only need ‘network’ and ‘beam_size’
Returns:

Data template, network and the class type of the sub-layer

Return type:

(Data, TFNetwork, type)|None

classmethod get_rec_initial_extra_outputs(network, beam_size, **kwargs)[source]
Parameters:
  • network (TFNetwork.TFNetwork) –
  • beam_size (int) –
Return type:

dict[str,tf.Tensor]

classmethod get_rec_initial_extra_outputs_shape_invariants(**kwargs)[source]
Return type:dict[str,tf.TensorShape]
get_dep_layers(self)[source]
Return type:list[LayerBase]
class TFNetworkRecLayer.DecideLayer(length_normalization=False, **kwargs)[source]

This is kind of the counter-part to the choice layer. This only has an effect in search mode. E.g. assume that the input is of shape (batch * beam, time, dim) and has search_sources set. Then this will output (batch, time, dim), where the beam with the highest score is selected. Thus, this makes a decision based on the scores. It will convert the data to batch-major mode.

Parameters:length_normalization (bool) – performed on the beam scores
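
A minimal usage sketch; this is typically placed outside the rec loop and applied to the search output (layer names are placeholders):

"decision": {"class": "decide", "from": "output"},
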
layer_class = 'decide'[source]
classmethod decide(src, output=None, owner=None, name=None, length_normalization=False)[source]
Parameters:
  • src (LayerBase) – with search_choices set. e.g. input of shape (batch * beam, time, dim)
  • output (Data|None) –
  • owner (LayerBase|None) –
  • name (str|None) –
  • length_normalization (bool) – performed on the beam scores
Returns:

best beam selected from input, e.g. shape (batch, time, dim)

Return type:

(Data, SearchChoices)

classmethod get_out_data_from_opts(name, sources, network, **kwargs)[source]
Parameters:
  • name (str) –
  • sources (list[LayerBase]) –
  • network (TFNetwork.TFNetwork) –
Return type:

Data

class TFNetworkRecLayer.AttentionBaseLayer(base, **kwargs)[source]

This is the base class for attention. This layer would get constructed in the context of one single decoder step. We get the whole encoder output over all encoder frames (the base), e.g. (batch,enc_time,enc_dim), and some current decoder context, e.g. (batch,dec_att_dim), and we are supposed to return the attention output, e.g. (batch,att_dim).

Some sources:
  • Bahdanau & Bengio (Montreal), Neural Machine Translation by Jointly Learning to Align and Translate, 2015, https://arxiv.org/abs/1409.0473

Parameters:base (LayerBase) – encoder output to attend on
get_dep_layers(self)[source]
Return type:list[LayerBase]
get_base_weights(self)[source]

We can formulate most attentions as some weighted sum over the base time-axis.

Returns:the weighting of shape (batch, base_time), in case it is defined
Return type:tf.Tensor|None
get_base_weight_last_frame(self)[source]

From the base weights (see self.get_base_weights(), must return not None) takes the weighting of the last frame in the time-axis (according to sequence lengths).

Returns:shape (batch,) -> float (number 0..1)
Return type:tf.Tensor
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace
  • network (TFNetwork.TFNetwork) –
  • get_layer ((str) -> LayerBase) – function to get or construct another layer
classmethod get_out_data_from_opts(name, base, n_out=<class 'Util.NotSpecified'>, **kwargs)[source]
Parameters:
  • name (str) –
  • n_out (int|None|NotSpecified) –
  • base (LayerBase) –
Return type:

Data

class TFNetworkRecLayer.GlobalAttentionContextBaseLayer(base_ctx, **kwargs)[source]

Base class for other attention types, which use a global context.

Parameters:
  • base (LayerBase) – encoder output to attend on
  • base_ctx (LayerBase) – encoder output used to calculate the attention weights
get_dep_layers(self)[source]
Return type:list[LayerBase]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace
  • network (TFNetwork.TFNetwork) –
  • get_layer ((str) -> LayerBase) – function to get or construct another layer
class TFNetworkRecLayer.GenericAttentionLayer(weights, auto_squeeze=True, **kwargs)[source]

The weighting for the base is specified explicitly here. This can e.g. be used together with SoftmaxOverSpatialLayer. Note that we do not do any masking here. E.g. SoftmaxOverSpatialLayer does that.

Note that DotLayer is similar, just using a different terminology. Reduce axis: weights: time-axis; base: time-axis.

Note that if the last layer was SoftmaxOverSpatialLayer, we should use the same time-axis. Also we should do a check whether these time axes really match.

Common axes (should match): batch-axis, all from base excluding base feature axis and excluding time axis. Keep axes: base: feature axis; weights: all remaining, e.g. extra time.

Parameters:
  • base (LayerBase) – encoder output to attend on. (B, enc-time)|(enc-time, B) + (…) + (n_out,)
  • weights (LayerBase) – attention weights. ((B, enc-time)|(enc-time, B)) + (1,)|()
  • auto_squeeze (bool) – auto-squeeze any weight-axes with dim=1 away
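
For illustration, a hedged sketch of the common pattern together with SoftmaxOverSpatialLayer inside a rec unit; "base:enc_ctx", "s_transformed" and "base:encoder" are placeholder names for the encoder context, the transformed decoder state, and the encoder output:

# inside the "unit" dict of a RecLayer:
"energy_in": {"class": "combine", "kind": "add",
              "from": ["base:enc_ctx", "s_transformed"]},  # (B, enc-T, D)
"energy_tanh": {"class": "activation", "activation": "tanh", "from": "energy_in"},
"energy": {"class": "linear", "activation": None, "with_bias": False,
           "from": "energy_tanh", "n_out": 1},
"att_weights": {"class": "softmax_over_spatial", "from": "energy"},  # (B, enc-T, 1)
"att": {"class": "generic_attention", "weights": "att_weights", "base": "base:encoder"},
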
layer_class = 'generic_attention'[source]
get_dep_layers(self)[source]
Return type:list[LayerBase]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace
  • network (TFNetwork.TFNetwork) –
  • get_layer ((str) -> LayerBase) – function to get or construct another layer
classmethod get_out_data_from_opts(base, weights, auto_squeeze=True, sources=(), **kwargs)[source]
Parameters:
  • base (LayerBase) –
  • weights (LayerBase) –
  • auto_squeeze (bool) –
  • sources (list[LayerBase]) – ignored, should be empty (checked in __init__)
Return type:

Data

class TFNetworkRecLayer.DotAttentionLayer(energy_factor=None, **kwargs)[source]

Classic global attention: Dot-product as similarity measure between base_ctx and source.

Parameters:
  • base (LayerBase) – encoder output to attend on. defines output-dim
  • base_ctx (LayerBase) – encoder output used to calculate the attention weights, combined with the input data; its dim must be equal to the input data dim
  • energy_factor (float|None) – the energy will be scaled by this factor. This is like a temperature for the softmax. In Attention-is-all-you-need, this is set to 1/sqrt(base_ctx.dim).
layer_class = 'dot_attention'[source]
class TFNetworkRecLayer.ConcatAttentionLayer(**kwargs)[source]

Additive attention / tanh-concat attention as similarity measure between base_ctx and source. This is used by Montreal, whereas Stanford compared this to the dot-attention. The concat-attention is maybe more standard for machine translation at the moment.

layer_class = 'concat_attention'[source]
class TFNetworkRecLayer.GaussWindowAttentionLayer(window_size, std=1.0, inner_size=None, inner_size_step=0.5, **kwargs)[source]

Interprets the incoming source as the location (float32, shape (batch,)) and returns a Gauss-window weighting of the base around the location. The window size is fixed (TODO: but the variance can optionally be dynamic).

Parameters:
  • window_size (int) – the window size where the Gaussian window will be applied on the base
  • std (float) – standard deviation for Gauss
  • inner_size (int|None) – if given, the output will have an additional dimension of this size, where t is shifted around by multiples of inner_size_step; e.g. with inner_size=5 and inner_size_step=0.5, the locations would be [t-1, t-0.5, t, t+0.5, t+1].
  • inner_size_step (float) – see inner_size above
layer_class = 'gauss_window_attention'[source]
classmethod get_out_data_from_opts(inner_size=None, **kwargs)[source]
Parameters:inner_size (int|None) –
Return type:Data
class TFNetworkRecLayer.SelfAttentionLayer(num_heads, total_key_dim, key_shift=None, forward_weights_init='glorot_uniform', attention_dropout=0.0, attention_left_only=False, initial_state=None, restrict_state_to_last_seq=False, **kwargs)[source]

Applies self-attention on the input. I.e., with input x, it will basically calculate

att(Q x, K x, V x),

where att is multi-head dot-attention for now, and Q, K, V are matrices. The attention will be over the time dimension. If there is no time dimension, we expect to be inside a RecLayer; also, this is only valid with attention_left_only=True.

See also dot_product_attention here:
https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/layers/common_attention.py
Parameters:
  • num_heads (int) –
  • total_key_dim (int) – i.e. key_dim == total_key_dim // num_heads
  • key_shift (LayerBase|None) – additive term to the key. can be used for relative positional encoding. Should be of shape (num_queries,num_keys,key_dim), currently without batch-dimension. I.e. that should be shape (1,t,key_dim) inside rec-layer or (T,T,key_dim) outside.
  • forward_weights_init (str) – see TFUtil.get_initializer()
  • attention_dropout (float) –
  • attention_left_only (bool) – will mask out the future. see Attention is all you need.
  • initial_state (str|float|int|None) – see RnnCellLayer.get_rec_initial_state_inner().
  • restrict_state_to_last_seq (bool) – see code comment below
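
For illustration, a hedged sketch of a masked (left-only) self-attention layer as used in a Transformer decoder (dimensions are placeholders):

"self_att": {"class": "self_attention", "from": "data",
             "num_heads": 8, "total_key_dim": 512, "n_out": 512,
             "attention_left_only": True, "attention_dropout": 0.1},
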
layer_class = 'self_attention'[source]
recurrent = True[source]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace
  • network (TFNetwork.TFNetwork) –
  • get_layer ((str) -> LayerBase) – function to get or construct another layer
classmethod get_out_data_from_opts(n_out, name, sources, **kwargs)[source]
Parameters:
  • n_out (int) –
  • name (str) –
  • sources (list[LayerBase]) –
Return type:

Data

classmethod get_rec_initial_extra_outputs(batch_dim, rec_layer, num_heads, total_key_dim, n_out, name, initial_state=None, sources=(), **kwargs)[source]
Parameters:
  • batch_dim (tf.Tensor) –
  • rec_layer (RecLayer|LayerBase) –
  • num_heads (int) –
  • total_key_dim (int) –
  • n_out (int) –
  • name (str) –
  • initial_state (str|float|int|None) –
  • sources (list[LayerBase]) –
Return type:

dict[str, tf.Tensor]

classmethod get_rec_initial_extra_outputs_shape_invariants(num_heads, total_key_dim, n_out, sources, **kwargs)[source]
Parameters:
  • num_heads (int) –
  • total_key_dim (int) –
  • n_out (int) –
  • sources (list[LayerBase]) –
Return type:

dict[str, tf.TensorShape]

post_process_final_rec_vars_outputs(self, rec_vars_outputs, seq_len)[source]
Parameters:
  • rec_vars_outputs (dict[str,tf.Tensor]) –
  • seq_len (tf.Tensor) – shape (batch,)
Return type:

dict[str,tf.Tensor]

class TFNetworkRecLayer.PositionalEncodingLayer(add_to_input=False, constant=-1, **kwargs)[source]

Provides positional encoding in the form of (batch, time, n_out), where n_out is the number of channels, if it is run outside a RecLayer, or (batch, n_out) if run inside a RecLayer, where it will depend on the current time frame.

Assumes one source input with a time dimension if outside a RecLayer. By default (no "from" key provided), it would use either "data" or ":i". With add_to_input, it will add the positional encoding to the input, i.e. calculate input + encoding.

The positional encoding is the same as in Tensor2Tensor. See TFUtil.get_positional_encoding().

Parameters:
  • add_to_input (bool) – will add the signal to the input
  • constant (int) – if positive, always output the corresponding positional encoding.
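
For illustration, a minimal sketch that adds the positional encoding to an embedding (layer names are placeholders):

"embed_pos": {"class": "positional_encoding", "from": "embed", "add_to_input": True},
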
layer_class = 'positional_encoding'[source]
recurrent = True[source]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace
  • network (TFNetwork.TFNetwork) –
  • get_layer ((str) -> LayerBase) – function to get or construct another layer
classmethod get_out_data_from_opts(name, network, add_to_input=False, sources=(), **kwargs)[source]
Parameters:
  • name (str) –
  • network (TFNetwork.TFNetwork) –
  • add_to_input (bool) –
  • sources (list[LayerBase]) –
Return type:

Data

class TFNetworkRecLayer.KenLmStateLayer(lm_file, vocab_file=None, vocab_unknown_label='UNK', bpe_merge_symbol=None, input_step_offset=0, dense_output=False, debug=False, **kwargs)[source]

Gets the next word (or subword) each frame, accumulates the string, keeps the state of the string seen so far, and returns the score (in +log space, natural base e) of the sequence, using KenLM (http://kheafield.com/code/kenlm/) (see TFKenLM). The EOS (</s>) token must be used explicitly.

Parameters:
  • lm_file (str|()->str) – ARPA file or so. whatever KenLM supports
  • vocab_file (str|None) – if the inputs are symbols, this must be provided. see Vocabulary
  • vocab_unknown_label (str) – for the vocabulary
  • bpe_merge_symbol (str|None) – e.g. “@@” if you want to apply BPE merging
  • input_step_offset (int) – if provided, will consider the input only from this step onwards
  • dense_output (bool) – whether we output the score for all possible succeeding tokens
  • debug (bool) – prints debug info
layer_class = 'kenlm'[source]
recurrent = True[source]
classmethod get_out_data_from_opts(name, sources, vocab_file=None, vocab_unknown_label='UNK', dense_output=False, **kwargs)[source]
Parameters:
  • name (str) –
  • sources (list[LayerBase]) –
  • vocab_file (str|None) –
  • vocab_unknown_label (str) –
  • dense_output (bool) –
Return type:

Data

classmethod get_rec_initial_extra_outputs(batch_dim, rec_layer, sources=(), **kwargs)[source]
Parameters:
  • batch_dim (tf.Tensor) –
  • rec_layer (RecLayer|LayerBase) –
  • sources (list[LayerBase]) –
Return type:

dict[str,tf.Tensor]

class TFNetworkRecLayer.EditDistanceTableLayer(debug=False, **kwargs)[source]

Given a source and a target, calculates the edit distance table between them. Source can be inside a recurrent loop. It uses TFNativeOp.next_edit_distance_row().

Usually, if you are inside a rec layer, and “output” is the ChoiceLayer, you would use “from”: “output” and “target”: “layer:base:data:target” (make sure it has the time dimension).

See also OptimalCompletionsLayer.

Parameters:debug (bool) –
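
For illustration, a hedged sketch inside a rec unit, following the usage described above ("output" is the ChoiceLayer):

"edit_dist_table": {"class": "edit_distance_table", "from": "output",
                    "target": "layer:base:data:target"},
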
layer_class = 'edit_distance_table'[source]
recurrent = True[source]
classmethod get_rec_initial_extra_outputs(batch_dim, rec_layer, sources, name, target, network, **kwargs)[source]
Parameters:
  • batch_dim (tf.Tensor) –
  • rec_layer (RecLayer|LayerBase) –
  • sources (list[LayerBase]) –
  • name (str) –
  • target (str) –
  • network (TFNetwork.TFNetwork) –
Return type:

dict[str,tf.Tensor]

classmethod get_rec_initial_output(**kwargs)[source]
Return type:tf.Tensor
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace
  • network (TFNetwork.TFNetwork) –
  • get_layer ((str) -> LayerBase) – function to get or construct another layer
classmethod get_out_data_from_opts(name, sources, target, network, _target_layers=None, **kwargs)[source]
Parameters:
  • name (str) –
  • sources (list[LayerBase]) –
  • target (str) –
  • network (TFNetwork.TFNetwork) –
  • _target_layers (dict[str,LayerBase]|None) –
Return type:

Data

class TFNetworkRecLayer.OptimalCompletionsLayer(debug=False, **kwargs)[source]

We expect to get the inputs from EditDistanceTableLayer, esp. from the previous frame, like this:

"opt_completions": {"class": "optimal_completions", "from": "prev:edit_dist_table"}

You can then also define this further layer:

"opt_completion_soft_targets": {
  "class": "eval", "eval": "tf.nn.softmax(tf.cast(source(0), tf.float32))",
  "from": "opt_completions", "out_type": {"dtype": "float32"}}

and use that as the CrossEntropyLoss soft targets for the input of the "output" ChoiceLayer, e.g. "output_prob". This makes most sense when you enable beam search (even, or especially, during training). Note that you probably want to have this all before the last choice, where you still have more beams open.

Parameters:debug (bool) –
layer_class = 'optimal_completions'[source]
recurrent = True[source]
classmethod get_out_data_from_opts(name, sources, target, network, _target_layers=None, **kwargs)[source]
Parameters:
  • name (str) –
  • sources (list[LayerBase]) –
  • target (str) –
  • network (TFNetwork.TFNetwork) –
  • _target_layers (dict[str,LayerBase]|None) –
Return type:

Data

class TFNetworkRecLayer.BaseRNNCell(trainable=True, name=None, dtype=None, **kwargs)[source]

Extends rnn_cell.RNNCell by having explicit static attributes describing some properties.

get_input_transformed(self, x, batch_dim=None)[source]

Usually the cell itself does the transformation on the input. However, it would be faster to do it outside the recurrent loop. This function will get called outside the loop.

Parameters:
  • x (tf.Tensor) – (time, batch, dim), or (batch, dim)
  • batch_dim (tf.Tensor|None) –
Returns:

like x, maybe other feature-dim

Return type:

tf.Tensor|tuple[tf.Tensor]

class TFNetworkRecLayer.RHNCell(num_units, is_training=None, depth=5, dropout=0.0, dropout_seed=None, transform_bias=None, batch_size=None)[source]

Recurrent Highway Layer. With optional dropout for the recurrent state (fixed over all frames; some call this variational dropout).

References:
  • https://github.com/julian121266/RecurrentHighwayNetworks/
  • https://arxiv.org/abs/1607.03474
Parameters:
  • num_units (int) –
  • is_training (bool|tf.Tensor|None) –
  • depth (int) –
  • dropout (float) –
  • dropout_seed (int) –
  • transform_bias (float|None) –
  • batch_size (int|tf.Tensor|None) –
output_size[source]
Return type:int
state_size[source]
Return type:int
get_input_transformed(self, x, batch_dim=None)[source]
Parameters:
  • x (tf.Tensor) – (time, batch, dim)
  • batch_dim (tf.Tensor|None) –
Returns:

(time, batch, num_units * 2)

Return type:

tf.Tensor

call(self, inputs, state)[source]
Parameters:
  • inputs (tf.Tensor) –
  • state (tf.Tensor) –
Returns:

(output, state)

Return type:

(tf.Tensor, tf.Tensor)

class TFNetworkRecLayer.BlocksparseLSTMCell(*args, **kwargs)[source]

Standard LSTM but uses OpenAI blocksparse kernels to support bigger matrices.

Refs:
  • https://github.com/openai/blocksparse

It uses our own wrapper, see TFNativeOp.init_blocksparse().

call(self, *args, **kwargs)[source]
Parameters:
  • args – passed to super
  • kwargs – passed to super
Return type:

tf.Tensor|tuple[tf.Tensor]

load_params_from_native_lstm(self, values_dict, session)[source]
Parameters:
  • session (tf.Session) –
  • values_dict (dict[str,numpy.ndarray]) –
class TFNetworkRecLayer.BlocksparseMultiplicativeMultistepLSTMCell(*args, **kwargs)[source]

Multiplicative LSTM with multiple steps, as in the OpenAI blocksparse paper. Uses OpenAI blocksparse kernels to support bigger matrices.

Refs:
  • https://github.com/openai/blocksparse

call(self, *args, **kwargs)[source]
Return type:tf.Tensor
class TFNetworkRecLayer.LayerNormVariantsLSTMCell(num_units, norm_gain=1.0, norm_shift=0.0, forget_bias=0.0, activation=<function tanh>, is_training=None, dropout=0.0, dropout_h=0.0, dropout_seed=None, with_concat=False, global_norm=True, global_norm_joined=False, per_gate_norm=False, cell_norm=True, cell_norm_in_output=True, hidden_norm=False, variance_epsilon=1e-12)[source]

LSTM unit with layer normalization and recurrent dropout.

This LSTM cell can apply different variants of layer normalization:

1. Layer normalization as in the original paper (ref: https://arxiv.org/abs/1607.06450). This can be applied by having all default params (global_norm=True, cell_norm=True, cell_norm_in_output=True).

2. Layer normalization for RNMT+ (ref: https://arxiv.org/abs/1804.09849). This can be applied by having all default params except:
  • global_norm = False
  • per_gate_norm = True
  • cell_norm_in_output = False

3. The official TF LayerNormBasicLSTMCell (ref: https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/LayerNormBasicLSTMCell). This can be reproduced by having all default params except:
  • global_norm = False
  • per_gate_norm = True

4. The Sockeye LSTM layer normalization implementations (ref: https://github.com/awslabs/sockeye/blob/master/sockeye/rnn.py):
  • LayerNormLSTMCell can be reproduced by having all default params except with_concat = False (just efficiency, no difference in the model).
  • LayerNormPerGateLSTMCell can be reproduced by having all default params except (with_concat = False,) global_norm = False and per_gate_norm = True.

Recurrent dropout is based on https://arxiv.org/abs/1603.05118.

Prohibited LN combinations:
  • global_norm and global_norm_joined both enabled
  • per_gate_norm with global_norm or global_norm_joined
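
For illustration, a hedged sketch of selecting variant 2 (RNMT+) via unit_opts in a rec layer; the unit string here assumes the cell resolves via the cell-name rule described for RecLayer above (case-insensitive, "Cell" suffix omitted):

"lstm": {"class": "rec", "unit": "LayerNormVariantsLSTM",
         "unit_opts": {"global_norm": False, "per_gate_norm": True,
                       "cell_norm_in_output": False},
         "from": "data", "n_out": 512},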

Parameters:
  • num_units (int) – number of lstm units
  • norm_gain (float) – layer normalization gain value
  • norm_shift (float) – layer normalization shift (bias) value
  • forget_bias (float) – the bias added to forget gates
  • activation – Activation function to be applied in the lstm cell
  • is_training (bool) – if True then we are in the training phase
  • dropout (float) – dropout rate, applied on cell-in (j)
  • dropout_h (float) – dropout rate, applied on hidden state (h) when it enters the LSTM (variational dropout)
  • dropout_seed (int) – used to create random seeds
  • with_concat (bool) – if True then the input and prev hidden state is concatenated for the computation. this is just about computation performance.
  • global_norm (bool) – if True then layer normalization is applied for the forward and recurrent outputs (separately).
  • global_norm_joined (bool) – if True, then layer norm is applied on LSTM in (forward and recurrent output together)
  • per_gate_norm (bool) – if True then layer normalization is applied per lstm gate
  • cell_norm (bool) – if True then layer normalization is applied to the LSTM new cell output
  • cell_norm_in_output (bool) – if True, the normalized cell is also used in the output
  • hidden_norm (bool) – if True then layer normalization is applied to the LSTM new hidden state output
output_size[source]
Return type:int
state_size[source]
Return type:rnn_cell.LSTMStateTuple
get_input_transformed(self, inputs, batch_dim=None)[source]
Parameters:
  • inputs (tf.Tensor) –
  • batch_dim (tf.Tensor|None) –
Return type:

tf.Tensor

class TFNetworkRecLayer.TwoDLSTMLayer(pooling='last', unit_opts=None, forward_weights_init=None, recurrent_weights_init=None, bias_init=None, **kwargs)[source]

2D LSTM.

Currently only from left-to-right in the time axis. Can be inside a recurrent loop, or outside.

Parameters:
  • pooling (str) – e.g. "last" (the default)
  • unit_opts (None|dict[str]) – passed to RNNCell creation
  • forward_weights_init (str) – see TFUtil.get_initializer()
  • recurrent_weights_init (str) – see TFUtil.get_initializer()
  • bias_init (str) – see TFUtil.get_initializer()
layer_class = 'twod_lstm'[source]
recurrent = True[source]
classmethod get_out_data_from_opts(sources, n_out, name, **kwargs)[source]
Parameters:
  • sources (list[LayerBase]) –
  • n_out (int) –
  • name (str) –
Return type:

Data

get_constraints_value(self)[source]
Return type:tf.Tensor
classmethod helper_extra_outputs(batch_dim, src_length, features)[source]
Parameters:
  • batch_dim (tf.Tensor) –
  • src_length (tf.Tensor) –
  • features (tf.Tensor|int) –
Return type:

dict[str,tf.Tensor]

classmethod get_rec_initial_extra_outputs(batch_dim, n_out, sources, **kwargs)[source]
Parameters:
  • batch_dim (tf.Tensor) –
  • n_out (int) –
  • sources (list[LayerBase]) –
Return type:

dict[str,tf.Tensor]

classmethod get_rec_initial_extra_outputs_shape_invariants(n_out, sources, **kwargs)[source]
Returns:optional shapes for the tensors returned by get_rec_initial_extra_outputs
Return type:dict[str,tf.TensorShape]
class TFNetworkRecLayer.ZoneoutLSTMCell(num_units, zoneout_factor_cell=0.0, zoneout_factor_output=0.0)[source]

Wrapper for the TF LSTM to create a Zoneout LSTM cell. This code is an adapted version of Rayhane Mama's Tacotron-2 implementation.

Refs:
  • https://arxiv.org/abs/1606.01305 (Zoneout)
  • https://github.com/Rayhane-mamah/Tacotron-2

Initializer with the possibility to set different zoneout values for cell/hidden states.

Parameters:
  • num_units (int) – number of hidden units
  • zoneout_factor_cell (float) – cell zoneout factor
  • zoneout_factor_output (float) – output zoneout factor
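
For illustration, a hedged sketch of using this cell via a rec layer, under the same cell-name lookup assumption (the zoneout factors are placeholders):

"lstm": {"class": "rec", "unit": "ZoneoutLSTM",
         "unit_opts": {"zoneout_factor_cell": 0.1, "zoneout_factor_output": 0.1},
         "from": "data", "n_out": 512},
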
state_size[source]
Return type:int
output_size[source]
Return type:int