# TFNetworkRecLayer¶

class TFNetworkRecLayer.RecLayer(unit='lstm', unit_opts=None, direction=None, input_projection=True, initial_state=None, max_seq_len=None, forward_weights_init=None, recurrent_weights_init=None, bias_init=None, optimize_move_layers_out=None, cheating=False, unroll=False, **kwargs)[source]

Recurrent layer, with support for several LSTM implementations (via the unit argument; see the TensorFlow LSTM benchmark, http://returnn.readthedocs.io/en/latest/tf_lstm_benchmark.html), and also GRU and simple RNN. Via the unit parameter, you specify the operation/model performed in the recurrence. It can be a string naming an RNN cell; all TF cells can be used, the “Cell” suffix can be omitted, and case is ignored. Some possible LSTM implementations are (in all cases for both CPU and GPU):

• BasicLSTM (the cell), via official TF, pure TF implementation
• LSTMBlock (the cell), via tf.contrib.rnn.
• LSTMBlockFused, via tf.contrib.rnn; should be much faster than BasicLSTM.
• CudnnLSTM, via tf.contrib.cudnn_rnn. This is still experimental.
• NativeLSTM, our own native LSTM; should be faster than LSTMBlockFused.
• NativeLstm2, our improved native LSTM; should be the fastest and most powerful.

We default to the fastest tested one, i.e. NativeLSTM. Note that the implementations are currently not compatible with each other, i.e. they represent their parameters differently.
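As an illustration, a minimal config using one of these unit strings might look as follows (a sketch only; the layer names, n_out values, and the surrounding config are made up for this example):

```python
# Hypothetical network config fragment: a bidirectional pair of
# NativeLstm2 rec layers reading the same input, combined by a softmax.
# All names and dimensions here are illustrative, not prescribed.
network = {
    "lstm_fwd": {"class": "rec", "unit": "nativelstm2",
                 "direction": 1, "from": ["data"], "n_out": 512},
    "lstm_bwd": {"class": "rec", "unit": "nativelstm2",
                 "direction": -1, "from": ["data"], "n_out": 512},
    "output": {"class": "softmax", "from": ["lstm_fwd", "lstm_bwd"],
               "n_out": 1000},
}
```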

A subnetwork can also be given, which will be evaluated step by step. It can use attention over some separate input, which makes it possible to implement a decoder in a sequence-to-sequence scenario. The subnetwork gets the extern data from the parent net as templates, and if there is input to the RecLayer, it is available under the “source” data key in the subnetwork. The subnetwork is specified as a dict for the unit parameter. Inside the subnetwork, you can access the output of a layer from the previous time step by referring to it with the “prev:” prefix.

Example:

{
    "class": "rec",
    "from": ["input"],
    "unit": {
        # Recurrent subnet here, operating on a single time step:
        "output": {
            "class": "linear",
            "from": ["prev:output", "data:source"],
            "activation": "relu",
            "n_out": n_out,
        },
    },
    "n_out": n_out,
}


More examples can be seen in test_TFNetworkRecLayer and test_TFEngine.

The subnetwork can automatically optimize the inner recurrent loop by moving layers out of the loop if possible; it will try to do that greedily. This can be disabled via the option optimize_move_layers_out. It assumes that such a layer behaves the same whether it is applied to the whole time dimension at once or per step without a time dimension. Examples of such layers are LinearLayer, RnnCellLayer, or SelfAttentionLayer with the option attention_left_only.

Parameters:
• unit (str|dict[str,dict[str]]) – the RNNCell/etc name, e.g. “nativelstm”; see the comment above. Alternatively a whole subnetwork, which will be executed step by step, and which can use “prev:” in addition to “from” to refer to previous steps.
• unit_opts (None|dict[str]) – passed to RNNCell creation
• direction (int|None) – None|1 -> forward, -1 -> backward
• input_projection (bool) – True -> input is multiplied with a matrix. False only works if the input dim already matches.
• initial_state (LayerBase|str|float|int|tuple|None) –
• max_seq_len (int|tf.Tensor|None) – if unit is a subnetwork; a str will be evaluated, see the code
• forward_weights_init (str) – see TFUtil.get_initializer()
• recurrent_weights_init (str) – see TFUtil.get_initializer()
• bias_init (str) – see TFUtil.get_initializer()
• optimize_move_layers_out (bool|None) – will automatically move layers out of the loop when possible
• cheating (bool) – make targets available, and determine length by them
• unroll (bool) – if possible, unroll the loop (implementation detail)
layer_class = 'rec'[source]
recurrent = True[source]
get_dep_layers()[source]
Returns: list of layers this layer depends on; normally this is just self.sources, but e.g. the attention layer in addition has a base, etc.
Return type: list[LayerBase]
classmethod transform_config_dict(d, network, get_layer)[source]

This method transforms the templates in the config dictionary into references to the layer instances (and creates them in the process).

Parameters:
• d (dict[str]) – will be modified in place
• network (TFNetwork.TFNetwork) –
• get_layer ((str) -> LayerBase) – function to get or construct another layer

classmethod get_out_data_from_opts(unit, sources=(), initial_state=None, **kwargs)[source]

Gets a Data template (i.e. shape etc is set but not the placeholder) for our __init__ args. The purpose of having this as a separate classmethod is to be able to infer the shape information without having to construct the layer. This function should not create any nodes in the computation graph.

Parameters: kwargs – all the same kwargs as for self.__init__()
Returns: Data template (placeholder not set)
Return type: Data
get_absolute_name_scope_prefix()[source]
Returns: e.g. “output/”, always with “/” at the end
Return type: str
classmethod get_rnn_cell_class(name)[source]
Parameters: name (str) – cell name, minus the “Cell” at the end
Return type: () -> rnn_cell.RNNCell|TFNativeOp.RecSeqCellOp
classmethod get_losses(name, network, output, loss=None, reduce_func=None, layer=None, **kwargs)[source]
Parameters:
• name (str) – layer name
• network (TFNetwork.TFNetwork) –
• loss (Loss|None) – argument just as for __init__
• output (Data) – the output (template) for the layer
• reduce_func (((tf.Tensor)->tf.Tensor)|None) –
• layer (LayerBase|None) –
• kwargs – other layer kwargs
Return type: list[TFNetwork.LossHolder]
get_constraints_value()[source]
Returns: None or scalar
Return type: tf.Tensor|None
static convert_cudnn_canonical_to_lstm_block(reader, prefix, target='lstm_block_wrapper/')[source]

This assumes CudnnLSTM currently, with num_layers=1, input_mode=”linear_input”, direction=’unidirectional’!

Parameters:
• reader (tf.train.CheckpointReader) –
• prefix (str) – e.g. “layer2/rec/”
• target (str) – e.g. “lstm_block_wrapper/” or “rnn/lstm_cell/”
Returns: dict key -> value, {“…/kernel”: …, “…/bias”: …} with prefix
Return type: dict[str,numpy.ndarray]
get_last_hidden_state(key)[source]

If this is a recurrent layer, this would return the last hidden state. Otherwise, we return None.

Parameters: key (int|str|None) – also the special key “*”
Returns: optional tensor with shape (batch, dim)
Return type: tf.Tensor|None

classmethod is_prev_step_layer(layer)[source]
Parameters: layer (LayerBase) –
Return type: bool
get_sub_layer(layer_name)[source]
Parameters: layer_name (str) – name of the sub-layer (right part of the ‘/’-separated path)
Returns: the sub-layer addressed by layer_name, or None if no such sub-layer exists
Return type: LayerBase|None
class TFNetworkRecLayer.RecStepInfoLayer(i, end_flag=None, seq_lens=None, **kwargs)[source]

Used by _SubnetworkRecCell. Represents the current step number.

Parameters:
• i (tf.Tensor) – scalar, int32, current step (time)
• end_flag (tf.Tensor|None) – (batch,), bool, says that the current sequence has ended
• seq_lens (tf.Tensor|None) – (batch,), int32, sequence lengths
layer_class = ':i'[source]
get_end_flag()[source]
Returns: (batch,) of type bool; batch might include the beam size
Return type: tf.Tensor
class TFNetworkRecLayer.RnnCellLayer(n_out, unit, unit_opts=None, initial_state=None, initial_output=None, weights_init='xavier', **kwargs)[source]

Wrapper around tf.contrib.rnn.RNNCell. This will operate a single step, i.e. there is no time dimension, i.e. we expect a (batch,n_in) input, and our output is (batch,n_out). This is expected to be used inside a RecLayer.

Parameters:
• n_out (int) – so far, only output shape (batch,n_out) supported
• unit (str|tf.contrib.rnn.RNNCell) – e.g. “BasicLSTM” or “LSTMBlock”
• unit_opts (dict[str]|None) – passed to the cell.__init__
• initial_state (str|float|LayerBase|tuple[LayerBase]|dict[LayerBase]) – see self.get_rec_initial_state(). This will be set via transform_config_dict(). To get the state from another recurrent layer, use the GetLastHiddenStateLayer (get_last_hidden_state).
• initial_output (None) – the initial output is defined implicitly via the initial state, thus don’t set this
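For illustration, such a single-step cell could appear inside a rec subnetwork roughly as follows (a sketch; all layer names, dims, and the target name are made up for this example):

```python
# Hypothetical decoder subnet for a RecLayer "unit" dict: a linear
# embedding of the previous output feeds a single-step LSTM cell,
# whose state is carried across time steps by the enclosing RecLayer.
decoder_unit = {
    "embed": {"class": "linear", "activation": None,
              "from": ["prev:output"], "n_out": 128},
    "s": {"class": "rnn_cell", "unit": "LSTMBlock",
          "from": ["embed"], "n_out": 256},
    "output": {"class": "softmax", "from": ["s"], "target": "classes"},
}
```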
layer_class = 'rnn_cell'[source]
recurrent = True[source]
classmethod get_out_data_from_opts(n_out, name, sources=(), **kwargs)[source]
Parameters:
• n_out (int) –
• name (str) – layer name
• sources (list[LayerBase]) –
Return type: Data
get_absolute_name_scope_prefix()[source]
Returns: e.g. “output/”, always with “/” at the end
Return type: str
get_dep_layers()[source]
Returns: list of layers this layer depends on; normally this is just self.sources, but e.g. the attention layer in addition has a base, etc.
Return type: list[LayerBase]
classmethod get_hidden_state_size(n_out, unit, unit_opts=None, **kwargs)[source]
Parameters:
• n_out (int) –
• unit (str) –
• unit_opts (dict[str]|None) –
Returns: size or tuple of sizes
Return type: int|tuple[int]
classmethod get_output_from_state(state, unit)[source]
Parameters:
• state (tuple[tf.Tensor]|tf.Tensor) –
• unit (str) –
Return type: tf.Tensor
get_hidden_state()[source]
Returns: state as defined by the cell
Return type: tuple[tf.Tensor]|tf.Tensor
classmethod get_state_by_key(state, key)[source]
Parameters:
• state (tf.Tensor|tuple[tf.Tensor]|namedtuple) –
• key (int|str|None) –
Return type: tf.Tensor
get_last_hidden_state(key)[source]

If this is a recurrent layer, this would return the last hidden state. Otherwise, we return None.

Parameters: key (int|str|None) – also the special key “*”
Returns: optional tensor with shape (batch, dim)
Return type: tf.Tensor|None

classmethod get_rec_initial_state(batch_dim, name, n_out, unit, initial_state=None, unit_opts=None, rec_layer=None, **kwargs)[source]

Very similar to get_rec_initial_output(). Initial hidden state when used inside a recurrent layer for the frame t=-1, if it is needed. As arguments, we get the usual layer arguments. batch_dim is added because it might be special because of beam search. Also see transform_config_dict() for initial_state.

Note: This could maybe share code with get_rec_initial_output(), although it is a bit more generic here because the state can also be a namedtuple or any kind of nested structure.

Parameters:
• batch_dim (tf.Tensor) – including beam size in beam search
• name (str) – layer name
• n_out (int) – out dim
• unit (str) – cell name
• unit_opts (dict[str]|None) –
• initial_state (LayerBase|str|int|float|None|list|tuple|namedtuple) – see code
• rec_layer (RecLayer|LayerBase|None) – for the scope
Return type: tf.Tensor|tuple[tf.Tensor]|namedtuple
classmethod get_rec_initial_extra_outputs(**kwargs)[source]
Parameters:
• batch_dim (tf.Tensor) – for this layer, might be with beam
• rec_layer (TFNetworkRecLayer.RecLayer) –
Return type: dict[str,tf.Tensor]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
• d (dict[str]) – will be modified in place
• network (TFNetwork.TFNetwork) –
• get_layer ((str) -> LayerBase) – function to get or construct another layer
static transform_initial_state(initial_state, network, get_layer)[source]
Parameters:
• initial_state (str|float|int|list[str|float|int]|dict[str]|None) –
• network (TFNetwork.TFNetwork) –
• get_layer ((str) -> LayerBase) – function to get or construct another layer
classmethod get_rec_initial_output(unit, initial_output=None, initial_state=None, **kwargs)[source]

If this layer is used inside a recurrent layer, this function specifies the output of frame t=-1, if it is needed. As arguments, we get the usual layer arguments. batch_dim is added because it might be special because of beam search.

Note: This could maybe share code with RnnCellLayer.get_rec_initial_state(). We could also add support to make the initial output be the output of another layer.

Parameters:
• batch_dim (tf.Tensor) – including beam size in beam search
• name (str) – layer name
• output (Data) – template
• rec_layer (TFNetworkRecLayer.RecLayer) –
• initial_output (str|float|int|tf.Tensor|None) –
Return type: tf.Tensor
class TFNetworkRecLayer.GetLastHiddenStateLayer(n_out, combine='concat', key='*', **kwargs)[source]

Will combine (concat or add or so) all the last hidden states from all sources.

Parameters:
• n_out (int) – dimension; output will be of shape (batch, n_out)
• combine (str) – “concat” or “add”
• key (str|int|None) – for the state, which could be a namedtuple; see RnnCellLayer.get_state_by_key()
layer_class = 'get_last_hidden_state'[source]
get_last_hidden_state(key)[source]

If this is a recurrent layer, this would return the last hidden state. Otherwise, we return None.

Parameters: key (int|str|None) – also the special key “*”
Returns: optional tensor with shape (batch, dim)
Return type: tf.Tensor|None

classmethod get_out_data_from_opts(n_out, **kwargs)[source]

Gets a Data template (i.e. shape etc is set but not the placeholder) for our __init__ args. The purpose of having this as a separate classmethod is to be able to infer the shape information without having to construct the layer. This function should not create any nodes in the computation graph.

Parameters: kwargs – all the same kwargs as for self.__init__()
Returns: Data template (placeholder not set)
Return type: Data
class TFNetworkRecLayer.GetRecAccumulatedOutputLayer(sub_layer, **kwargs)[source]

For RecLayer with a subnet. If some layer is explicitly marked as an additional output layer (via ‘is_output_layer’: True), you can get that subnet layer output via this accessor. Retrieves the accumulated output.

Parameters: sub_layer (str) – layer of subnet in RecLayer source, which has ‘is_output_layer’: True
layer_class = 'get_rec_accumulated'[source]
classmethod get_out_data_from_opts(name, sources, sub_layer, **kwargs)[source]
Parameters:
• name (str) –
• sources (list[LayerBase]) –
• sub_layer (str) –
Return type: Data
class TFNetworkRecLayer.ChoiceLayer(beam_size, input_type='prob', explicit_search_source=None, length_normalization=True, source_beam_sizes=None, scheduled_sampling=False, cheating=False, **kwargs)[source]

This layer represents a choice to be made in search during inference, such as choosing the top-k outputs from a log-softmax for beam search. During training, this layer can return the true label. This is supposed to be used inside the rec layer. This can be extended in various ways.

We present the scores in +log space, and we will add them up along the path. Assume that we get input (batch,dim) from a (log-)softmax. Assume that each batch is already a choice via search. In search with a beam size of N, we would output sparse (batch=N,) and scores for each.

Parameters:
• beam_size (int) – the outgoing beam size, i.e. our output will be (batch * beam_size, …)
• input_type (str) – “prob” or “log_prob”, whether the input is in probability space, log-space, etc., or “regression” if it is a prediction of the data as-is. If there are several inputs, the same format is assumed for all of them.
• explicit_search_source (LayerBase|None) – will mark it as an additional dependency
• length_normalization (bool) – evaluates score_t/len in search
• source_beam_sizes (list[int]|None) – if there are several sources, they are pruned with these beam sizes before combination. If None, beam_size is used for all sources. Must have the same length as the number of sources.
• scheduled_sampling (dict|None) –
• cheating (bool) – if True, will always add the true target to the beam
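The score bookkeeping described above (log-space scores summed along the search path, then a top-k choice) can be sketched in NumPy; this is a simplified illustration of the principle, not the layer's actual implementation:

```python
import numpy as np

def beam_step(prev_scores, log_probs, beam_size):
    """One expansion step of beam search (simplified sketch).

    prev_scores: (beam_in,) accumulated +log scores per hypothesis.
    log_probs:   (beam_in, dim) log-softmax output for the current frame.
    Returns (new_scores, src_beam, labels), each of shape (beam_size,).
    """
    # Add the path score of each hypothesis to its per-label scores,
    # then flatten all (hypothesis, label) candidates into one axis.
    total = (prev_scores[:, None] + log_probs).reshape(-1)  # (beam_in * dim,)
    top = np.argsort(total)[::-1][:beam_size]               # best beam_size
    dim = log_probs.shape[1]
    return total[top], top // dim, top % dim

# Start from a single hypothesis with score 0 and expand once.
scores = np.array([0.0])
log_probs = np.log(np.array([[0.7, 0.2, 0.1]]))
new_scores, src_beam, labels = beam_step(scores, log_probs, beam_size=2)
```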
layer_class = 'choice'[source]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
• d (dict[str]) – will be modified in place
• network (TFNetwork.TFNetwork) –
• get_layer ((str) -> LayerBase) – function to get or construct another layer
classmethod get_out_data_from_opts(target, network, beam_size, **kwargs)[source]

Gets a Data template (i.e. shape etc is set but not the placeholder) for our __init__ args. The purpose of having this as a separate classmethod is to be able to infer the shape information without having to construct the layer. This function should not create any nodes in the computation graph.

Parameters: kwargs – all the same kwargs as for self.__init__()
Returns: Data template (placeholder not set)
Return type: Data
get_sub_layer(layer_name)[source]

Used to get outputs in the case of multiple targets. For all targets, we create a sub-layer that can be referred to as “self.name + ‘/out_’ + index” (e.g. output/out_0). These sub-layers can then be used as input to other layers, e.g. “output_0”: {“class”: “copy”, “from”: [“output/out_0”]}.

Parameters: layer_name (str) – name of the sub-layer (e.g. ‘out_0’)
Returns: internal layer that outputs labels for the target corresponding to layer_name
Return type: InternalLayer
classmethod get_sub_layer_out_data_from_opts(layer_name, parent_layer_kwargs)[source]
Parameters:
• layer_name (str) – name of the sub-layer (e.g. ‘out_0’), see self.get_sub_layer()
• parent_layer_kwargs (dict[str]) – kwargs for the parent layer; here we only need ‘network’ and ‘beam_size’
Returns: Data template, network and the class type of the sub-layer
Return type: (Data, TFNetwork, type)|None
classmethod get_rec_initial_extra_outputs(network, beam_size, **kwargs)[source]
Parameters:
• network (TFNetwork.TFNetwork) –
• beam_size (int) –
Return type: dict[str,tf.Tensor]
classmethod get_rec_initial_extra_outputs_shape_invariants(**kwargs)[source]
Returns: optional shapes for the tensors from get_rec_initial_extra_outputs
Return type: dict[str,tf.TensorShape]
get_dep_layers()[source]
Returns: list of layers this layer depends on; normally this is just self.sources, but e.g. the attention layer in addition has a base, etc.
Return type: list[LayerBase]
class TFNetworkRecLayer.DecideLayer(length_normalization=False, **kwargs)[source]

This is kind of the counterpart to the choice layer. It only has an effect in search mode. E.g. assume that the input is of shape (batch * beam, time, dim) and has search_sources set. Then this will output (batch, time, dim), where the beam with the highest score is selected. Thus, this makes a decision based on the scores. It will convert the data to batch-major mode.

Parameters: length_normalization (bool) – performed on the beam scores
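The selection this layer performs can be sketched as follows (a simplified NumPy illustration; it assumes the per-beam scores have already been accumulated elsewhere and ignores length normalization):

```python
import numpy as np

def decide(output, scores, beam_size):
    """Select the best beam per batch entry (simplified sketch).

    output: (batch * beam, time, dim) beam-major data.
    scores: (batch * beam,) accumulated +log scores.
    Returns (batch, time, dim), the highest-scoring beam per batch entry.
    """
    batch = output.shape[0] // beam_size
    output = output.reshape(batch, beam_size, *output.shape[1:])
    scores = scores.reshape(batch, beam_size)
    best = np.argmax(scores, axis=1)              # (batch,) best beam index
    return output[np.arange(batch), best]

# Two batch entries with two beams each.
out = np.arange(2 * 2 * 3 * 4, dtype=float).reshape(2 * 2, 3, 4)
sc = np.array([0.1, 0.9, 0.5, 0.2])
best = decide(out, sc, beam_size=2)
```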
layer_class = 'decide'[source]
classmethod decide(src, output=None, name=None, length_normalization=False)[source]
Parameters:
• src (LayerBase) – with search_choices set, e.g. input of shape (batch * beam, time, dim)
• output (Data|None) –
• name (str|None) –
• length_normalization (bool) – performed on the beam scores
Returns: best beam selected from the input, e.g. shape (batch, time, dim)
Return type: Data
classmethod get_out_data_from_opts(name, sources, network, **kwargs)[source]
Parameters:
• name (str) –
• sources (list[LayerBase]) –
• network (TFNetwork.TFNetwork) –
Return type: Data
class TFNetworkRecLayer.AttentionBaseLayer(base, **kwargs)[source]

This is the base class for attention. This layer would get constructed in the context of one single decoder step. We get the whole encoder output over all encoder frames (the base), e.g. (batch,enc_time,enc_dim), and some current decoder context, e.g. (batch,dec_att_dim), and we are supposed to return the attention output, e.g. (batch,att_dim).

Some sources:
• Bahdanau, Bengio, Montreal: Neural Machine Translation by Jointly Learning to Align and Translate, 2015, https://arxiv.org/abs/1409.0473
• Luong, Stanford: Effective Approaches to Attention-based Neural Machine Translation, 2015, https://arxiv.org/abs/1508.04025 -> dot, general, concat, location attention; comparison to Bahdanau
Parameters: base (LayerBase) – encoder output to attend on
get_dep_layers()[source]
Returns: list of layers this layer depends on; normally this is just self.sources, but e.g. the attention layer in addition has a base, etc.
Return type: list[LayerBase]
get_base_weights()[source]

We can formulate most attentions as some weighted sum over the base time-axis.

Returns: the weighting of shape (batch, base_time), in case it is defined
Return type: tf.Tensor|None
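The weighted-sum formulation can be written out explicitly; a minimal NumPy sketch of the idea (not the actual layer code):

```python
import numpy as np

def attention_output(base, weights):
    """Weighted sum over the base time axis.

    base:    (batch, enc_time, enc_dim) encoder output.
    weights: (batch, enc_time) attention weights; each row sums to 1.
    Returns (batch, enc_dim), the attention context vector.
    """
    # Sum over the encoder time axis, weighting each frame.
    return np.einsum("bte,bt->be", base, weights)

base = np.array([[[1.0, 0.0], [0.0, 1.0]]])   # (1, 2, 2)
w = np.array([[0.25, 0.75]])                   # attend mostly to frame 1
ctx = attention_output(base, w)
```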
get_base_weight_last_frame()[source]

From the base weights (see self.get_base_weights(), must return not None) takes the weighting of the last frame in the time-axis (according to sequence lengths).

Returns: shape (batch,) -> float (number in 0..1)
Return type: tf.Tensor
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
• d (dict[str]) – will be modified in place
• network (TFNetwork.TFNetwork) –
• get_layer ((str) -> LayerBase) – function to get or construct another layer. The name get_layer might be misleading, as this should return an existing layer, or construct it if it does not exist yet; network.get_layer would just return an existing layer.

Will modify d in place such that it becomes the kwargs for self.__init__(). Mostly leaves d as-is. This is used by TFNetwork.construct_from_dict(). It resolves certain arguments, e.g. it resolves the “from” argument, which is a list of strings, into the “sources” argument in kwargs, a list of LayerBase instances. Subclasses can extend/overwrite this; usually the only reason to do so is when some argument might be a reference to a layer which should be resolved.

classmethod get_out_data_from_opts(name, base, n_out=None, **kwargs)[source]
Parameters:
• name (str) –
• base (LayerBase) –
• n_out (int|None) –
Return type: Data
class TFNetworkRecLayer.GlobalAttentionContextBaseLayer(base_ctx, **kwargs)[source]
Parameters:
• base (LayerBase) – encoder output to attend on
• base_ctx (LayerBase) – encoder output used to calculate the attention weights
get_dep_layers()[source]
Returns: list of layers this layer depends on; normally this is just self.sources, but e.g. the attention layer in addition has a base, etc.
Return type: list[LayerBase]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
• d (dict[str]) – will be modified in place
• network (TFNetwork.TFNetwork) –
• get_layer ((str) -> LayerBase) – function to get or construct another layer. The name get_layer might be misleading, as this should return an existing layer, or construct it if it does not exist yet; network.get_layer would just return an existing layer.

Will modify d in place such that it becomes the kwargs for self.__init__(). Mostly leaves d as-is. This is used by TFNetwork.construct_from_dict(). It resolves certain arguments, e.g. it resolves the “from” argument, which is a list of strings, into the “sources” argument in kwargs, a list of LayerBase instances. Subclasses can extend/overwrite this; usually the only reason to do so is when some argument might be a reference to a layer which should be resolved.

class TFNetworkRecLayer.GenericAttentionLayer(weights, auto_squeeze=True, **kwargs)[source]

The weighting for the base is specified explicitly here. This can e.g. be used together with SoftmaxOverSpatialLayer.

Parameters:
• base (LayerBase) – encoder output to attend on. (B, enc-time)|(enc-time, B) + (…) + (n_out,)
• weights (LayerBase) – attention weights. ((B, enc-time)|(enc-time, B)) + (1,)|()
• auto_squeeze (bool) – auto-squeeze any weight axes with dim=1 away
layer_class = 'generic_attention'[source]
get_dep_layers()[source]
Returns: list of layers this layer depends on; normally this is just self.sources, but e.g. the attention layer in addition has a base, etc.
Return type: list[LayerBase]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
• d (dict[str]) – will be modified in place
• network (TFNetwork.TFNetwork) –
• get_layer ((str) -> LayerBase) – function to get or construct another layer. The name get_layer might be misleading, as this should return an existing layer, or construct it if it does not exist yet; network.get_layer would just return an existing layer.

Will modify d in place such that it becomes the kwargs for self.__init__(). Mostly leaves d as-is. This is used by TFNetwork.construct_from_dict(). It resolves certain arguments, e.g. it resolves the “from” argument, which is a list of strings, into the “sources” argument in kwargs, a list of LayerBase instances. Subclasses can extend/overwrite this; usually the only reason to do so is when some argument might be a reference to a layer which should be resolved.

classmethod get_out_data_from_opts(base, weights, auto_squeeze=True, **kwargs)[source]
Parameters:
• base (LayerBase) –
• weights (LayerBase) –
• auto_squeeze (bool) –
Return type: Data
class TFNetworkRecLayer.DotAttentionLayer(energy_factor=None, **kwargs)[source]

Classic global attention: Dot-product as similarity measure between base_ctx and source.

Parameters:
• base (LayerBase) – encoder output to attend on; defines the output dim
• base_ctx (LayerBase) – encoder output used to calculate the attention weights, combined with the input data; its dim must be equal to the input data’s dim
• energy_factor (float|None) – the energy will be scaled by this factor. This is like a temperature for the softmax. In Attention Is All You Need, this is set to 1/sqrt(base_ctx.dim).
layer_class = 'dot_attention'[source]
class TFNetworkRecLayer.ConcatAttentionLayer(**kwargs)[source]

Additive attention / tanh-concat attention as similarity measure between base_ctx and source. This is used by Montreal, whereas Stanford compared it to dot-attention. The concat-attention is maybe more standard for machine translation at the moment.

layer_class = 'concat_attention'[source]
class TFNetworkRecLayer.GenericWindowAttentionLayer(weights, window_size, **kwargs)[source]
layer_class = 'generic_window_attention'[source]
class TFNetworkRecLayer.GaussWindowAttentionLayer(window_size, std=1.0, inner_size=None, inner_size_step=0.5, **kwargs)[source]

Interprets the incoming source as the location (float32, shape (batch,)) and returns a gauss-window-weighting of the base around the location. The window size is fixed (TODO: but the variance can optionally be dynamic).

Parameters:
• window_size (int) – the window size where the Gaussian window will be applied on the base
• std (float) – standard deviation for the Gaussian
• inner_size (int|None) – if given, the output will have an additional dimension of this size, where t is shifted by +/- inner_size_step around, e.g. [t-1, t-0.5, t, t+0.5, t+1] would be the locations with inner_size=5 and inner_size_step=0.5
• inner_size_step (float) – see inner_size above
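A simplified NumPy sketch of such a Gaussian window weighting (ignoring inner_size, sequence-length masking, and other details of the actual layer; the helper name is made up):

```python
import numpy as np

def gauss_window_att(base, location, window_size, std=1.0):
    """Gaussian-weighted sum of base frames around a float location.

    base:     (batch, enc_time, dim) encoder output.
    location: (batch,) float positions into the base time axis.
    Returns (batch, dim).
    """
    batch, enc_time, dim = base.shape
    # Integer frame indices in a window centered on the location.
    offsets = np.arange(window_size) - window_size // 2
    idx = np.round(location).astype(int)[:, None] + offsets[None, :]
    idx = np.clip(idx, 0, enc_time - 1)
    # Gaussian weights over the window offsets, normalized to sum to 1.
    w = np.exp(-0.5 * (offsets / std) ** 2)
    w = w / w.sum()
    frames = base[np.arange(batch)[:, None], idx]  # (batch, window, dim)
    return np.einsum("bwd,w->bd", frames, w)

out = gauss_window_att(np.ones((1, 10, 3)), np.array([5.0]), window_size=5)
```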
layer_class = 'gauss_window_attention'[source]
classmethod get_out_data_from_opts(inner_size=None, **kwargs)[source]
Parameters: name (str) – n_out (int|None) – base (LayerBase) – Data
class TFNetworkRecLayer.SelfAttentionLayer(num_heads, total_key_dim, forward_weights_init='glorot_uniform', attention_dropout=0.0, attention_left_only=False, **kwargs)[source]

Applies self-attention on the input. I.e., with input x, it will basically calculate

att(Q x, K x, V x),

where att is multi-head dot-attention for now, and Q, K, V are matrices. The attention will be over the time dimension. If there is no time dimension, we expect to be inside a RecLayer; also, this is then only valid with attention_left_only=True.

https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/layers/common_attention.py
Parameters:
• num_heads (int) –
• total_key_dim (int) –
• forward_weights_init (str) – see TFUtil.get_initializer()
• attention_dropout (float) –
• attention_left_only (bool) – will mask out the future; see Attention Is All You Need
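A minimal single-head sketch of the att(Q x, K x, V x) computation (NumPy; the actual layer is multi-head, uses learned projection matrices, and can mask out the future):

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over time.

    x: (time, n_in); Wq, Wk, Wv: (n_in, d) projection matrices.
    Returns (time, d).
    """
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    # Scaled dot-product energies between all time positions.
    energy = q @ k.T / np.sqrt(k.shape[-1])        # (time, time)
    energy = energy - energy.max(axis=-1, keepdims=True)  # stability
    att = np.exp(energy)
    att = att / att.sum(axis=-1, keepdims=True)    # softmax over keys
    return att @ v

# With zero input, the attention weights are uniform and the output is 0.
W = np.eye(2)
out = self_attention(np.zeros((3, 2)), W, W, W)
```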
layer_class = 'self_attention'[source]
recurrent = True[source]
classmethod get_out_data_from_opts(n_out, name, sources, **kwargs)[source]
Parameters: n_out (int) – name (str) – sources (list[LayerBase]) – Data
classmethod get_rec_initial_extra_outputs(batch_dim, rec_layer, num_heads, total_key_dim, n_out, sources=(), **kwargs)[source]
Parameters:
• batch_dim (tf.Tensor) – for this layer, might be with beam
• rec_layer (TFNetworkRecLayer.RecLayer) –
Return type: dict[str,tf.Tensor]
classmethod get_rec_initial_extra_outputs_shape_invariants(num_heads, total_key_dim, n_out, sources, **kwargs)[source]
Returns: optional shapes for the tensors from get_rec_initial_extra_outputs
Return type: dict[str,tf.TensorShape]
class TFNetworkRecLayer.PositionalEncodingLayer(add_to_input=False, **kwargs)[source]

Provides positional encoding in the form of (batch, time, n_out), where n_out is the number of channels, if it is run outside a RecLayer, or (batch, n_out) if run inside a RecLayer, where it will depend on the current time frame.

Assumes one source input with a time dimension if outside a RecLayer. By default (“from” key not provided), it would either use “data”, or “:i”. With add_to_input, it will calculate x + input.

The positional encoding is the same as in Tensor2Tensor. See TFUtil.get_positional_encoding().
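A simplified sketch of this kind of sinusoidal encoding (see TFUtil.get_positional_encoding() for the actual implementation; this NumPy version assumes an even n_out):

```python
import numpy as np

def positional_encoding(length, n_out, min_timescale=1.0, max_timescale=1.0e4):
    """Sinusoidal positional encoding of shape (length, n_out).

    The first half of the channels are sines, the second half cosines,
    over geometrically spaced timescales (Tensor2Tensor style).
    Assumes n_out is even.
    """
    num_timescales = n_out // 2
    log_inc = np.log(max_timescale / min_timescale) / max(num_timescales - 1, 1)
    inv_timescales = min_timescale * np.exp(-log_inc * np.arange(num_timescales))
    # (length, num_timescales): position scaled by each inverse timescale.
    scaled = np.arange(length)[:, None] * inv_timescales[None, :]
    return np.concatenate([np.sin(scaled), np.cos(scaled)], axis=1)

pe = positional_encoding(5, 8)
```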

layer_class = 'positional_encoding'[source]
recurrent = True[source]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
• d (dict[str]) –
• network (TFNetwork.TFNetwork) –
• get_layer ((str) -> LayerBase) –
classmethod get_out_data_from_opts(name, network, add_to_input=False, sources=(), **kwargs)[source]
Parameters:
• name (str) –
• network (TFNetwork.TFNetwork) –
• add_to_input (bool) –
• sources (list[LayerBase]) –
Return type: Data
class TFNetworkRecLayer.KenLmStateLayer(lm_file, vocab_file=None, vocab_unknown_label='UNK', bpe_merge_symbol=None, input_step_offset=0, dense_output=False, debug=False, **kwargs)[source]

Gets the next word (or subword) each frame, accumulates the string, keeps the state of the string seen so far, and returns the score (+log space, natural base e) of the sequence, using KenLM (http://kheafield.com/code/kenlm/) (see TFKenLM). The EOS (</s>) token must be used explicitly.

Parameters:
• lm_file (str|()->str) – ARPA file or similar, whatever KenLM supports
• vocab_file (str|None) – if the inputs are symbols, this must be provided; see Vocabulary
• vocab_unknown_label (str) – for the vocabulary
• bpe_merge_symbol (str|None) – e.g. “@@” if you want to apply BPE merging
• input_step_offset (int) – if provided, will consider the input only from this step onwards
• dense_output (bool) – whether we output the score for all possible succeeding tokens
• debug (bool) – prints debug info
layer_class = 'kenlm'[source]
recurrent = True[source]
classmethod get_out_data_from_opts(name, sources, vocab_file=None, vocab_unknown_label='UNK', dense_output=False, **kwargs)[source]

Gets a Data template (i.e. shape etc is set but not the placeholder) for our __init__ args. The purpose of having this as a separate classmethod is to be able to infer the shape information without having to construct the layer. This function should not create any nodes in the computation graph.

Parameters: kwargs – all the same kwargs as for self.__init__()
Returns: Data template (placeholder not set)
Return type: Data
classmethod get_rec_initial_extra_outputs(batch_dim, rec_layer, sources=(), **kwargs)[source]
Parameters:
• batch_dim (tf.Tensor) – for this layer, might be with beam
• rec_layer (TFNetworkRecLayer.RecLayer) –
Return type: dict[str,tf.Tensor]
class TFNetworkRecLayer.BaseRNNCell(trainable=True, name=None, dtype=None, activity_regularizer=None, **kwargs)[source]

Extends rnn_cell.RNNCell by having explicit static attributes describing some properties.

get_input_transformed(x, batch_dim=None)[source]

Usually the cell itself does the transformation on the input. However, it would be faster to do it outside the recurrent loop. This function will get called outside the loop.

Parameters:
• x (tf.Tensor) – (time, batch, dim), or (batch, dim)
• batch_dim (tf.Tensor|None) –
Returns: like x, maybe with another feature dim
Return type: tf.Tensor|tuple[tf.Tensor]
class TFNetworkRecLayer.RHNCell(num_units, is_training=None, depth=5, dropout=0.0, dropout_seed=None, transform_bias=None, batch_size=None)[source]

Recurrent Highway Layer. With optional dropout for recurrent state (fixed over all frames - some call this variational).

References:
• https://github.com/julian121266/RecurrentHighwayNetworks/
• https://arxiv.org/abs/1607.03474
Parameters:
• num_units (int) –
• is_training (bool|tf.Tensor|None) –
• depth (int) –
• dropout (float) –
• dropout_seed (int) –
• transform_bias (float|None) –
• batch_size (int|tf.Tensor|None) –
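One recurrence micro-step of such a cell can be sketched as follows (following the RHN paper's update s' = h·t + s·(1-t), with a coupled carry gate; simplified, without dropout, and with made-up helper names):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rhn_micro_step(s, x_h, x_t, W_h, W_t, transform_bias=-2.0):
    """One highway micro-step: s' = h * t + s * (1 - t).

    s:        (batch, units) previous state.
    x_h, x_t: (batch, units) input contributions (zero for depth > 1).
    W_h, W_t: (units, units) recurrent weight matrices.
    transform_bias: negative bias so the gate starts mostly closed.
    """
    h = np.tanh(x_h + s @ W_h)                   # candidate update
    t = sigmoid(x_t + s @ W_t + transform_bias)  # transform gate
    return h * t + s * (1.0 - t)                 # coupled carry gate

# With zero state, input, and weights, the state stays at zero.
z = np.zeros((2, 3))
out = rhn_micro_step(z, z, z, np.zeros((3, 3)), np.zeros((3, 3)))
```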
output_size[source]

Integer or TensorShape: size of outputs produced by this cell.

state_size[source]

size(s) of state(s) used by this cell.

It can be represented by an Integer, a TensorShape or a tuple of Integers or TensorShapes.

get_input_transformed(x, batch_dim=None)[source]
Parameters: x (tf.Tensor) – (time, batch, dim)
Returns: (time, batch, num_units * 2)
Return type: tf.Tensor
call(inputs, state)[source]
Parameters:
• inputs (tf.Tensor) –
• state (tf.Tensor) –
Returns: (output, state)
Return type: (tf.Tensor, tf.Tensor)
class TFNetworkRecLayer.BlocksparseLSTMCell(*args, **kwargs)[source]

Standard LSTM but uses OpenAI blocksparse kernels to support bigger matrices.

Refs:

It uses our own wrapper, see TFNativeOp.init_blocksparse().

call(*args, **kwargs)[source]

The logic of the layer lives here.

Arguments:
inputs: input tensor(s). **kwargs: additional keyword arguments.
Returns:
Output tensor(s).
load_params_from_native_lstm(values_dict, session)[source]
Parameters:
• values_dict (dict[str,numpy.ndarray]) –
• session (tf.Session) –
class TFNetworkRecLayer.BlocksparseMultiplicativeMultistepLSTMCell(*args, **kwargs)[source]

Multiplicative LSTM with multiple steps, as in the OpenAI blocksparse paper. Uses OpenAI blocksparse kernels to support bigger matrices.

Refs:

call(*args, **kwargs)[source]

The logic of the layer lives here.

Arguments:
• inputs: input tensor(s).
• **kwargs: additional keyword arguments.
Returns:
Output tensor(s).
class TFNetworkRecLayer.LayerNormVariantsLSTMCell(num_units, norm_gain=1.0, norm_shift=0.0, forget_bias=0.0, activation=<function tanh>, is_training=None, dropout=0.0, dropout_h=0.0, dropout_seed=None, with_concat=False, global_norm=True, global_norm_joined=False, per_gate_norm=False, cell_norm=True, cell_norm_in_output=True, hidden_norm=False, variance_epsilon=1e-12)[source]

LSTM unit with layer normalization and recurrent dropout.

This LSTM cell can apply different variants of layer normalization:

1. Layer normalization as in the original paper (ref: https://arxiv.org/abs/1607.06450). This can be applied by having:

• all default params (global_norm=True, cell_norm=True, cell_norm_in_output=True)

2. Layer normalization for RNMT+ (ref: https://arxiv.org/abs/1804.09849). This can be applied by having all default params except:

• global_norm = False
• per_gate_norm = True
• cell_norm_in_output = False

3. The TF official LayerNormBasicLSTMCell (ref: https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/LayerNormBasicLSTMCell). This can be reproduced by having all default params except:

• global_norm = False
• per_gate_norm = True

4. The Sockeye LSTM layer normalization implementations (ref: https://github.com/awslabs/sockeye/blob/master/sockeye/rnn.py):

• LayerNormLSTMCell can be reproduced by having all default params except with_concat = False (just efficiency, no difference in the model)
• LayerNormPerGateLSTMCell can be reproduced by having all default params except (with_concat = False,) global_norm = False and per_gate_norm = True
Recurrent dropout is based on:
https://arxiv.org/abs/1603.05118
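The defining property of this recurrent (variational) dropout is that one mask is sampled per sequence and reused at every time step, instead of resampling per step. A minimal NumPy sketch of that idea (shapes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
time, batch, units, rate = 6, 3, 4, 0.5
h = rng.standard_normal((time, batch, units))  # stand-in for the hidden states

keep = 1.0 - rate
# one mask per sequence, with inverted-dropout scaling; fixed over all frames
mask = (rng.random((batch, units)) < keep) / keep
h_dropped = h * mask[None, :, :]  # the same mask is broadcast over the time axis
```

A unit that is dropped is thus dropped at every frame of the sequence, which is what distinguishes this from standard per-step dropout.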

Prohibited LN combinations:
• global_norm and global_norm_joined both enabled
• per_gate_norm with global_norm or global_norm_joined

Parameters:
• num_units (int) – number of lstm units
• norm_gain (float) – layer normalization gain value
• norm_shift (float) – layer normalization shift (bias) value
• forget_bias (float) – the bias added to forget gates
• activation – activation function to be applied in the lstm cell
• is_training (bool) – if True then we are in the training phase
• dropout (float) – dropout rate, applied on cell-in (j)
• dropout_h (float) – dropout rate, applied on the hidden state (h) when it enters the LSTM (variational dropout)
• dropout_seed (int) – used to create random seeds
• with_concat (bool) – if True then the input and prev hidden state are concatenated for the computation. This is just about computation performance.
• global_norm (bool) – if True then layer normalization is applied to the forward and recurrent outputs (separately)
• global_norm_joined (bool) – if True, then layer norm is applied on the LSTM in (forward and recurrent output together)
• per_gate_norm (bool) – if True then layer normalization is applied per lstm gate
• cell_norm (bool) – if True then layer normalization is applied to the LSTM new cell output
• cell_norm_in_output (bool) – if True, the normalized cell is also used in the output
• hidden_norm (bool) – if True then layer normalization is applied to the LSTM new hidden state output
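As a sketch, the four LN variants described above map to option dicts like the following, which could be passed as unit_opts of a RecLayer. The exact unit string for this cell is an assumption here (derived from the class name); check how RETURNN resolves cell names before relying on it:

```python
# Option sets reproducing the LN variants listed above; everything not
# mentioned stays at its default value.
ln_original = {}  # variant 1: all defaults (global_norm=True, cell_norm=True, cell_norm_in_output=True)
ln_rnmt_plus = {  # variant 2: RNMT+ style
    "global_norm": False,
    "per_gate_norm": True,
    "cell_norm_in_output": False,
}
ln_tf_basic = {  # variant 3: TF LayerNormBasicLSTMCell style
    "global_norm": False,
    "per_gate_norm": True,
}
ln_sockeye_lstm = {"with_concat": False}  # variant 4: Sockeye LayerNormLSTMCell

# Hypothetical usage in a network dict (the unit name is an assumption):
layer = {"class": "rec", "from": ["data"], "n_out": 512,
         "unit": "LayerNormVariantsLSTM", "unit_opts": ln_rnmt_plus}
```

Note that the prohibited combinations above (e.g. global_norm together with global_norm_joined) must be avoided when composing such dicts.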
output_size[source]

Integer or TensorShape: size of outputs produced by this cell.

state_size[source]

size(s) of state(s) used by this cell.

It can be represented by an Integer, a TensorShape or a tuple of Integers or TensorShapes.

get_input_transformed(inputs, batch_dim=None)[source]

Usually the cell itself does the transformation on the input. However, it would be faster to do it outside the recurrent loop. This function will get called outside the loop.

Parameters:
• inputs (tf.Tensor) – (time, batch, dim), or (batch, dim)
• batch_dim (tf.Tensor|None) –
Returns: like inputs, maybe with another feature dim, tf.Tensor|tuple[tf.Tensor]
class TFNetworkRecLayer.TwoDLSTMLayer(pooling='last', unit_opts=None, forward_weights_init=None, recurrent_weights_init=None, bias_init=None, **kwargs)[source]
Parameters:
• pooling (str) – defines how the 1D return value is computed based on the 2D lstm result. Either 'last' or 'max'
• unit_opts (None|dict[str]) – passed to RNNCell creation
• forward_weights_init (str) – see TFUtil.get_initializer()
• recurrent_weights_init (str) – see TFUtil.get_initializer()
• bias_init (str) – see TFUtil.get_initializer()
layer_class = 'twod_lstm'[source]
recurrent = True[source]
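A usage sketch in a RETURNN network dict, based on the layer_class and parameters above; the dimensions, initializer strings, and surrounding layers are illustrative, not from the source:

```python
# Hypothetical network dict using the twod_lstm layer; "data" and "output"
# layer choices are placeholders for a real setup.
network = {
    "twod": {
        "class": "twod_lstm",
        "from": ["data"],
        "n_out": 512,
        "pooling": "max",  # reduce the 2D lstm result to 1D: 'last' or 'max'
        "forward_weights_init": "glorot_uniform",
        "recurrent_weights_init": "glorot_uniform",
        "bias_init": "zeros",
    },
    "output": {"class": "softmax", "from": ["twod"], "loss": "ce"},
}
```

The pooling option decides whether the last frame of the second axis or the maximum over it becomes the 1D output.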
classmethod get_out_data_from_opts(sources, n_out, name, **kwargs)[source]

Gets a Data template (i.e. the shape etc. are set, but not the placeholder) for our __init__ args. The purpose of having this as a separate classmethod is to be able to infer the shape information without having to construct the layer. This function should not create any nodes in the computation graph.

Parameters:
• kwargs – all the same kwargs as for self.__init__()
Returns: Data template (placeholder not set), Data
get_constraints_value()[source]
Returns: None or scalar, tf.Tensor|None
classmethod helper_extra_outputs(batch_dim, src_length, features)[source]
classmethod get_rec_initial_extra_outputs(batch_dim, n_out, sources, **kwargs)[source]
Parameters:
• batch_dim (tf.Tensor) – for this layer, might be with beam
• rec_layer (TFNetworkRecLayer.RecLayer) –
Returns: dict[str,tf.Tensor]
classmethod get_rec_initial_extra_outputs_shape_invariants(n_out, sources, **kwargs)[source]
Returns: optional shapes for the tensors from get_rec_initial_extra_outputs(), dict[str,tf.TensorShape]