TFNetworkRecLayer

class TFNetworkRecLayer.RecLayer(unit='lstm', unit_opts=None, direction=None, input_projection=True, initial_state=None, max_seq_len=None, forward_weights_init=None, recurrent_weights_init=None, bias_init=None, optimize_move_layers_out=None, cheating=False, unroll=False, **kwargs)[source]

Recurrent layer, has support for several implementations of LSTMs (via the unit argument), see the TensorFlow LSTM benchmark (http://returnn.readthedocs.io/en/latest/tf_lstm_benchmark.html), and also GRU, or simple RNN. Via the unit parameter, you specify the operation/model performed in the recurrence. It can be a string specifying an RNN cell; all TF cells can be used, the “Cell” suffix can be omitted, and case is ignored. Some possible LSTM implementations are (in all cases for both CPU and GPU):

  • BasicLSTM (the cell), via official TF; pure TF implementation.
  • LSTMBlock (the cell), via tf.contrib.rnn.
  • LSTMBlockFused, via tf.contrib.rnn; should be much faster than BasicLSTM.
  • CudnnLSTM, via tf.contrib.cudnn_rnn; still experimental.
  • NativeLSTM, our own native LSTM; should be faster than LSTMBlockFused.
  • NativeLstm2, improved own native LSTM; should be the fastest and most powerful.

We default to the fastest currently tested one, i.e. NativeLSTM. Note that the implementations are currently not compatible with each other, i.e. they differ in the way the parameters are represented.
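
For example, a bidirectional LSTM encoder layer pair could look like this (a minimal sketch; layer names and dimensions are illustrative; "nativelstm2" selects NativeLstm2, case-insensitively):

{
    "lstm0_fwd": {"class": "rec", "unit": "nativelstm2", "direction": 1,
                  "from": ["data"], "n_out": 512},
    "lstm0_bwd": {"class": "rec", "unit": "nativelstm2", "direction": -1,
                  "from": ["data"], "n_out": 512},
}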

A subnetwork can also be given, which will be evaluated step by step. It can use attention over some separate input, and it can thus be used to implement a decoder in a sequence-to-sequence scenario. The subnetwork will get the extern data from the parent net as templates, and if there is input to the RecLayer, then it will be available as the “source” data key in the subnetwork. The subnetwork is specified as a dict for the unit parameter. In the subnetwork, you can access outputs of layers from the previous time step by referring to them with the “prev:” prefix.

Example:

{
    "class": "rec",
    "from": ["input"],
    "unit": {
      # Recurrent subnet here, operate on a single time-step:
      "output": {
        "class": "linear",
        "from": ["prev:output", "data:source"],
        "activation": "relu",
        "n_out": n_out},
    },
    "n_out": n_out},
}

More examples can be seen in test_TFNetworkRecLayer and test_TFEngine.

The subnetwork can automatically optimize the inner recurrent loop by moving layers out of the loop if possible. It will try to do that greedily. This can be disabled via the option optimize_move_layers_out. It assumes that those layers behave the same whether they are applied to the whole sequence (with a time dimension) or per step (without a time dimension). Examples of such layers are LinearLayer, RnnCellLayer or SelfAttentionLayer with the option attention_left_only.
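
As an illustration (a hedged sketch; layer names and dimensions are made up), the first layer below depends only on the per-step input, so with the optimization enabled it can be moved out of the loop and computed over the whole sequence at once, while the second must stay inside the loop because it refers to its own previous output:

"unit": {
    # Depends only on the input of the current step; can be moved
    # out of the loop and computed over the whole sequence at once:
    "proj": {"class": "linear", "activation": "tanh",
             "from": ["data:source"], "n_out": 500},
    # Refers to "prev:output", so it must stay inside the loop:
    "output": {"class": "linear", "activation": "relu",
               "from": ["prev:output", "proj"], "n_out": n_out},
}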

Parameters:
  • unit (str|dict[str,dict[str]]) – the RNNCell/etc. name, e.g. “nativelstm”; see the comment above. Alternatively a whole subnetwork, which will be executed step by step, and which can use the “prev:” prefix in addition to “from” to refer to previous steps.
  • unit_opts (None|dict[str]) – passed to RNNCell creation
  • direction (int|None) – None|1 -> forward, -1 -> backward
  • input_projection (bool) – True -> input is multiplied with a matrix. False only works if the input already has the matching dimension.
  • initial_state (LayerBase|str|float|int|tuple|None) –
  • max_seq_len (int|tf.Tensor|None) – only used if unit is a subnetwork. If given as a str, it will be evaluated. See code.
  • forward_weights_init (str) – see TFUtil.get_initializer()
  • recurrent_weights_init (str) – see TFUtil.get_initializer()
  • bias_init (str) – see TFUtil.get_initializer()
  • optimize_move_layers_out (bool|None) – will automatically move layers out of the loop when possible
  • cheating (bool) – make targets available, and determine length by them
  • unroll (bool) – if possible, unroll the loop (implementation detail)
layer_class = 'rec'[source]
recurrent = True[source]
get_dep_layers()[source]
Returns:list of layers this layer depends on. normally this is just self.sources but e.g. the attention layer in addition has a base, etc.
Return type:list[LayerBase]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace
  • network (TFNetwork.TFNetwork) –
  • get_layer ((str) -> LayerBase) – function to get or construct another layer
classmethod get_out_data_from_opts(unit, sources=(), initial_state=None, **kwargs)[source]

Gets a Data template (i.e. shape etc is set but not the placeholder) for our __init__ args. The purpose of having this as a separate classmethod is to be able to infer the shape information without having to construct the layer. This function should not create any nodes in the computation graph.

Parameters:kwargs – all the same kwargs as for self.__init__()
Returns:Data template (placeholder not set)
Return type:Data
get_absolute_name_scope_prefix()[source]
Returns:e.g. “output/”, always with “/” at end
Return type:str
classmethod get_rnn_cell_class(name)[source]
Parameters:name (str) – cell name, minus the “Cell” at the end
Return type:() -> rnn_cell.RNNCell|TFNativeOp.RecSeqCellOp
classmethod get_losses(name, network, output, loss=None, reduce_func=None, layer=None, **kwargs)[source]
Parameters:
  • name (str) – layer name
  • network (TFNetwork.TFNetwork) –
  • loss (Loss|None) – argument just as for __init__
  • output (Data) – the output (template) for the layer
  • reduce_func (((tf.Tensor)->tf.Tensor)|None) –
  • layer (LayerBase|None) –
  • kwargs – other layer kwargs
Return type:

list[TFNetwork.LossHolder]

get_constraints_value()[source]
Returns:None or scalar
Return type:tf.Tensor|None
static convert_cudnn_canonical_to_lstm_block(reader, prefix, target='lstm_block_wrapper/')[source]

This assumes CudnnLSTM currently, with num_layers=1, input_mode=”linear_input”, direction=’unidirectional’!

Parameters:
  • reader (tf.train.CheckpointReader) –
  • prefix (str) – e.g. “layer2/rec/”
  • target (str) – e.g. “lstm_block_wrapper/” or “rnn/lstm_cell/”
Returns:

dict key -> value, {“…/kernel”: …, “…/bias”: …} with prefix

Return type:

dict[str,numpy.ndarray]

get_last_hidden_state(key)[source]

If this is a recurrent layer, this would return the last hidden state. Otherwise, we return None.

Parameters:key (int|str|None) – also the special key “*”
Returns:optional tensor with shape (batch, dim)
Return type:tf.Tensor|None

classmethod is_prev_step_layer(layer)[source]
Parameters:layer (LayerBase) –
Return type:bool
class TFNetworkRecLayer.RecStepInfoLayer(i, end_flag=None, seq_lens=None, **kwargs)[source]

Used by _SubnetworkRecCell. Represents the current step number.

Parameters:
  • i (tf.Tensor) – scalar, int32, current step (time)
  • end_flag (tf.Tensor|None) – (batch,), bool, says that the current sequence has ended
  • seq_lens (tf.Tensor|None) – (batch,) int32, seq lens
layer_class = ':i'[source]
get_end_flag()[source]
Returns:(batch,) of type bool. batch might include beam size
Return type:tf.Tensor
class TFNetworkRecLayer.RnnCellLayer(n_out, unit, unit_opts=None, initial_state=None, initial_output=None, weights_init='xavier', **kwargs)[source]

Wrapper around tf.contrib.rnn.RNNCell. This operates a single step, i.e. there is no time dimension, i.e. we expect a (batch, n_in) input, and our output is (batch, n_out). This is expected to be used inside a RecLayer.
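
Inside a RecLayer subnetwork, it could be used like this (a minimal sketch; layer names and dimensions are illustrative):

"unit": {
    "embed": {"class": "linear", "activation": None, "n_out": 256,
              "from": ["prev:output"]},
    "s": {"class": "rnn_cell", "unit": "LSTMBlock", "n_out": 500,
          "from": ["embed", "data:source"]},
    "output": {"class": "softmax", "from": ["s"], "target": "classes"},
}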

Parameters:
  • n_out (int) – so far, only output shape (batch,n_out) supported
  • unit (str|tf.contrib.rnn.RNNCell) – e.g. “BasicLSTM” or “LSTMBlock”
  • unit_opts (dict[str]|None) – passed to the cell.__init__
  • initial_state (str|float|LayerBase|tuple[LayerBase]|dict[LayerBase]) – see self.get_rec_initial_state(). This will be set via transform_config_dict(). To get the state from another recurrent layer, use the GetLastHiddenStateLayer (get_last_hidden_state).
  • initial_output (None) – the initial output is defined implicitly via initial state, thus don’t set this
layer_class = 'rnn_cell'[source]
recurrent = True[source]
classmethod get_out_data_from_opts(n_out, name, sources=(), **kwargs)[source]
Parameters:
  • n_out (int) –
  • name (str) – layer name
  • sources (list[LayerBase]) –
Return type:

Data

get_dep_layers()[source]
Returns:list of layers this layer depends on. normally this is just self.sources but e.g. the attention layer in addition has a base, etc.
Return type:list[LayerBase]
classmethod get_hidden_state_size(n_out, unit, unit_opts=None, **kwargs)[source]
Parameters:
  • n_out (int) –
  • unit (str) –
  • unit_opts (dict[str]|None) –
Returns:

size or tuple of sizes

Return type:

int|tuple[int]

classmethod get_output_from_state(state, unit)[source]
Parameters:
  • state (tuple[tf.Tensor]|tf.Tensor) –
  • unit (str) –
Return type:

tf.Tensor

get_hidden_state()[source]
Returns:state as defined by the cell
Return type:tuple[tf.Tensor]|tf.Tensor
classmethod get_state_by_key(state, key)[source]
Parameters:
  • state (tf.Tensor|tuple[tf.Tensor]|namedtuple) –
  • key (int|str|None) –
Return type:

tf.Tensor

get_last_hidden_state(key)[source]

If this is a recurrent layer, this would return the last hidden state. Otherwise, we return None.

Parameters:key (int|str|None) – also the special key “*”
Returns:optional tensor with shape (batch, dim)
Return type:tf.Tensor|None

classmethod get_rec_initial_state(batch_dim, name, n_out, unit, initial_state=None, unit_opts=None, rec_layer=None, **kwargs)[source]

Very similar to get_rec_initial_output(). Initial hidden state when used inside a recurrent layer for the frame t=-1, if it is needed. As arguments, we get the usual layer arguments. batch_dim is added because it might be special because of beam search. Also see transform_config_dict() for initial_state.

Note: This could maybe share code with get_rec_initial_output(), although it is a bit more generic here because the state can also be a namedtuple or any kind of nested structure.

Parameters:
  • batch_dim (tf.Tensor) – including beam size in beam search
  • name (str) – layer name
  • n_out (int) – out dim
  • unit (str) – cell name
  • unit_opts (dict[str]|None) –
  • initial_state (LayerBase|str|int|float|None|list|tuple|namedtuple) – see code
  • rec_layer (RecLayer|LayerBase|None) – for the scope
Return type:

tf.Tensor|tuple[tf.Tensor]|namedtuple

classmethod get_rec_initial_extra_outputs(**kwargs)[source]
Parameters:
Return type:

dict[str,tf.Tensor]

classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace
  • network (TFNetwork.TFNetwork) –
  • get_layer ((str) -> LayerBase) – function to get or construct another layer
static transform_initial_state(initial_state, network, get_layer)[source]
Parameters:
  • initial_state (str|float|int|list[str|float|int]|dict[str]|None) –
  • network (TFNetwork.TFNetwork) –
  • get_layer ((str) -> LayerBase) – function to get or construct another layer
classmethod get_rec_initial_output(unit, initial_output=None, initial_state=None, **kwargs)[source]

If this layer is used inside a recurrent layer, this function specifies the output of frame t=-1, if it is needed. As arguments, we get the usual layer arguments. batch_dim is added because it might be special because of beam search.

Note: This could maybe share code with RnnCellLayer.get_rec_initial_state(). We could also add support to make the initial output be the output of another layer.

Parameters:
  • batch_dim (tf.Tensor) – including beam size in beam search
  • name (str) – layer name
  • output (Data) – template
  • rec_layer (TFNetworkRecLayer.RecLayer) –
  • initial_output (str|float|int|tf.Tensor|None) –
Return type:

tf.Tensor

class TFNetworkRecLayer.GetLastHiddenStateLayer(n_out, combine='concat', key='*', **kwargs)[source]

Will combine (concat or add or so) all the last hidden states from all sources.
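
For example, to turn the final states of two encoder directions into a single vector (a sketch; layer names are illustrative, and n_out must match the combined state size):

"enc_state": {"class": "get_last_hidden_state",
              "from": ["encoder_fwd", "encoder_bwd"],
              "combine": "concat", "n_out": 1000}

The result could then serve e.g. as the initial_state of a decoder RnnCellLayer.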

Parameters:
  • n_out (int) – dimension. output will be of shape (batch, n_out)
  • combine (str) – “concat” or “add”
  • key (str|int|None) – for the state, which could be a namedtuple. see RnnCellLayer.get_state_by_key()
layer_class = 'get_last_hidden_state'[source]
get_last_hidden_state(key)[source]

If this is a recurrent layer, this would return the last hidden state. Otherwise, we return None.

Parameters:key (int|str|None) – also the special key “*”
Returns:optional tensor with shape (batch, dim)
Return type:tf.Tensor|None

classmethod get_out_data_from_opts(n_out, **kwargs)[source]

Gets a Data template (i.e. shape etc is set but not the placeholder) for our __init__ args. The purpose of having this as a separate classmethod is to be able to infer the shape information without having to construct the layer. This function should not create any nodes in the computation graph.

Parameters:kwargs – all the same kwargs as for self.__init__()
Returns:Data template (placeholder not set)
Return type:Data
class TFNetworkRecLayer.GetRecAccumulatedOutputLayer(sub_layer, **kwargs)[source]

For RecLayer with a subnet. If some layer is explicitly marked as an additional output layer (via ‘is_output_layer’: True), you can get that subnet layer output via this accessor. Retrieves the accumulated output.
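
For example, to retrieve the attention weights accumulated over all decoder steps (a sketch; layer names are illustrative):

# Inside the rec unit, mark the layer as an additional output:
"att_weights": {"class": "softmax_over_spatial", "from": ["energy"],
                "is_output_layer": True},

# Outside the rec layer, access its accumulated output:
"all_att_weights": {"class": "get_rec_accumulated", "from": ["decoder"],
                    "sub_layer": "att_weights"},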

Parameters:sub_layer (str) – layer of subnet in RecLayer source, which has ‘is_output_layer’: True
layer_class = 'get_rec_accumulated'[source]
classmethod get_out_data_from_opts(name, sources, sub_layer, **kwargs)[source]
Parameters:
  • name (str) –
  • sources (list[LayerBase]) –
  • sub_layer (str) –
Return type:

Data

class TFNetworkRecLayer.ChoiceLayer(beam_size, input_type='prob', explicit_search_source=None, length_normalization=True, scheduled_sampling=False, cheating=False, **kwargs)[source]

This layer represents a choice to be made in search during inference, such as choosing the top-k outputs from a log-softmax for beam search. During training, this layer can return the true label. This is supposed to be used inside the rec layer. This can be extended in various ways.

We present the scores in +log space, and we will add them up along the path. Assume that we get input (batch, dim) from a (log-)softmax. Assume that each batch entry is already a choice via search. In search with a beam size of N, we would output sparse (batch=N,) and the corresponding scores for each hypothesis.
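
A typical use inside a rec subnetwork could look like this (a minimal sketch; the beam size and layer names are illustrative):

"output_prob": {"class": "softmax", "from": ["readout"], "target": "classes"},
"output": {"class": "choice", "from": ["output_prob"], "input_type": "prob",
           "beam_size": 12, "target": "classes", "initial_output": 0},

During training, this returns the true labels from the target; in search, it keeps the top beam_size hypotheses per step.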

Parameters:
  • beam_size (int) – the outgoing beam size. i.e. our output will be (batch * beam_size, …)
  • input_type (str) – “prob” or “log_prob”: whether the input is in probability space or log-space; or “regression”, if it is a prediction of the data as-is.
  • explicit_search_source (LayerBase|None) – will mark it as an additional dependency
  • scheduled_sampling (dict|None) –
  • length_normalization (bool) – evaluates score_t/len in search
  • cheating (bool) – if True, will always add the true target in the beam
layer_class = 'choice'[source]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace
  • network (TFNetwork.TFNetwork) –
  • get_layer ((str) -> LayerBase) – function to get or construct another layer
classmethod get_out_data_from_opts(target, network, beam_size, **kwargs)[source]

Gets a Data template (i.e. shape etc is set but not the placeholder) for our __init__ args. The purpose of having this as a separate classmethod is to be able to infer the shape information without having to construct the layer. This function should not create any nodes in the computation graph.

Parameters:kwargs – all the same kwargs as for self.__init__()
Returns:Data template (placeholder not set)
Return type:Data
classmethod get_rec_initial_extra_outputs(network, beam_size, **kwargs)[source]
Parameters:
Return type:

dict[str,tf.Tensor]

classmethod get_rec_initial_extra_outputs_shape_invariants(**kwargs)[source]
Returns:optional shapes for the tensors by get_rec_initial_extra_outputs
Return type:dict[str,tf.TensorShape]
get_dep_layers()[source]
Returns:list of layers this layer depends on. normally this is just self.sources but e.g. the attention layer in addition has a base, etc.
Return type:list[LayerBase]
class TFNetworkRecLayer.DecideLayer(length_normalization=False, **kwargs)[source]

This is kind of the counter-part to the choice layer. This only has an effect in search mode. E.g. assume that the input is of shape (batch * beam, time, dim) and has search_sources set. Then this will output (batch, time, dim) where the beam with the highest score is selected. Thus, this will make a decision based on the scores. It will convert the data to batch-major mode.
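
Typically this is placed after the rec layer that contains the ChoiceLayer, e.g. (a sketch; the loss/target part is optional and illustrative):

"decision": {"class": "decide", "from": ["output"],
             "loss": "edit_distance", "target": "classes"}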

Parameters:length_normalization (bool) – performed on the beam scores
layer_class = 'decide'[source]
classmethod decide(src, output=None, name=None, length_normalization=False)[source]
Parameters:
  • src (LayerBase) – with search_choices set. e.g. input of shape (batch * beam, time, dim)
  • output (Data|None) –
  • name (str|None) –
  • length_normalization (bool) – performed on the beam scores
Returns:

best beam selected from input, e.g. shape (batch, time, dim)

Return type:

Data

classmethod get_out_data_from_opts(name, sources, network, **kwargs)[source]
Parameters:
Return type:

Data

class TFNetworkRecLayer.AttentionBaseLayer(base, **kwargs)[source]

This is the base class for attention. This layer would get constructed in the context of one single decoder step. We get the whole encoder output over all encoder frames (the base), e.g. (batch,enc_time,enc_dim), and some current decoder context, e.g. (batch,dec_att_dim), and we are supposed to return the attention output, e.g. (batch,att_dim).

Some sources:

  • Bahdanau, Bengio, Montreal, Neural Machine Translation by Jointly Learning to Align and Translate, 2015, https://arxiv.org/abs/1409.0473
  • Luong, Stanford, Effective Approaches to Attention-based Neural Machine Translation, 2015, https://arxiv.org/abs/1508.04025 -> dot, general, concat, location attention; comparison to Bahdanau
Parameters:base (LayerBase) – encoder output to attend on
get_dep_layers()[source]
Returns:list of layers this layer depends on. normally this is just self.sources but e.g. the attention layer in addition has a base, etc.
Return type:list[LayerBase]
get_base_weights()[source]

We can formulate most attentions as some weighted sum over the base time-axis.

Returns:the weighting of shape (batch, base_time), in case it is defined
Return type:tf.Tensor|None
get_base_weight_last_frame()[source]

From the base weights (see self.get_base_weights(); it must not return None), takes the weighting of the last frame in the time-axis (according to sequence lengths).

Returns:shape (batch,) -> float (number 0..1)
Return type:tf.Tensor
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace
  • network (TFNetwork.TFNetwork) –
  • get_layer ((str) -> LayerBase) – function to get or construct another layer. The name get_layer might be misleading, as this should return an existing layer, or construct it if it does not exist yet. network.get_layer would just return an existing layer.

Will modify d inplace such that it becomes the kwargs for self.__init__(). Mostly leaves d as-is. This is used by TFNetwork.construct_from_dict(). It resolves certain arguments, e.g. it resolves the “from” argument which is a list of strings, to make it the “sources” argument in kwargs, with a list of LayerBase instances. Subclasses can extend/overwrite this. Usually the only reason to overwrite this is when some argument might be a reference to a layer which should be resolved.

classmethod get_out_data_from_opts(name, base, n_out=None, **kwargs)[source]
Parameters:
  • name (str) –
  • n_out (int|None) –
  • base (LayerBase) –
Return type:

Data

class TFNetworkRecLayer.GlobalAttentionContextBaseLayer(base_ctx, **kwargs)[source]
Parameters:
  • base (LayerBase) – encoder output to attend on
  • base_ctx (LayerBase) – encoder output used to calculate the attention weights
get_dep_layers()[source]
Returns:list of layers this layer depends on. normally this is just self.sources but e.g. the attention layer in addition has a base, etc.
Return type:list[LayerBase]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace
  • network (TFNetwork.TFNetwork) –
  • get_layer ((str) -> LayerBase) – function to get or construct another layer. The name get_layer might be misleading, as this should return an existing layer, or construct it if it does not exist yet. network.get_layer would just return an existing layer.

Will modify d inplace such that it becomes the kwargs for self.__init__(). Mostly leaves d as-is. This is used by TFNetwork.construct_from_dict(). It resolves certain arguments, e.g. it resolves the “from” argument which is a list of strings, to make it the “sources” argument in kwargs, with a list of LayerBase instances. Subclasses can extend/overwrite this. Usually the only reason to overwrite this is when some argument might be a reference to a layer which should be resolved.

class TFNetworkRecLayer.GenericAttentionLayer(weights, auto_squeeze=True, **kwargs)[source]

The weighting for the base is specified explicitly here. This can e.g. be used together with SoftmaxOverSpatialLayer.
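
For example, together with SoftmaxOverSpatialLayer inside a rec subnetwork (a sketch; layer names are illustrative; "base:encoder" refers to a layer of the parent network):

"att_weights": {"class": "softmax_over_spatial", "from": ["energy"]},
"att": {"class": "generic_attention", "weights": "att_weights",
        "base": "base:encoder"},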

Parameters:
  • base (LayerBase) – encoder output to attend on. (B, enc-time)|(enc-time, B) + (…) + (n_out,)
  • weights (LayerBase) – attention weights. ((B, enc-time)|(enc-time, B)) + (1,)|()
  • auto_squeeze (bool) – auto-squeeze any weight-axes with dim=1 away
layer_class = 'generic_attention'[source]
get_dep_layers()[source]
Returns:list of layers this layer depends on. normally this is just self.sources but e.g. the attention layer in addition has a base, etc.
Return type:list[LayerBase]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
  • d (dict[str]) – will modify inplace
  • network (TFNetwork.TFNetwork) –
  • get_layer ((str) -> LayerBase) – function to get or construct another layer. The name get_layer might be misleading, as this should return an existing layer, or construct it if it does not exist yet. network.get_layer would just return an existing layer.

Will modify d inplace such that it becomes the kwargs for self.__init__(). Mostly leaves d as-is. This is used by TFNetwork.construct_from_dict(). It resolves certain arguments, e.g. it resolves the “from” argument which is a list of strings, to make it the “sources” argument in kwargs, with a list of LayerBase instances. Subclasses can extend/overwrite this. Usually the only reason to overwrite this is when some argument might be a reference to a layer which should be resolved.

classmethod get_out_data_from_opts(base, weights, auto_squeeze=True, **kwargs)[source]
Parameters:
Return type:

Data

class TFNetworkRecLayer.DotAttentionLayer(energy_factor=None, **kwargs)[source]

Classic global attention: Dot-product as similarity measure between base_ctx and source.
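
Inside a rec subnetwork, this could look like the following (a sketch; the encoder layers are assumed to exist in the parent network, and the energy_factor value is illustrative, e.g. 1/sqrt(256) = 0.0625 for a 256-dim context):

"att": {"class": "dot_attention", "from": ["s_transformed"],
        "base": "base:encoder", "base_ctx": "base:encoder_ctx",
        "energy_factor": 0.0625},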

Parameters:
  • base (LayerBase) – encoder output to attend on. defines output-dim
  • base_ctx (LayerBase) – encoder output used to calculate the attention weights, combined with input-data. dim must be equal to input-data
  • energy_factor (float|None) – the energy will be scaled by this factor. This is like a temperature for the softmax. In Attention-is-all-you-need, this is set to 1/sqrt(base_ctx.dim).
layer_class = 'dot_attention'[source]
class TFNetworkRecLayer.ConcatAttentionLayer(**kwargs)[source]

Additive attention / tanh-concat attention as similarity measure between base_ctx and source. This is used by Montreal, whereas Stanford compared it to the dot-attention. The concat-attention is maybe more standard for machine translation at the moment.

layer_class = 'concat_attention'[source]
class TFNetworkRecLayer.GenericWindowAttentionLayer(weights, window_size, **kwargs)[source]
layer_class = 'generic_window_attention'[source]
class TFNetworkRecLayer.GaussWindowAttentionLayer(window_size, std=1.0, inner_size=None, inner_size_step=0.5, **kwargs)[source]

Interprets the incoming source as the location (float32, shape (batch,)) and returns a gauss-window-weighting of the base around the location. The window size is fixed (TODO: but the variance can optionally be dynamic).

Parameters:
  • window_size (int) – the window size where the Gaussian window will be applied on the base
  • std (float) – standard deviation for Gauss
  • inner_size (int|None) – if given, the output will have an additional dimension of this size, where the location t is shifted by multiples of +/- inner_size_step around it; e.g. [t-1, t-0.5, t, t+0.5, t+1] would be the locations with inner_size=5 and inner_size_step=0.5.
  • inner_size_step (float) – see inner_size above
layer_class = 'gauss_window_attention'[source]
classmethod get_out_data_from_opts(inner_size=None, **kwargs)[source]
Parameters:
  • name (str) –
  • n_out (int|None) –
  • base (LayerBase) –
Return type:

Data

class TFNetworkRecLayer.SelfAttentionLayer(num_heads, total_key_dim, forward_weights_init='glorot_uniform', attention_dropout=0.0, attention_left_only=False, **kwargs)[source]

Applies self-attention on the input. I.e., with input x, it will basically calculate

att(Q x, K x, V x),

where att is multi-head dot-attention for now, and Q, K, V are matrices. The attention will be over the time dimension. If there is no time dimension, we expect to be inside a RecLayer; also, this is only valid with attention_left_only=True.

See also dot_product_attention here:
https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/layers/common_attention.py
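
A Transformer-style encoder block might use it like this (a minimal sketch; dimensions are illustrative):

"self_att": {"class": "self_attention", "from": ["data"],
             "num_heads": 8, "total_key_dim": 512, "n_out": 512,
             "attention_left_only": False},

Inside a RecLayer subnetwork (step-by-step decoding), attention_left_only=True is required and masks out the future.
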
Parameters:
  • num_heads (int) –
  • total_key_dim (int) –
  • forward_weights_init (str) – see TFUtil.get_initializer()
  • attention_dropout (float) –
  • attention_left_only (bool) – will mask out the future. see Attention is all you need.
layer_class = 'self_attention'[source]
recurrent = True[source]
classmethod get_out_data_from_opts(n_out, name, sources, **kwargs)[source]
Parameters:
  • n_out (int) –
  • name (str) –
  • sources (list[LayerBase]) –
Return type:

Data

classmethod get_rec_initial_extra_outputs(batch_dim, rec_layer, num_heads, total_key_dim, n_out, sources=(), **kwargs)[source]
Parameters:
Return type:

dict[str,tf.Tensor]

classmethod get_rec_initial_extra_outputs_shape_invariants(num_heads, total_key_dim, n_out, sources, **kwargs)[source]
Returns:optional shapes for the tensors by get_rec_initial_extra_outputs
Return type:dict[str,tf.TensorShape]
class TFNetworkRecLayer.PositionalEncodingLayer(add_to_input=False, **kwargs)[source]

Provides positional encoding in the form of (batch, time, n_out), where n_out is the number of channels, if it is run outside a RecLayer, or (batch, n_out) if run inside a RecLayer, where it will depend on the current time frame.

Assumes one source input with a time dimension if outside a RecLayer. By default (“from” key not provided), it would either use “data”, or “:i”. With add_to_input, it will calculate x + input.

The positional encoding is the same as in Tensor2Tensor. See TFUtil.get_positional_encoding().
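
For example, to add positional information to an embedding (a sketch; layer names are illustrative):

"embed_pos": {"class": "positional_encoding", "add_to_input": True,
              "from": ["embed"]},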

Parameters:add_to_input (bool) – will add the signal to the input
layer_class = 'positional_encoding'[source]
recurrent = True[source]
classmethod transform_config_dict(d, network, get_layer)[source]
Parameters:
classmethod get_out_data_from_opts(name, network, add_to_input=False, sources=(), **kwargs)[source]
Parameters:
Return type:

Data

class TFNetworkRecLayer.KenLmStateLayer(lm_file, vocab_file=None, vocab_unknown_label='UNK', bpe_merge_symbol=None, input_step_offset=0, dense_output=False, debug=False, **kwargs)[source]

Gets the next word (or subword) each frame, accumulates the string, keeps the state of the string seen so far, and returns the score (+log space, natural base e) of the sequence, using KenLM (http://kheafield.com/code/kenlm/) (see TFKenLM). The EOS token (</s>) must be used explicitly.
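
A sketch of possible usage inside a rec subnetwork (the file paths and layer names here are hypothetical):

"lm": {"class": "kenlm", "lm_file": "/path/to/lm.arpa",
       "vocab_file": "/path/to/vocab.txt", "bpe_merge_symbol": "@@",
       "from": ["output"]},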

Parameters:
  • lm_file (str|()->str) – ARPA file or so. whatever KenLM supports
  • vocab_file (str|None) – if the inputs are symbols, this must be provided. see Vocabulary
  • vocab_unknown_label (str) – for the vocabulary
  • bpe_merge_symbol (str|None) – e.g. “@@” if you want to apply BPE merging
  • input_step_offset (int) – if provided, will consider the input only from this step onwards
  • dense_output (bool) – whether we output the score for all possible succeeding tokens
  • debug (bool) – prints debug info
layer_class = 'kenlm'[source]
recurrent = True[source]
classmethod get_out_data_from_opts(name, sources, vocab_file=None, vocab_unknown_label='UNK', dense_output=False, **kwargs)[source]

Gets a Data template (i.e. shape etc is set but not the placeholder) for our __init__ args. The purpose of having this as a separate classmethod is to be able to infer the shape information without having to construct the layer. This function should not create any nodes in the computation graph.

Parameters:kwargs – all the same kwargs as for self.__init__()
Returns:Data template (placeholder not set)
Return type:Data
classmethod get_rec_initial_extra_outputs(batch_dim, rec_layer, sources=(), **kwargs)[source]
Parameters:
Return type:

dict[str,tf.Tensor]

class TFNetworkRecLayer.BaseRNNCell(trainable=True, name=None, dtype=None, activity_regularizer=None, **kwargs)[source]

Extends rnn_cell.RNNCell by having explicit static attributes describing some properties.

get_input_transformed(x, batch_dim=None)[source]

Usually the cell itself does the transformation on the input. However, it would be faster to do it outside the recurrent loop. This function will get called outside the loop.

Parameters:
  • x (tf.Tensor) – (time, batch, dim), or (batch, dim)
  • batch_dim (tf.Tensor|None) –
Returns:

like x, maybe other feature-dim

Return type:

tf.Tensor|tuple[tf.Tensor]

class TFNetworkRecLayer.RHNCell(num_units, is_training=None, depth=5, dropout=0.0, dropout_seed=None, transform_bias=None, batch_size=None)[source]

Recurrent Highway Layer. With optional dropout for recurrent state (fixed over all frames - some call this variational).

References:
  • https://github.com/julian121266/RecurrentHighwayNetworks/
  • https://arxiv.org/abs/1607.03474
Parameters:
  • num_units (int) –
  • is_training (bool|tf.Tensor|None) –
  • depth (int) –
  • dropout (float) –
  • dropout_seed (int) –
  • transform_bias (float|None) –
  • batch_size (int|tf.Tensor|None) –
output_size[source]

Integer or TensorShape: size of outputs produced by this cell.

state_size[source]

size(s) of state(s) used by this cell.

It can be represented by an Integer, a TensorShape or a tuple of Integers or TensorShapes.

get_input_transformed(x, batch_dim=None)[source]
Parameters:x (tf.Tensor) – (time, batch, dim)
Returns:(time, batch, num_units * 2)
Return type:tf.Tensor
call(inputs, state)[source]
Parameters:
  • inputs (tf.Tensor) –
  • state (tf.Tensor) –
Returns:

(output, state)

Return type:

(tf.Tensor, tf.Tensor)

class TFNetworkRecLayer.BlocksparseLSTMCell(*args, **kwargs)[source]

Standard LSTM but uses OpenAI blocksparse kernels to support bigger matrices.

It uses our own wrapper, see TFNativeOp.init_blocksparse().

call(*args, **kwargs)[source]

The logic of the layer lives here.

Arguments:
  • inputs: input tensor(s).
  • **kwargs: additional keyword arguments.
Returns:Output tensor(s).
load_params_from_native_lstm(values_dict, session)[source]
Parameters:
  • session (tf.Session) –
  • values_dict (dict[str,numpy.ndarray]) –
class TFNetworkRecLayer.BlocksparseMultiplicativeMultistepLSTMCell(*args, **kwargs)[source]

Multiplicative LSTM with multiple steps, as in the OpenAI blocksparse paper. Uses OpenAI blocksparse kernels to support bigger matrices.

call(*args, **kwargs)[source]

The logic of the layer lives here.

Arguments:
  • inputs: input tensor(s).
  • **kwargs: additional keyword arguments.
Returns:Output tensor(s).
class TFNetworkRecLayer.LayerNormVariantsLSTMCell(num_units, norm_gain=1.0, norm_shift=0.0, activation=<function tanh>, is_training=None, dropout=0.0, dropout_h=0.0, dropout_seed=None, with_concat=False, global_norm=True, global_norm_joined=False, per_gate_norm=False, cell_norm=True, cell_norm_in_output=True, hidden_norm=False, variance_epsilon=1e-12)[source]

LSTM unit with layer normalization and recurrent dropout

This LSTM cell can apply different variants of layer normalization:

1. Layer normalization as in the original paper (https://arxiv.org/abs/1607.06450). This can be applied by having:

   all default params (global_norm=True, cell_norm=True, cell_norm_in_output=True)

2. Layer normalization for RNMT+ (https://arxiv.org/abs/1804.09849). This can be applied by having:

   all default params except:
     • global_norm = False
     • per_gate_norm = True
     • cell_norm_in_output = False

3. TF official LayerNormBasicLSTMCell (https://www.tensorflow.org/api_docs/python/tf/contrib/rnn/LayerNormBasicLSTMCell). This can be reproduced by having:

   all default params except:
     • global_norm = False
     • per_gate_norm = True

4. Sockeye LSTM layer normalization implementations (https://github.com/awslabs/sockeye/blob/master/sockeye/rnn.py):

   LayerNormLSTMCell can be reproduced by having:
     all default params except:
       • with_concat = False (just efficiency, no difference in the model)

   LayerNormPerGateLSTMCell can be reproduced by having:
     all default params except:
       • (with_concat = False)
       • global_norm = False
       • per_gate_norm = True

Recurrent dropout is based on https://arxiv.org/abs/1603.05118.
Parameters:
  • num_units (int) – number of lstm units
  • norm_gain (float) – layer normalization gain value
  • norm_shift (float) – layer normalization shift (bias) value
  • activation – Activation function to be applied in the lstm cell
  • is_training (bool) – if True then we are in the training phase
  • dropout (float) – dropout rate, applied on cell-in (j)
  • dropout_h (float) – dropout rate, applied on hidden state (h) when it enters the LSTM (variational dropout)
  • dropout_seed (int) – used to create random seeds
  • with_concat (bool) – if True then the input and the previous hidden state are concatenated for the computation. This is just about computation performance.
  • global_norm (bool) – if True then layer normalization is applied for the forward and recurrent outputs (separately).
  • global_norm_joined (bool) – if True, then layer norm is applied on the joined LSTM input (forward and recurrent output together)
  • per_gate_norm (bool) – if True then layer normalization is applied per lstm gate
  • cell_norm (bool) – if True then layer normalization is applied to the LSTM new cell output
  • cell_norm_in_output (bool) – if True, the normalized cell is also used in the output
  • hidden_norm (bool) – if True then layer normalization is applied to the LSTM new hidden state output
output_size[source]

Integer or TensorShape: size of outputs produced by this cell.

state_size[source]

size(s) of state(s) used by this cell.

It can be represented by an Integer, a TensorShape or a tuple of Integers or TensorShapes.

get_input_transformed(inputs, batch_dim=None)[source]

Usually the cell itself does the transformation on the input. However, it would be faster to do it outside the recurrent loop. This function will get called outside the loop.

Parameters:
  • inputs (tf.Tensor) – (time, batch, dim), or (batch, dim)
  • batch_dim (tf.Tensor|None) –
Returns:

like inputs, maybe other feature-dim

Return type:

tf.Tensor|tuple[tf.Tensor]