class TFNetworkRecLayer.AttentionBaseLayer(base, **kwargs)[source]

This is the base class for attention. This layer would get constructed in the context of one single decoder step. We get the whole encoder output over all encoder frames (the base), e.g. (batch,enc_time,enc_dim), and some current decoder context, e.g. (batch,dec_att_dim), and we are supposed to return the attention output, e.g. (batch,att_dim).

Some sources:
  • Bahdanau, Bengio, Montreal, Neural Machine Translation by Jointly Learning to Align and Translate, 2015
  • Luong, Stanford, Effective Approaches to Attention-based Neural Machine Translation, 2015 -> dot, general, concat, location attention; comparison to Bahdanau
Parameters:base (LayerBase) – encoder output to attend on

From the base weights (see self.get_base_weights(), which must not return None), takes the weighting of the last frame in the time-axis (according to sequence lengths).

Returns:shape (batch,) -> float (number 0..1)
Return type:tf.Tensor
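
The last-frame selection can be illustrated with a minimal NumPy sketch. It is not the layer's actual TensorFlow code; the function name `last_frame_weights` and the example values are hypothetical, and the weights are assumed to be batch-major with shape (batch, base_time).

```python
import numpy as np

def last_frame_weights(base_weights, seq_lens):
    """
    base_weights: (batch, base_time) attention weights (assumed batch-major)
    seq_lens: (batch,) int, number of valid frames per sequence
    returns: (batch,) weight of the last valid frame of each sequence
    """
    batch_idx = np.arange(base_weights.shape[0])
    # The last valid frame of sequence b is at index seq_lens[b] - 1.
    return base_weights[batch_idx, seq_lens - 1]

weights = np.array([[0.1, 0.2, 0.7, 0.0],
                    [0.25, 0.25, 0.25, 0.25]])
seq_lens = np.array([3, 4])
last = last_frame_weights(weights, seq_lens)
```

Indexing by `seq_lens - 1` rather than by `-1` matters for padded batches, where the final position on the time axis may be padding.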

We can formulate most attentions as some weighted sum over the base time-axis.

Returns:the weighting of shape (batch, base_time), in case it is defined
Return type:tf.Tensor|None
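
As a rough illustration of the "weighted sum over the base time-axis" formulation, here is a NumPy sketch. The function name `attend` is hypothetical and the shapes follow the examples in the class description; the real layers operate on TensorFlow tensors.

```python
import numpy as np

def attend(base, weights):
    """
    base: (batch, enc_time, enc_dim) encoder output
    weights: (batch, enc_time) weighting over the encoder frames
    returns: (batch, enc_dim) attention output
    """
    # Sum over the time axis t, weighting each encoder frame.
    return np.einsum("bt,btd->bd", weights, base)

rng = np.random.default_rng(0)
base = rng.normal(size=(2, 5, 3))
uniform = np.full((2, 5), 0.2)  # uniform weights over 5 frames
out = attend(base, uniform)
```

With uniform weights the result is simply the mean over the encoder frames; the individual attention layers differ only in how they compute the weights.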
classmethod get_out_data_from_opts(base, n_out=None, **kwargs)[source]
Parameters:base (LayerBase) –
Return type:Data
classmethod transform_config_dict(d, network, get_layer)[source]
class TFNetworkRecLayer.ChoiceLayer(beam_size, input_type='prob', explicit_search_source=None, length_normalization=True, **kwargs)[source]

This layer represents a choice to be made in search during inference, such as choosing the top-k outputs from a log-softmax for beam search. During training, this layer can return the true label. This is supposed to be used inside the rec layer. This can be extended in various ways.

We present the scores in +log space, and we will add them up along the path. Assume that we get input (batch,dim) from a (log-)softmax. Assume that each batch is already a choice via search. In search with a beam size of N, we would output sparse (batch=N,) and scores for each.

  • beam_size (int) – the outgoing beam size. i.e. our output will be (batch * beam_size, ...)
  • input_type (str) – “prob”, “log_prob” or “regression”. “prob” and “log_prob” state whether the input is in probability space or log-space; “regression” means the input is a prediction of the data as-is.
  • explicit_search_source (LayerBase|None) – will mark it as an additional dependency
  • length_normalization (bool) – evaluates score_t/len in search
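
One search step as described above can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the layer's implementation: the function name `beam_step` is hypothetical, batch and beam are kept as separate axes (the layer itself merges them into one batch axis), and length normalization is left out.

```python
import numpy as np

def beam_step(prev_scores, log_probs, beam_size):
    """
    prev_scores: (batch, in_beam) accumulated log scores of the paths so far
    log_probs: (batch, in_beam, dim) log-softmax output for the current step
    returns: (scores, src_beams, labels), each of shape (batch, beam_size)
    """
    batch, in_beam, dim = log_probs.shape
    # Scores are in +log space, so extending a path means adding.
    total = prev_scores[:, :, None] + log_probs
    flat = total.reshape(batch, in_beam * dim)
    # Pick the beam_size best (source beam, label) combinations.
    top = np.argsort(-flat, axis=1)[:, :beam_size]
    scores = np.take_along_axis(flat, top, axis=1)
    return scores, top // dim, top % dim

prev_scores = np.log(np.array([[0.6, 0.4]]))
log_probs = np.log(np.array([[[0.1, 0.2, 0.7],
                              [0.5, 0.3, 0.2]]]))
scores, src_beams, labels = beam_step(prev_scores, log_probs, beam_size=2)
```

The returned `src_beams` records which incoming hypothesis each new hypothesis extends, which is what allows the best path to be traced back later.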
classmethod get_out_data_from_opts(target, network, beam_size, **kwargs)[source]
classmethod get_rec_initial_extra_outputs(network, beam_size, **kwargs)[source]
classmethod get_rec_initial_extra_outputs_shape_invariants(**kwargs)[source]
layer_class = 'choice'[source]
classmethod transform_config_dict(d, network, get_layer)[source]
  • d (dict[str]) – will modify inplace
  • network (TFNetwork.TFNetwork) –
  • get_layer ((str) -> LayerBase) – function to get or construct another layer
class TFNetworkRecLayer.ConcatAttentionLayer(**kwargs)[source]

Additive attention / tanh-concat attention as similarity measure between base_ctx and source. This is used by Montreal, whereas Stanford compared this to the dot-attention. The concat-attention is maybe more standard for machine translation at the moment.
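
A minimal NumPy sketch of additive attention follows. It assumes that base_ctx and source are already projected to a common dimension, so that adding them is equivalent to a linear map applied to their concatenation; the vector `v` and the function names are hypothetical and not taken from the layer's code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def concat_attention(base, base_ctx, source, v):
    """
    base: (batch, enc_time, enc_dim) encoder output to attend on
    base_ctx: (batch, enc_time, ctx_dim) encoder context for the weights
    source: (batch, ctx_dim) current decoder context
    v: (ctx_dim,) hypothetical learned vector reducing tanh(...) to a scalar
    """
    energy = np.tanh(base_ctx + source[:, None, :]) @ v  # (batch, enc_time)
    weights = softmax(energy, axis=1)
    return np.einsum("bt,btd->bd", weights, base), weights

rng = np.random.default_rng(0)
base = rng.normal(size=(2, 5, 3))
base_ctx = rng.normal(size=(2, 5, 4))
source = rng.normal(size=(2, 4))
v = rng.normal(size=(4,))
att, att_weights = concat_attention(base, base_ctx, source, v)
```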

layer_class = 'concat_attention'[source]
class TFNetworkRecLayer.DecideLayer(**kwargs)[source]

This is kind of the counter-part to the choice layer. This only has an effect in search mode. E.g. assume that the input is of shape (batch * beam, time, dim) and has search_sources set. Then this will output (batch, time, dim) where the beam with the highest score is selected. Thus, this makes a decision based on the scores. It will convert the data to batch-major mode.
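
The selection can be sketched in NumPy as below. The function name `select_best_beam` is hypothetical, and the sketch assumes that the beams of one batch entry are stored contiguously (index = batch_index * beam + beam_index); the actual layout in the layer may differ.

```python
import numpy as np

def select_best_beam(data, scores):
    """
    data: (batch * beam, time, dim), beams assumed contiguous per batch entry
    scores: (batch, beam) accumulated search scores
    returns: (batch, time, dim), the best-scoring beam per batch entry
    """
    batch, beam = scores.shape
    per_beam = data.reshape(batch, beam, *data.shape[1:])
    best = scores.argmax(axis=1)  # (batch,)
    return per_beam[np.arange(batch), best]

data = np.arange(6 * 4 * 1).reshape(6, 4, 1)  # batch=2, beam=3, time=4, dim=1
scores = np.array([[-1.0, -0.5, -2.0],
                   [-0.1, -3.0, -4.0]])
decided = select_best_beam(data, scores)
```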

classmethod decide(src, output=None, name=None)[source]
  • src (LayerBase) – with search_choices set. e.g. input of shape (batch * beam, time, dim)
  • output (Data|None) –
  • name (str|None) –

Returns:best beam selected from input, e.g. shape (batch, time, dim)
classmethod get_out_data_from_opts(name, sources, network, **kwargs)[source]
layer_class = 'decide'[source]
class TFNetworkRecLayer.DotAttentionLayer(energy_factor=None, **kwargs)[source]

Classic global attention: Dot-product as similarity measure between base_ctx and source.

  • base (LayerBase) – encoder output to attend on. defines output-dim
  • base_ctx (LayerBase) – encoder output used to calculate the attention weights, combined with input-data. dim must be equal to input-data
  • energy_factor (float|None) – the energy will be scaled by this factor. This is like a temperature for the softmax. In Attention-is-all-you-need, this is set to 1/sqrt(base_ctx.dim).
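
A NumPy sketch of dot-product attention with the optional energy scaling follows. It illustrates the idea only and is not the layer's code; the function names are hypothetical, and the shapes follow the parameter descriptions above.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dot_attention(base, base_ctx, source, energy_factor=None):
    """
    base: (batch, enc_time, enc_dim) encoder output to attend on
    base_ctx: (batch, enc_time, ctx_dim) encoder context for the weights
    source: (batch, ctx_dim) input data, same dim as base_ctx
    energy_factor: optional scale applied to the energy before the softmax
    """
    energy = np.einsum("btc,bc->bt", base_ctx, source)
    if energy_factor is not None:
        energy = energy * energy_factor
    weights = softmax(energy, axis=1)
    return np.einsum("bt,btd->bd", weights, base), weights

rng = np.random.default_rng(0)
base = rng.normal(size=(2, 5, 3))
base_ctx = rng.normal(size=(2, 5, 4))
source = rng.normal(size=(2, 4))
# Scaling as in Attention-is-all-you-need: 1/sqrt(base_ctx.dim).
scaled_out, scaled_w = dot_attention(base, base_ctx, source, 1.0 / np.sqrt(4))
# A factor of 0 flattens the softmax to uniform weights.
flat_out, flat_w = dot_attention(base, base_ctx, source, 0.0)
```

A smaller factor flattens the weight distribution and a larger one sharpens it, which is the sense in which it acts like a softmax temperature.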
layer_class = 'dot_attention'[source]
class TFNetworkRecLayer.GaussWindowAttentionLayer(window_size, std=1.0, inner_size=None, inner_size_step=0.5, **kwargs)[source]

Interprets the incoming source as the location (float32, shape (batch,)) and returns a gauss-window-weighting of the base around the location. The window size is fixed (TODO: but the variance can optionally be dynamic).

  • window_size (int) – the window size where the Gaussian window will be applied on the base
  • std (float) – standard deviation for Gauss
  • inner_size (int|None) – if given, the output will have an additional dimension of this size, where t is shifted by +/- inner_size_step around. e.g. [t-1,t-0.5,t,t+0.5,t+1] would be the locations with inner_size=5 and inner_size_step=0.5.
  • inner_size_step (float) – see inner_size above
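
The Gaussian weighting and the shifted locations for inner_size can be sketched as follows. This is a simplified NumPy illustration: the function names are hypothetical, the weights are evaluated over the whole base rather than a fixed window of window_size frames, and they are left unnormalized.

```python
import numpy as np

def gauss_weights(loc, enc_time, std=1.0):
    """
    loc: (batch,) float location on the base time axis
    returns: (batch, enc_time) unnormalized Gaussian weights centred on loc
    """
    t = np.arange(enc_time, dtype=np.float32)
    return np.exp(-0.5 * ((t[None, :] - loc[:, None]) / std) ** 2)

def inner_locations(loc, inner_size, inner_size_step):
    """
    returns: (batch, inner_size) locations shifted symmetrically around loc
    """
    offsets = (np.arange(inner_size) - (inner_size - 1) / 2) * inner_size_step
    return loc[:, None] + offsets[None, :]

loc = np.array([3.0])
w = gauss_weights(loc, enc_time=7, std=1.0)
shifted = inner_locations(loc, inner_size=5, inner_size_step=0.5)
```

With inner_size=5 and inner_size_step=0.5, `shifted` reproduces the [t-1, t-0.5, t, t+0.5, t+1] pattern from the parameter description.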
classmethod get_out_data_from_opts(inner_size=None, **kwargs)[source]
layer_class = 'gauss_window_attention'[source]
class TFNetworkRecLayer.GenericAttentionLayer(weights, **kwargs)[source]

The weighting for the base is specified explicitly here. This can e.g. be used together with SoftmaxOverSpatialLayer.

  • base (LayerBase) – encoder output to attend on. (B, enc-time)|(enc-time, B) + (n_out,)
  • weights (LayerBase) – attention weights. ((B, enc-time)|(enc-time, B)) + (1,)|()
layer_class = 'generic_attention'[source]
classmethod transform_config_dict(d, network, get_layer)[source]
class TFNetworkRecLayer.GetLastHiddenStateLayer(n_out, combine='concat', **kwargs)[source]

Combines the last hidden states from all sources, either by concatenation or by addition (see combine).

  • n_out (int) – dimension. output will be of shape (batch, n_out)
  • combine (str) – “concat” or “add”
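
The two combine modes can be sketched in NumPy as below. The function name `combine_last_states` is hypothetical, and each source is assumed to provide a last hidden state of shape (batch, dim).

```python
import numpy as np

def combine_last_states(states, combine="concat"):
    """
    states: list of (batch, dim_i) last hidden states, one per source
    returns: (batch, sum(dim_i)) for "concat", (batch, dim) for "add"
    """
    if combine == "concat":
        return np.concatenate(states, axis=-1)
    if combine == "add":
        # Addition requires all sources to have the same dimension.
        return np.sum(np.stack(states, axis=0), axis=0)
    raise ValueError("unknown combine mode %r" % combine)

a = np.ones((2, 3))
b = 2 * np.ones((2, 3))
concatenated = combine_last_states([a, b], "concat")
added = combine_last_states([a, b], "add")
```

Under this reading, n_out would be the sum of the source dimensions for “concat” and the shared dimension for “add”.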
classmethod get_out_data_from_opts(n_out, **kwargs)[source]
layer_class = 'get_last_hidden_state'[source]
class TFNetworkRecLayer.GlobalAttentionContextBaseLayer(base_ctx, **kwargs)[source]
  • base (LayerBase) – encoder output to attend on
  • base_ctx (LayerBase) – encoder output used to calculate the attention weights
classmethod transform_config_dict(d, network, get_layer)[source]
class TFNetworkRecLayer.RecLayer(unit='lstm', unit_opts=None, direction=None, input_projection=True, initial_state=None, max_seq_len=None, forward_weights_init=None, recurrent_weights_init=None, bias_init=None, optimize_move_layers_out=True, **kwargs)[source]

Recurrent layer, has support for several implementations of LSTMs (via unit argument), see TensorFlow LSTM benchmark, and also GRU, or simple RNN.

A subnetwork can also be given which will be evaluated step-by-step, which can use attention over some separate input, which can be used to implement a decoder in a sequence-to-sequence scenario.

  • unit (str|dict[str,dict[str]]) – the RNNCell/etc name, e.g. “nativelstm”. see comment below. alternatively a whole subnetwork, which will be executed step by step, and which can include “prev” in addition to “from” to refer to previous steps.
  • unit_opts (None|dict[str]) – passed to RNNCell creation
  • direction (int|None) – None|1 -> forward, -1 -> backward
  • input_projection (bool) – True -> input is multiplied with matrix. False only works if same input dim
  • initial_state (LayerBase|str|float|int|tuple|None) –
  • max_seq_len (int) – if unit is a subnetwork
  • forward_weights_init (str) – see TFUtil.get_initializer()
  • recurrent_weights_init (str) – see TFUtil.get_initializer()
  • bias_init (str) – see TFUtil.get_initializer()
  • optimize_move_layers_out (bool) – will automatically move layers out of the loop when possible
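
A network config using this layer might look roughly like the following sketch. The “class” value corresponds to layer_class = 'rec' and the “unit” and “direction” keys to the parameters above; the layer names, the n_out values and the “from” entries are assumptions made for illustration.

```python
# Hypothetical config fragment: a forward and a backward LSTM over the input.
network = {
    "lstm_fwd": {
        "class": "rec",      # selects RecLayer via its layer_class
        "unit": "lstm",
        "direction": 1,      # None or 1 -> forward
        "n_out": 128,        # assumed output dimension
        "from": ["data"],    # assumed name of the input layer
    },
    "lstm_bwd": {
        "class": "rec",
        "unit": "lstm",
        "direction": -1,     # -1 -> backward
        "n_out": 128,
        "from": ["data"],
    },
}
```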
static convert_cudnn_canonical_to_lstm_block(reader, prefix, target='lstm_block_wrapper/')[source]

This assumes CudnnLSTM currently, with num_layers=1, input_mode=“linear_input”, direction=“unidirectional”!

  • reader (tf.train.CheckpointReader) –
  • prefix (str) – e.g. “layer2/rec/”
  • target (str) – e.g. “lstm_block_wrapper/” or “rnn/lstm_cell/”

Returns:dict key -> value, {“.../kernel”: ..., “.../bias”: ...} with prefix
classmethod get_out_data_from_opts(unit, sources=(), initial_state=None, **kwargs)[source]
classmethod get_rnn_cell_class(name)[source]
Parameters:name (str) – cell name, minus the “Cell” at the end
Return type:() -> tensorflow.contrib.rnn.RNNCell
layer_class = 'rec'[source]
recurrent = True[source]
classmethod transform_config_dict(d, network, get_layer)[source]
  • d (dict[str]) – will modify inplace
  • network (TFNetwork.TFNetwork) –
  • get_layer ((str) -> LayerBase) – function to get or construct another layer
class TFNetworkRecLayer.RecStepInfoLayer(i, end_flag=None, seq_lens=None, **kwargs)[source]

Used by _SubnetworkRecCell. Represents the current step number.

  • i (tf.Tensor) – scalar, int32, current step (time)
  • end_flag (tf.Tensor|None) – (batch,), bool, says that the current sequence has ended
  • seq_lens (tf.Tensor|None) – (batch,) int32, seq lens
Returns:(batch,) of type bool. batch might include beam size
Return type:tf.Tensor
layer_class = ':i'[source]
class TFNetworkRecLayer.RnnCellLayer(n_out, unit, initial_state=None, unit_opts=None, weights_init='xavier', **kwargs)[source]

Wrapper around tf.contrib.rnn.RNNCell. This will operate a single step, i.e. there is no time dimension, i.e. we expect a (batch,n_in) input, and our output is (batch,n_out). This is expected to be used inside a RecLayer.

  • n_out (int) – so far, only output shape (batch,n_out) supported
  • unit (str|tf.contrib.rnn.RNNCell) – e.g. “BasicLSTM” or “LSTMBlock”
  • initial_state (str|float|LayerBase|tuple[LayerBase]|dict[LayerBase]) – see self._get_rec_initial_state(). This will be set via transform_config_dict(). To get the state from another recurrent layer, use the GetLastHiddenStateLayer (get_last_hidden_state).
  • unit_opts (dict[str]|None) – passed to the cell.__init__
classmethod get_hidden_state_size(n_out, unit, unit_opts=None, **kwargs)[source]
Returns:size or tuple of sizes
Return type:int|tuple[int]
classmethod get_out_data_from_opts(n_out, name, sources=(), **kwargs)[source]
classmethod get_rec_initial_extra_outputs(**kwargs)[source]
classmethod get_rec_initial_state(batch_dim, name, n_out, unit, initial_state=None, unit_opts=None, **kwargs)[source]

Very similar to get_rec_initial_output(). Initial hidden state when used inside a recurrent layer for the frame t=-1, if it is needed. As arguments, we get the usual layer arguments. batch_dim is added because it might be special because of beam search. Also see transform_config_dict() for initial_state.

Note: This could maybe share code with get_rec_initial_output(), although it is a bit more generic here because the state can also be a namedtuple or any kind of nested structure.

  • batch_dim (tf.Tensor) – including beam size in beam search
  • name (str) – layer name
  • n_out (int) – out dim
  • unit (str) – cell name
  • unit_opts (dict[str]|None) –
  • initial_state (LayerBase|str|int|float|None|list|tuple|namedtuple) – see code
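
To illustrate why the state may be a nested structure, here is a NumPy sketch that builds an all-zero initial state from a hidden state size given as an int or a tuple of ints. The function name `zero_state` is hypothetical and zero initialization is only one possible choice; the actual method also handles layers, strings and namedtuples, which this sketch does not.

```python
import numpy as np

def zero_state(batch_dim, state_size):
    """
    batch_dim: int, batch size (including beam size in beam search)
    state_size: int or (possibly nested) tuple of ints
    returns: zeros of shape (batch_dim, size), nested like state_size
    """
    if isinstance(state_size, int):
        return np.zeros((batch_dim, state_size), dtype=np.float32)
    # Recurse so that the result mirrors the nesting of state_size.
    return tuple(zero_state(batch_dim, s) for s in state_size)

single = zero_state(4, 8)         # e.g. a cell with one state tensor
paired = zero_state(4, (8, 8))    # e.g. an LSTM-like cell with two tensors
```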
layer_class = 'rnn_cell'[source]
classmethod transform_config_dict(d, network, get_layer)[source]
  • d (dict[str]) – will modify inplace
  • network (TFNetwork.TFNetwork) –
  • get_layer ((str) -> LayerBase) – function to get or construct another layer
static transform_initial_state(initial_state, network, get_layer)[source]
  • initial_state (str|float|int|list[str|float|int]|dict[str]|None) –
  • network (TFNetwork.TFNetwork) –
  • get_layer ((str) -> LayerBase) – function to get or construct another layer