class TFNetworkRecLayer.AttentionBaseLayer(base, **kwargs)[source]

This is the base class for attention. This layer would get constructed in the context of one single decoder step. We get the whole encoder output over all encoder frames (the base), e.g. (batch,enc_time,enc_dim), and some current decoder context, e.g. (batch,dec_att_dim), and we are supposed to return the attention output, e.g. (batch,att_dim).

Some sources:

  • Bahdanau, Bengio, Montreal, Neural Machine Translation by Jointly Learning to Align and Translate, 2015
  • Luong, Stanford, Effective Approaches to Attention-based Neural Machine Translation, 2015 -> dot, general, concat, location attention; comparison to Bahdanau
Parameters:base (LayerBase) – encoder output to attend on

Takes, from the base weights (see self.get_base_weights(), which must not return None), the weighting of the last frame on the time axis (according to the sequence lengths).

Returns:shape (batch,) -> float (number 0..1)
Return type:tf.Tensor

We can formulate most attentions as some weighted sum over the base time-axis.

Returns:the weighting of shape (batch, base_time), in case it is defined
Return type:tf.Tensor|None
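
The weighted-sum view and the last-frame weighting can be illustrated with a minimal sketch in plain Python (nested lists instead of tensors). The helper names weighted_sum and last_frame_weight are hypothetical and not part of this module; the shapes follow the ones named above.

```python
def weighted_sum(base, weights):
    """base: [batch][enc_time][enc_dim], weights: [batch][enc_time] -> [batch][enc_dim]."""
    out = []
    for frames, w in zip(base, weights):
        dim = len(frames[0])
        out.append([sum(w[t] * frames[t][d] for t in range(len(frames)))
                    for d in range(dim)])
    return out


def last_frame_weight(weights, seq_lens):
    """Weight of the last valid frame of each sequence -> [batch]."""
    return [w[n - 1] for w, n in zip(weights, seq_lens)]


base = [[[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]],   # sequence of length 3
        [[2.0, 2.0], [4.0, 0.0], [0.0, 0.0]]]   # sequence of length 2, last frame is padding
weights = [[0.5, 0.25, 0.25],
           [0.75, 0.25, 0.0]]
seq_lens = [3, 2]

att = weighted_sum(base, weights)             # shape (batch, enc_dim)
last = last_frame_weight(weights, seq_lens)   # shape (batch,)
```

Note that the second sequence is padded, so its last frame is index 1, not index 2.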
classmethod get_out_data_from_opts(base, n_out=None, **kwargs)[source]
Parameters:base (LayerBase) –
Return type:Data
classmethod transform_config_dict(d, network, get_layer)[source]
class TFNetworkRecLayer.ChoiceLayer(beam_size, input_type='prob', **kwargs)[source]

This layer represents a choice to be made in search during inference, such as choosing the top-k outputs from a log-softmax for beam search. During training, this layer can return the true label. This is supposed to be used inside the rec layer. This can be extended in various ways.

We present the scores in +log space, and we will add them up along the path. Assume that we get input (batch,dim) from a (log-)softmax. Assume that each batch is already a choice via search. In search with a beam size of N, we would output sparse (batch=N,) and scores for each.

  • beam_size (int) – the outgoing beam size. i.e. our output will be (batch * beam_size, ...)
  • input_type (str) – “prob” or “log”, whether the input is in probability space, log-space, etc
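
As an illustration of the choice made in search, the following is a minimal sketch of one beam-search step for a single sequence, in plain Python. The function choice_step and its return format are assumptions for this example, not the layer's actual implementation; only beam_size and input_type correspond to the parameters above.

```python
import heapq
import math


def choice_step(prev_scores, inputs, beam_size, input_type="prob"):
    """
    prev_scores: [beam_in] accumulated log scores of the incoming hypotheses.
    inputs:      [beam_in][dim] softmax (or log-softmax) output per hypothesis.
    Returns beam_size tuples (score, source_beam, label), best first.
    """
    if input_type == "prob":
        log_probs = [[math.log(p) for p in row] for row in inputs]
    elif input_type == "log":
        log_probs = inputs
    else:
        raise ValueError("unknown input_type %r" % input_type)
    # Scores are in +log space, so they add up along the path.
    candidates = [
        (prev + lp, beam, label)
        for beam, (prev, row) in enumerate(zip(prev_scores, log_probs))
        for label, lp in enumerate(row)
    ]
    return heapq.nlargest(beam_size, candidates, key=lambda c: c[0])


prev_scores = [math.log(0.6), math.log(0.4)]
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.1, 0.8]]
best = choice_step(prev_scores, probs, beam_size=2)
```

Each returned tuple records which incoming hypothesis was extended (source_beam) and with which label, which is the information a later decision over the beam needs.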
classmethod get_out_data_from_opts(target, network, beam_size, **kwargs)[source]
classmethod get_rec_initial_extra_outputs(network, beam_size, **kwargs)[source]
layer_class = 'choice'[source]
class TFNetworkRecLayer.ConcatAttentionLayer(**kwargs)[source]

Additive attention / tanh-concat attention as similarity measure between base_ctx and source. This is used by Montreal (Bahdanau et al.), whereas Stanford (Luong et al.) compared it to dot-attention. Concat-attention is perhaps the more standard choice for machine translation at the moment.
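
A minimal sketch of additive attention weights for one sequence, in plain Python. It assumes that base_ctx and source are already projected to the same dimension and that a learned vector v (hypothetical here) maps the tanh output to a scalar energy; the actual layer may differ in these details.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def concat_attention_weights(base_ctx, source, v):
    """
    base_ctx: [enc_time][dim] projected encoder frames.
    source:   [dim] current decoder context, same dim as base_ctx.
    v:        [dim] hypothetical learned vector producing the scalar energy.
    Returns attention weights of shape [enc_time].
    """
    energies = [
        sum(v_d * math.tanh(b_d + s_d) for v_d, b_d, s_d in zip(v, frame, source))
        for frame in base_ctx
    ]
    return softmax(energies)


base_ctx = [[0.0, 0.0], [1.0, -1.0], [2.0, 2.0]]
source = [0.0, 0.0]
v = [1.0, 1.0]
weights = concat_attention_weights(base_ctx, source, v)
```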

layer_class = 'concat_attention'[source]
class TFNetworkRecLayer.DecideLayer(**kwargs)[source]

This is the counterpart to the choice layer. It only has an effect in search mode. E.g. assume that the input is of shape (batch * beam, time, dim) and has search_choices set. Then this will output (batch, time, dim), where the beam with the highest score is selected. Thus, this makes a decision based on the scores. It will convert the data to batch-major mode.
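
The selection can be sketched in plain Python as follows. The function decide_best_beam is hypothetical, and the layout in which the beams of one sequence are adjacent in the flattened (batch * beam) axis is an assumption of this example.

```python
def decide_best_beam(beam_values, beam_scores, beam_size):
    """
    beam_values: list of length batch * beam_size; beams of one sequence are adjacent (assumed).
    beam_scores: [batch][beam_size] accumulated scores.
    Returns one value per batch entry, taken from the best-scoring beam.
    """
    out = []
    for b, scores in enumerate(beam_scores):
        best = max(range(beam_size), key=lambda k: scores[k])
        out.append(beam_values[b * beam_size + best])
    return out


# Two sequences with beam size 2; each value stands for one hypothesis (a label sequence).
beam_values = [[1, 2], [1, 3],
               [7, 8], [7, 9]]
beam_scores = [[-1.2, -0.3],
               [-0.1, -2.0]]
decided = decide_best_beam(beam_values, beam_scores, beam_size=2)
```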

classmethod decide(src, output=None, name=None)[source]
  • src (LayerBase) – with search_choices set. e.g. input of shape (batch * beam, time, dim)
  • output (Data|None) –
  • name (str|None) –

Returns:best beam selected from input, e.g. shape (batch, time, dim)
classmethod get_out_data_from_opts(name, sources, network, **kwargs)[source]
layer_class = 'decide'[source]
class TFNetworkRecLayer.DotAttentionLayer(**kwargs)[source]

Classic global attention: Dot-product as similarity measure between base_ctx and source.
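
A minimal sketch of dot-product attention for one sequence, in plain Python. The function dot_attention is hypothetical; it assumes that base_ctx and source have the same dimension and that the energies are normalized with a softmax.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot_attention(base, base_ctx, source):
    """
    base:     [enc_time][enc_dim] encoder output to attend on.
    base_ctx: [enc_time][dim] encoder output used to compute the weights.
    source:   [dim] current decoder context.
    Returns (attention output [enc_dim], weights [enc_time]).
    """
    energies = [sum(c * s for c, s in zip(frame, source)) for frame in base_ctx]
    weights = softmax(energies)
    enc_dim = len(base[0])
    out = [sum(w * frame[d] for w, frame in zip(weights, base)) for d in range(enc_dim)]
    return out, weights


base = [[2.0, 4.0], [6.0, 8.0]]
base_ctx = [[1.0, 0.0], [0.0, 1.0]]

# A zero decoder context gives equal energies, hence uniform weights.
out_uniform, w_uniform = dot_attention(base, base_ctx, source=[0.0, 0.0])
# A context aligned with the first frame puts almost all weight on it.
out_peaked, w_peaked = dot_attention(base, base_ctx, source=[10.0, 0.0])
```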

layer_class = 'dot_attention'[source]
class TFNetworkRecLayer.GaussWindowAttentionLayer(window_size, std=1.0, inner_size=None, inner_size_step=0.5, **kwargs)[source]

Interprets the incoming source as the location (float32, shape (batch,)) and returns a gauss-window-weighting of the base around the location. The window size is fixed (TODO: but the variance can optionally be dynamic).

  • window_size (int) – the window size where the Gaussian window will be applied on the base
  • std (float) – standard deviation for Gauss
  • inner_size (int|None) – if given, the output will have an additional dimension of this size, where t is shifted by +/- inner_size_step around. e.g. [t-1,t-0.5,t,t+0.5,t+1] would be the locations with inner_size=5 and inner_size_step=0.5.
  • inner_size_step (float) – see inner_size above
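
The windowing can be sketched in plain Python as follows. Both helpers are hypothetical. inner_locations follows the example given for inner_size above; the centering of the window on the rounded location, the clipping to valid frames and the normalization of the weights are assumptions of this sketch.

```python
import math


def inner_locations(t, inner_size, inner_size_step):
    """Locations shifted around t, e.g. [t-1, t-0.5, t, t+0.5, t+1]."""
    half = inner_size // 2
    return [t + (k - half) * inner_size_step for k in range(inner_size)]


def gauss_window_weights(loc, window_size, std, base_time):
    """
    Gaussian weights over a fixed-size window of base frames around loc.
    Returns (frame indices, normalized weights); frames outside the base are dropped.
    """
    start = int(round(loc)) - window_size // 2
    frames = [i for i in range(start, start + window_size) if 0 <= i < base_time]
    if not frames:
        return [], []
    raw = [math.exp(-0.5 * ((i - loc) / std) ** 2) for i in frames]
    total = sum(raw)
    return frames, [r / total for r in raw]


locs = inner_locations(3.0, inner_size=5, inner_size_step=0.5)
frames, weights = gauss_window_weights(3.0, window_size=5, std=1.0, base_time=10)
```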
classmethod get_out_data_from_opts(inner_size=None, **kwargs)[source]
layer_class = 'gauss_window_attention'[source]
class TFNetworkRecLayer.GetLastHiddenStateLayer(n_out, combine='concat', **kwargs)[source]

Will combine (concat or add or so) all the last hidden states from all sources.

  • n_out (int) – dimension. output will be of shape (batch, n_out)
  • combine (str) – “concat” or “add”
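
The two combine modes can be sketched in plain Python as follows. The function combine_last_states is hypothetical and only illustrates how n_out relates to the source dimensions: the sum of the dimensions for “concat”, the common dimension for “add”.

```python
def combine_last_states(states, combine="concat"):
    """
    states: list over sources, each [batch][dim] (the last hidden state of that source).
    Returns [batch][n_out].
    """
    if combine == "concat":
        return [sum((list(s) for s in per_batch), []) for per_batch in zip(*states)]
    if combine == "add":
        return [[sum(vals) for vals in zip(*per_batch)] for per_batch in zip(*states)]
    raise ValueError("unknown combine mode %r" % combine)


fwd = [[1.0, 2.0]]  # batch of 1, dim 2
bwd = [[3.0, 4.0]]
concat_out = combine_last_states([fwd, bwd], combine="concat")  # n_out = 4
add_out = combine_last_states([fwd, bwd], combine="add")        # n_out = 2
```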
classmethod get_out_data_from_opts(n_out, **kwargs)[source]
layer_class = 'get_last_hidden_state'[source]
class TFNetworkRecLayer.GlobalAttentionContextBaseLayer(base_ctx, **kwargs)[source]
Parameters:base_ctx (LayerBase) – encoder output used to calculate the attention weights
classmethod transform_config_dict(d, network, get_layer)[source]
class TFNetworkRecLayer.RecLayer(unit='lstm', direction=None, input_projection=True, initial_state=None, max_seq_len=None, forward_weights_init=None, recurrent_weights_init=None, bias_init=None, **kwargs)[source]

Recurrent layer with support for several LSTM implementations (via the unit argument; see the TensorFlow LSTM benchmark), as well as GRU or a simple RNN.

A subnetwork can also be given which will be evaluated step-by-step, which can use attention over some separate input, which can be used to implement a decoder in a sequence-to-sequence scenario.

  • unit (str|dict[str,dict[str]]) – the RNNCell/etc name, e.g. “nativelstm”. see comment below. alternatively a whole subnetwork, which will be executed step by step, and which can include “prev” in addition to “from” to refer to previous steps.
  • direction (int|None) – None|1 -> forward, -1 -> backward
  • input_projection (bool) – True -> input is multiplied with matrix. False only works if same input dim
  • initial_state (LayerBase|None) –
  • max_seq_len (int) – if unit is a subnetwork
  • forward_weights_init (str) – see TFUtil.get_initializer()
  • recurrent_weights_init (str) – see TFUtil.get_initializer()
  • bias_init (str) – see TFUtil.get_initializer()
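
As an illustration of how these options could appear in a network config, here is a sketch of a bidirectional pair of layers with class “rec”. The layer names, the n_out value and the “from” entries are assumptions for this example and are passed on via **kwargs; only unit and direction are taken from the parameters above.

```python
network = {
    # Forward LSTM over the input sequence.
    "lstm_fwd": {"class": "rec", "unit": "nativelstm", "direction": 1,
                 "n_out": 128, "from": ["data"]},
    # Backward LSTM over the same input.
    "lstm_bwd": {"class": "rec", "unit": "nativelstm", "direction": -1,
                 "n_out": 128, "from": ["data"]},
}
```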
static convert_cudnn_canonical_to_lstm_block(reader, prefix, target='lstm_block_wrapper/')[source]

This currently assumes CudnnLSTM with num_layers=1, input_mode=“linear_input”, direction=“unidirectional”!

  • reader (tf.train.CheckpointReader) –
  • prefix (str) – e.g. “layer2/rec/”
  • target (str) – e.g. “lstm_block_wrapper/” or “rnn/lstm_cell/”

Returns:dict key -> value, {“.../kernel”: ..., “.../bias”: ...} with prefix
classmethod get_out_data_from_opts(unit, sources=(), initial_state=None, **kwargs)[source]
classmethod get_rnn_cell_class(name)[source]
Parameters:name (str) – cell name, minus the “Cell” at the end
Return type:() -> tensorflow.contrib.rnn.RNNCell
layer_class = 'rec'[source]
recurrent = True[source]
classmethod transform_config_dict(d, network, get_layer)[source]
  • d (dict[str]) – will modify inplace
  • network (TFNetwork.TFNetwork) –
  • get_layer ((str) -> LayerBase) – function to get or construct another layer
class TFNetworkRecLayer.RnnCellLayer(n_out, unit, initial_state=None, unit_opts=None, **kwargs)[source]

Wrapper around tf.contrib.rnn.RNNCell. This performs a single step, i.e. there is no time dimension: we expect a (batch,n_in) input, and our output is (batch,n_out).

  • n_out (int) – so far, only output shape (batch,n_out) supported
  • unit (str|tf.contrib.rnn.RNNCell) – e.g. “BasicLSTM” or “LSTMBlock”
  • initial_state (str|float|LayerBase) – see self.get_rec_initial_state()
  • unit_opts (dict[str]|None) – passed to the cell.__init__
classmethod get_hidden_state_size(n_out, unit, unit_opts=None, **kwargs)[source]
classmethod get_out_data_from_opts(n_out, name, sources=(), **kwargs)[source]
classmethod get_rec_initial_extra_outputs(**kwargs)[source]
layer_class = 'rnn_cell'[source]