Also see the slides of our Interspeech 2020 tutorial about machine learning frameworks including RETURNN which explains the recurrency concept as well.

Recurrency := Anything which is defined by step-by-step execution, where current step depends on previous step, such as RNN, beam search, etc.

This is all covered by, which is a generic wrapper around tf.while_loop. It covers:

  • Definition of stochastic variables (the output classes itself but also latent variables) for either beam search or training (e.g. using ground truth values)
  • Automatic optimizations

The recurrent formula is defined in a way as it would be used for recognition. I.e. specifically you would define your output labels as stochastic variables, and their probability distribution. The automatic optimization will make this efficient for the case of training.

Also see Recurrent Sub-Networks for more about the usage of the

Stochastic variables

The layer to define them is The default behavior is:

  • In training, it will just return the ground truth values.
  • With search enabled (in recognition), it will do beam search.

Note that there can be multiple stochastic variables. Usually the output classes are one stochastic variable. But there can be additional stochastic variables, e.g. latent variables, e.g. for the segment boundaries or time position in a hard attention model.

For latent variables, you might want to perform search, while keeping the output labels fixed to the ground truth (“forced alignment”).

You might also want to perform search over the output labels in training, see Min expected risk training.

For details on how beam search is implemented, see Search.

For details about how to use it for recognition or generation, see Generation and Search.

Automatic optimization

The definition of the recurrent formula can have parts which are actually independent from the loop – maybe depending on the mode, e.g. in training. Automatic optimization will find parts of the formula (i.e. sub layers) which can be calculated independently from the loop, i.e. outside of the loop.

All layers are implemented in a way that they perform the same mathematical calculation whether they are inside the loop or outside.


network = {
    "input": {"class": "rec", "unit": "nativelstm2", "n_out": 20},  # encoder
    "input_last": {"class": "get_last_hidden_state", "from": "input", "n_out":40},

    "output": {"class": "rec", "from": [], "target": "classes", "unit": {  # decoder
        "embed": {"class":"linear", "activation":None, "from":"output", "n_out":10},
        "s": {"class": "rec", "unit": "nativelstm2", "n_out": 20, "from": "prev:embed", "initial_state": "base:input_last"},
        "p": {"class":"softmax", "from":"s", "target": "classes", "loss": "ce"},
        "output": {"class":"choice", "from":"p", "target":"classes", "beam_size":8}

In this example, in training:

  • output is using the ground truth values, i.e. independent of anything in the loop, and can be moved out.
  • embed depends on output, which is moved out, so it can also be calculated outside the loop.
  • s depends on embed, which is moved out, so it can also be calculated outside the loop. Note that s has some internal state, and in fact needs to be calculated recurrently. But because it can be calculated independently from the loop, it can make use of very efficient kernels: In this case, it uses our NativeLstm2 implementation.
  • p depends on s, and its loss calculation depends on the ground truth values, so it also can be calculated outside. This will result in a very efficient and parallel tf.matmul.

With search enabled, in recognition:

output depends on the probability distribution p. Effectively nothing can be moved out, because everything depends on each other. This is still as efficient as it possible can be. The output will use tf.nn.top_k internally.

This example also shows how one single definition of the network can be used for both training and recognition, and in a very efficient way.

Consider the Transformer as another example. The Transformer can be defined in a similar straight-forward way, using output for the output labels with In training, it will result naturally in the standard fully parallel training. In decoding, it is also as efficient as it possible can be.

Min expected risk training


  • Min expected WER training
  • Max expected BLEU training
  • Reinforcement learning

By default, would return the ground truth in training. However, this is flexible. In minimum expected risk training, you want to perform search also in training.

Example for min expected WER training:

"encoder": ...,

"output": {"class": "rec", "unit": { ...
    "output_prob": {"class": "softmax", "from": "readout", "target": "classes"},
    "output": {"class": "choice", "target": "classes", "beam_size": 4, "from": "output_prob", "initial_output": 0},
} ...}, # [T|’time:var:extern_data:classes’,B], int32, dim 1030, beam ’output’, beam size 4

"min_wer": {
    "class": "copy",
    "from": "",  # currently the syntax to enable search
    "loss": "expected_loss", # expect beam search results with beam scores
    "target": "classes",
    "loss_opts": {"loss": {"class": "edit_distance"}, "loss_kind": "error"}