Basic Usage

Install RETURNN, Installation.

Now is the main entry point. Usage:

./ <config-file> [other-params]

where config-file is a config file for RETURNN. See here for an example, and many more examples from the demos. The configuration syntax can be in three different forms:

  • a simple line-based file with key value pairs
  • a JSON file (determined by a “{” at the beginning of the file)
  • executable python code (determined by a “#!” at the beginning of the file)

Config files using the python code syntax are the de-facto standard for all current examples and setups. The parameters can be set by defining global variables, but it is possible to use any form of python code such as functions and classes to construct your network or fill in global variables based on more complex decisions. The python syntax config files may also contain additional code such as layer or dataset definitions.

When calling will execute some task, such as train, forward or search.

The task train will train a model specified by a given network structure. After training each epoch on provided `training data, the current parameters will be stored to a model checkpoint file. Besides the training data, a development dataset is used to evaluate the current model, and store the evaluation results in a separate file.

The task forward will run a forward pass of the network, given an evaluation dataset, and store the results in an HDF file.

The task search is used to run the network with the beam-search algorithm. The results are serialized into text form and stored in a plain text file python dictionary format file.

The following parameters are very common, and are used in most RETURNN config files:

The task, such as train, forward or search.
E.g. gpu or cpu. Although RETURNN will automatically detect and use a GPU if available, a specific device can be enforced by setting this parameter.
If you set this to True, the TensorFlow will be used. Otherwise, the installed backend is used. If both backends are installed (TensorFlow and Theano), RETURNN will use Theano as default for legacy reasons.
train / dev / eval

The datasets parameters are set to a python dict with a mandatory entry class. The class attribute needs to be set to the class name of the dataset that should be used. An overview over available datasets can be found here. train and dev are used during training, while eval is usually used to define the dataset for the forward or search task.

Beside passing the constructor parameters to the specficic Dataset, there are some common parameters such as:

seq_ordering: This defines the order of the sequences provided by the dataset. Possible values are:

  • default: Keep the sequences as is
  • reverse: Use the default sequences in reversed order
  • random: Shuffle the data with a predefined fixed seed
  • random:<seed>: Shuffle the data with the seed given
  • sorted: Sort by length (only if available), beginning with shortest sequences
  • sorted_reverse: Sort by length, beginning with longest sequences
  • laplace:<n_buckets>: Sort by length with n laplacian buckets (one bucket means going from shortest to longest and back with 1/n of the data).
  • laplace:.<n_sequences>: sort by length with n sequences per laplacian bucket.

Note that not all sequence order modes are available for all datasets, and some datasets may provide additional modes.


Defines the source/target dimensions of the data. Both can be integers. extern_data can also be a dict if your dataset has other data streams. The standard source data is called “data” by default, and the standard target data is called “classes” by default. You can also specify whether your data is dense or sparse (i.e. it is just the index), which is specified by the number of dimensions, i.e. 2 (time-dim + feature-dim) or 1 (just time-dim). When using no explicit definition, it is assumed that the data contains a time axis.

Example: extern_data = {"data": [100, 2], "classes": [5000, 1]}. This defines an input dimension of 100, and the input is dense (2), and an output dimension of 5000, and the output provided by the dataset is sparse (1).

For a more explicit definition of the shapes, you can provide a dict instead of a list or tuple. This dict may contain information to create “Data” objects. For extern_data, only dim and shape are required. Example: 'feature_data': {'dim': 80, 'shape': (None, 80)} This defines 80 dimensional features with a time axis of arbitrary length. Example: 'speaker_classes': {'dim': 1172, 'shape': (), 'sparse': True} This defines a sparse input for e.g. speaker classes that do not have a time axis.

In general, all input parameters to Data can be provided


This is a dict which defines the network topology. It consists of layer-names as strings, mapped on dicts, which defines the layers. The layer dict consists of keys as strings and the value type depends on the key. The layer dict should contain the key class which defines the class or type of the layer, such as hidden for a feed-forward layer, rec for a recurrent layer (including LSTM) or softmax for the output layer (doesn’t need to have the softmax activation). Usually it also contains the key n_out which defines the feature-dimension of the output of this layer, and the key from which defines the inputs to this layer, which is a list of other layers. If you omit from, it will automatically pass in the input data from the dataset. All layer dict keys are passed to the layer class __init__, so you have to refer to the code for all details.

Example of a 3 layer bidirectional LSTM:

network = {
"lstm0_fw" : { "class": "rec", "unit": "lstm", "n_out" : 500, "dropout": 0.1, "L2": 0.01, "direction": 1 },
"lstm0_bw" : { "class": "rec", "unit": "lstm", "n_out" : 500, "dropout": 0.1, "L2": 0.01, "direction": -1 },

"lstm1_fw" : { "class": "rec", "unit": "lstm", "n_out" : 500, "dropout": 0.1, "L2": 0.01, "direction": 1, "from" : ["lstm0_fw", "lstm0_bw"] },
"lstm1_bw" : { "class": "rec", "unit": "lstm", "n_out" : 500, "dropout": 0.1, "L2": 0.01, "direction": -1, "from" : ["lstm0_fw", "lstm0_bw"] },

"lstm2_fw" : { "class": "rec", "unit": "lstm", "n_out" : 500, "dropout": 0.1, "L2": 0.01, "direction": 1, "from" : ["lstm1_fw", "lstm1_bw"] },
"lstm2_bw" : { "class": "rec", "unit": "lstm", "n_out" : 500, "dropout": 0.1, "L2": 0.01, "direction": -1, "from" : ["lstm1_fw", "lstm1_bw"] },

"output" :   { "class" : "softmax", "loss" : "ce", "from" : ["lstm2_fw", "lstm2_bw"] }

See API or the code itself for documentation of the arguments for each layer class type. The rec layer class in particular supports a wide range of arguments, and several units which can be used, e.g. you can choose between different LSTM implementations, or GRU, or standard RNN, etc. See TFNetworkRecLayer.RecLayer or NetworkRecurrentLayer.RecurrentUnitLayer. See also TensorFlow LSTM Benchmark.

The total number of frames. A mini-batch has at least a time-dimension and a batch-dimension (or sequence-dimension), and depending on dense or sparse, also a feature-dimension. batch_size is the upper limit for time * sequences during creation of the mini-batches.
The maximum number of sequences in one mini-batch.
The learning rate during training, e.g. 0.01.
adam / nadam / …
E.g. set adam = True to enable the Adam optimization during training. See in for many more.
Defines the model file where RETURNN will save all model params after an epoch of training. For each epoch, it will suffix the filename by the epoch number. When running forward or search, the specified model will be loaded. The epoch can then be selected with the paramter load_epoch.
The number of epochs to train.
An integer. Common values are 3 or 4. Starting with 5, you will get an output per mini-batch.

There are much more parameters, and more details to many of the listed ones. Details on the parameters can be found in the parameter reference. As the reference is still incomplete, please watch out for additional parameters that can be found in the code.

All configuration params can also be passed as command line parameters. The generic form is ++param value, but more options are available. Please See the code for some usage.

See Technological Overview for more details and an overview how it all works.