See our wiki for an overview of possible distributed training variants.

Multi GPU training with PyTorch

Main configuration option is torch_distributed.

Example: Just put torch_distributed = {} into the config. This will by default use PyTorch DistributedDataParallel.

See our wiki on distributed PyTorch.

Multi GPU training with TensorFlow

This is about multi GPU training with the TensorFlow backend.

We currently use Horovod. Please refer to the Horovod documentation. Horovod provides simple TensorFlow ops for allreduce, allgather and broadcast, which will internally use the best available method, i.e. either NCCL for direct GPU transfer (on a single node), or MPI for any kind of transfer, including multiple nodes in a cluster network. Horovod requires that you have a working MPI setup.

Also see

We also have partial support for tf.distributed, i.e. the official TF way for distributed computation. Please refer to our wiki for an overview of distributed TensorFlow.

Also see


If you want to use NCCL, make sure it’s installed and it can be found.

You need to install some MPI. If you are in a cluster environment, usually you have that already. Check that you can run mpirun.

You need to install to install Horovod. This usually can be installed via pip:

pip3 install horovod[tensorflow]

For further information, please refer to the Horovod documentation.


In general, please refer to the Horovod documentation.

RETURNN will try to use Horovod when you specify use_horovod = True in your config (or via command line argument).

The implementation in RETURNN is pretty straight forward and follows mostly the tutorial. Try to understand that to get a basic understanding about how it works.

Relevant RETURNN settings

  • use_horovod: bool should be True

  • horovod_reduce_type: str one of:

    • "grad" means that we reduce the gradient after every step, and then use the same summed gradient to update the model in each instance. This is the default.

    • "param" means that every instance will do an update individually and after some N number of steps, we synchronize the models. This reduces the amount of communication and should increase the speed. Also configure horovod_param_sync_step when you use this. This is currently the recommended value.

  • horovod_param_sync_step: int: if the reduce type is param, this will specify after how many update steps the model parameters will be synchronized (i.e. averaged) The default is 1, but the recommended value is 100.

  • horovod_param_sync_time_diff: float: alternative to horovod_param_sync_step, e.g. 100. (secs), default None. This might be more efficient.

  • horovod_scale_lr: bool: whether to multiply the lr by number of instances (False by default)

  • horovod_dataset_distribution: str one of:

    • "shard": uses sharding for the dataset (via batch_slice for FeedDictDataProvider) This is the default.

    • "random_seed_offset": sets the default random_seed_offset via the rank This is currently the recommended value.


You should use a fast dataset implementation, or use horovod_dataset_distribution = "random_seed_offset".

In case you do not use horovod_dataset_distribution = "random_seed_offset", we recommend to use HDFDataset (with cache_size = 0) in your config. You can use tools/ to convert any dataset into a HDF dataset.

You should use horovod_reduce_type = "param" and either horovod_param_sync_step = 100 or horovod_param_sync_time_diff = 100..

Single node, multiple GPUs

Example SGE qsub parameters:

-hard -l h_vmem=32G -l h_rt=80:00:00 -l gpu=4 -l qname='*1080*|*TITAN*' -l num_proc=8

Example MPI run:

mpirun -np 4 \
    -bind-to none -map-by slot \
    -mca pml ob1 -mca btl ^openib \
    python3 returnn/ ++use_horovod 1

Multiple nodes

Example SGE qsub parameters:

-hard -l h_vmem=15G -l h_rt=80:00:00 -l gpu=1 -l qname='*1080*|*TITAN*' -l num_proc=4 -pe mpi 8

You might need to fix your SSH settings:

Host cluster-*
    TCPKeepAlive yes
    ForwardAgent yes
    ForwardX11 yes
    Compression yes
    StrictHostKeyChecking no
    HashKnownHosts no

MPI run:

mpirun -np 8 \
    -bind-to none -map-by slot \
    -mca pml ob1 -mca btl ^openib \
    python3 returnn/ ++use_horovod 1

For testing, you might also try (via mpirun):

python3 returnn/demos/

Debugging / profiling / benchmarking

As a starting point, please refer to the Horovod documentation. E.g. the Horovod timeline feature might be helpful.

In some cases, the dataset can be a bottleneck (unless you use horovod_dataset_distribution = "random_seed_offset"). If that is the case, try to use HDFDataset. Look at this output at the end of an epoch:

train epoch 1, finished after 2941 steps, 0:28:58 elapsed (99.3% computing time)

Look at the computing time in particular. That numbers measures how much relative time was spend inside TF If this is below 90% or so, it means that you wasted some time elsewhere, e.g. the dataset loading.

Then, refer to the TensorFlow documentation about how to do basic benchmarking / profiling. E.g. the timeline feature might be helpful.

Also look through some of the reported RETURNN issues, e.g. issue #73.