Multi GPU training¶
This is about multi GPU training with the TensorFlow backend.
We currently use Horovod. Please refer to the Horovod documentation. Horovod provides simple TensorFlow ops for allreduce, allgather and broadcast, which will internally use the best available method, i.e. either NCCL for direct GPU transfer (on a single node), or MPI for any kind of transfer, including multiple nodes in a cluster network. Horovod requires that you have a working MPI setup.
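For illustration, these ops can be used directly roughly like this (a minimal sketch assuming Horovod with the TensorFlow backend is installed; this is not RETURNN code):

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()                                    # initialize Horovod (one process per GPU)
t = tf.constant([float(hvd.rank())])          # some per-worker tensor
reduced = hvd.allreduce(t)                    # reduce (average by default) across all workers
gathered = hvd.allgather(t)                   # concatenate the values from all workers
from_rank0 = hvd.broadcast(t, root_rank=0)    # every worker receives rank 0's value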
We also have partial support for distributed TensorFlow, i.e. the official TF way for distributed computation. Please refer to our wiki for an overview of distributed TensorFlow.
If you want to use NCCL, make sure it’s installed and it can be found.
You need an MPI installation. If you are in a cluster environment, you usually have that already. Check that you can run mpirun.
You need to install Horovod. This can usually be done via pip:
pip3 install horovod
For further information, please refer to the Horovod documentation.
RETURNN will try to use Horovod when you specify
use_horovod = True
in your config (or via command line argument).
The implementation in RETURNN is fairly straightforward and mostly follows the Horovod tutorial. Reading that tutorial gives a basic understanding of how it works.
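That pattern looks roughly like this (a TF1-style sketch with a dummy model, not RETURNN's actual code):

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()

# Pin each process to one GPU (one process per GPU).
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Dummy model, just for the sketch.
x = tf.placeholder(tf.float32, shape=(None, 10))
w = tf.get_variable("w", shape=(10, 1))
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))

# Wrap the optimizer so the gradients are averaged (allreduce) across all workers.
opt = tf.train.GradientDescentOptimizer(0.01)
opt = hvd.DistributedOptimizer(opt)
train_op = opt.minimize(loss)

# Broadcast the initial parameters from rank 0 so all workers start identically.
bcast_op = hvd.broadcast_global_variables(0)

with tf.Session(config=config) as session:
    session.run(tf.global_variables_initializer())
    session.run(bcast_op)
    # ... run train_op in a loop, feeding your data ...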
Relevant RETURNN settings¶
use_horovod: bool: should be True to enable Horovod.
horovod_reduce_type: str: one of:
"grad"means that we reduce the gradient after every step, and then use the same summed gradient to update the model in each instance. This is the default.
"param"means that every instance will do an update individually and after some N number of steps, we synchronize the models. This reduces the amount of communication and should increase the speed. Also configure
horovod_param_sync_stepwhen you use this. This is currently the recommended value.
horovod_param_sync_step: int: if the reduce type is param, this specifies after how many update steps the model parameters will be synchronized (i.e. averaged). The default is 1, but the recommended value is 100.
horovod_param_sync_time_diff: float: alternative to horovod_param_sync_step, specified as a time interval in seconds (e.g. 100.). The default is None. This might be more efficient.
horovod_scale_lr: bool: whether to multiply the learning rate by the number of instances (False by default).
horovod_dataset_distribution: str: one of:
"shard": uses sharding for the dataset (via
FeedDictDataProvider) This is the default.
"random_seed_offset": sets the default
random_seed_offsetvia the rank This is currently the recommended value.
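To illustrate the difference between the two reduce types: with "grad", the averaged gradient is exchanged on every step, while with "param", each worker trains on its own and only the parameters are averaged every N steps. A rough conceptual sketch of the "param" behavior (not RETURNN's actual implementation):

import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()
param_sync_step = 100                        # corresponds to horovod_param_sync_step

w = tf.Variable([float(hvd.rank())])         # stand-in for a model parameter
sync_op = tf.assign(w, hvd.allreduce(w))     # allreduce averages across workers by default

with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    for step in range(1000):
        # ... run a normal local training step here ...
        if (step + 1) % param_sync_step == 0:
            session.run(sync_op)             # synchronize (average) the parameters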
You should use a fast dataset implementation, or use horovod_dataset_distribution = "random_seed_offset". In case you do not use horovod_dataset_distribution = "random_seed_offset", we recommend to use HDFDataset (with cache_size = 0) in your config. You can use tools/hdf_dump.py to convert any dataset into an HDF dataset.
You should use horovod_reduce_type = "param" together with horovod_param_sync_step = 100 or horovod_param_sync_time_diff = 100. (seconds).
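Putting these recommendations together, the Horovod-related part of a config could look like this (a sketch; adapt it to your setup):

use_horovod = True
horovod_reduce_type = "param"
horovod_param_sync_step = 100          # or: horovod_param_sync_time_diff = 100.  (seconds)
horovod_dataset_distribution = "random_seed_offset"
# horovod_scale_lr = True              # optional: scale the learning rate by the number of instances

# If you do not use "random_seed_offset", prefer a fast dataset such as an HDF dataset
# (see tools/hdf_dump.py) and set:
# cache_size = 0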
Single node, multiple GPUs¶
Example SGE (qsub) resource request for a single node with 4 GPUs:
-hard -l h_vmem=32G -l h_rt=80:00:00 -l gpu=4 -l qname='*1080*|*TITAN*' -l num_proc=8
Example MPI run:
mpirun -np 4 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x HOROVOD_TIMELINE -x DEBUG \
    -mca pml ob1 -mca btl ^openib \
    python3 returnn/rnn.py returnn-config.py ++use_horovod 1
Multiple nodes¶
Example SGE (qsub) resource request for 8 MPI processes, each with one GPU:
-hard -l h_vmem=15G -l h_rt=80:00:00 -l gpu=1 -l qname='*1080*|*TITAN*' -l num_proc=4 -pe mpi 8
You might need to fix your SSH settings:
Host cluster-*
    TCPKeepAlive yes
    ForwardAgent yes
    ForwardX11 yes
    Compression yes
    StrictHostKeyChecking no
    HashKnownHosts no
Example MPI run:
mpirun -np 8 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -x HOROVOD_TIMELINE -x DEBUG \
    -mca pml ob1 -mca btl ^openib \
    python3 returnn/rnn.py returnn-config.py ++use_horovod 1
For testing, you might also try (via
Debugging / profiling / benchmarking¶
As a starting point, please refer to the Horovod documentation. E.g. the Horovod timeline feature might be helpful.
In some cases, the dataset can be a bottleneck (unless you use horovod_dataset_distribution = "random_seed_offset").
If that is the case, try to use a faster dataset implementation, e.g. HDFDataset (with cache_size = 0), as recommended above.
Look at this output at the end of an epoch:
train epoch 1, finished after 2941 steps, 0:28:58 elapsed (99.3% computing time)
Look at the computing time in particular. That number measures how much of the time (relative) was spent inside TF computation. If this is below 90% or so, it means that time was wasted elsewhere, e.g. on dataset loading.
Then, refer to the TensorFlow documentation about how to do basic benchmarking / profiling. E.g. the timeline feature might be helpful.