returnn.torch.distributed#

torch.distributed utils

class returnn.torch.distributed.DistributedContext(options: Dict[str, Any])[source]#

This class sets up some helper functions for Torch distributed training

local_rank() int[source]#

local rank, i.e. the rank of this process within its node

local_size() int[source]#

local size, i.e. the number of processes on this node

rank() int[source]#

global rank, i.e. the rank of this process across all nodes

size() int[source]#

global size, i.e. the total number of processes across all nodes
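
As a worked illustration (a minimal sketch, not from the RETURNN sources; get_ctx() is documented below): for a hypothetical run with 2 nodes and 4 processes per node, size() would be 8, local_size() 4, rank() in 0..7 and local_rank() in 0..3.

    from returnn.torch.distributed import get_ctx

    ctx = get_ctx()
    if ctx is not None:
        assert 0 <= ctx.rank() < ctx.size()              # rank across all processes
        assert 0 <= ctx.local_rank() < ctx.local_size()  # rank within this node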

returnn.torch.distributed.get_ctx(config=None)[source]#
Parameters:

config (Config|None) –

Returns:

the global context if Torch distributed is enabled, or None otherwise. If the context has not been set up yet, it will be created automatically.

Return type:

DistributedContext|None
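
A minimal usage sketch (not from the RETURNN sources): query the global context and fall back to single-process behavior if Torch distributed is not enabled.

    from returnn.torch.distributed import get_ctx

    ctx = get_ctx(config=None)
    if ctx is None:
        print("single-process training")
    else:
        print("distributed training, rank %i of %i" % (ctx.rank(), ctx.size()))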

returnn.torch.distributed.get_device_ids()[source]#

What to return here depends on the specific setup, e.g. how CUDA_VISIBLE_DEVICES is set. The current behavior is a reasonable assumption, but the logic might be extended later, or made configurable.
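
A hedged sketch of typical usage (assuming an initialized default process group, e.g. via torch.distributed.init_process_group; the Linear model is just a placeholder): the returned device ids are the kind of value one would pass to torch.nn.parallel.DistributedDataParallel.

    import torch
    from returnn.torch.distributed import get_device_ids, get_local_rank

    torch.cuda.set_device(get_local_rank())  # pin this process to its GPU, see get_local_rank() below
    net = torch.nn.Linear(8, 8).cuda()       # placeholder model, on this process's GPU
    ddp_net = torch.nn.parallel.DistributedDataParallel(net, device_ids=get_device_ids())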

returnn.torch.distributed.get_local_rank()[source]#

torch.distributed does not seem to provide a function for this. When launched via mpirun (OpenMPI), the corresponding environment variable is set; otherwise this fails with an error.
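
A minimal sketch, assuming the launcher (e.g. mpirun via OpenMPI) exported the expected environment variable; a common pattern is to select this process's GPU from its local rank.

    import torch
    from returnn.torch.distributed import get_local_rank

    device = torch.device("cuda", get_local_rank())
    model = torch.nn.Linear(8, 8).to(device)  # placeholder model, moved to this process's GPU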