returnn.tf.horovod

Here we encapsulate some common Horovod functions.

Note that this module is importable even if Horovod is not installed.

The usage of this module / global context is also considered optional at this point. Horovod is enabled if and only if use_horovod is enabled in the config.

For relevant further config options, see the code of HorovodContext below. Most importantly (see the example config after this list):

  • horovod_dataset_distribution, recommended value "random_seed_offset", default value "shard"

  • horovod_reduce_type, recommended value "param", default value "grad"

  • horovod_param_sync_step, recommended value 100, default value 1

  • horovod_param_sync_time_diff, alternative to horovod_param_sync_step, e.g. 100.0 (seconds), default None
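A minimal sketch of these settings in a RETURNN config (a config file is plain Python; the values follow the recommendations above):

    use_horovod = True
    horovod_dataset_distribution = "random_seed_offset"  # default: "shard"
    horovod_reduce_type = "param"  # default: "grad"
    horovod_param_sync_step = 100  # default: 1
    # Alternative to horovod_param_sync_step, sync by elapsed time instead:
    # horovod_param_sync_time_diff = 100.0  # seconds, default: None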

Also see multi_gpu and TFDistributed.

class returnn.tf.horovod.HorovodContext(config)[source]

This sets up some helper functions.

Parameters:

config (Config)

should_sync_every_step()[source]
Returns:

whether we should sync every step. This covers both the signal for more data and the loss/error/score reduction.

Return type:

bool

get_reduce_type()[source]
Return type:

str

is_reduce_type_grad()[source]
Return type:

bool

is_reduce_type_param()[source]
Return type:

bool
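To illustrate the difference between the two reduce types, here is a conceptual sketch (not RETURNN's actual implementation; it assumes Horovod is installed and uses TF1-style graph mode):

    import tensorflow as tf
    import horovod.tensorflow as hvd

    tf1 = tf.compat.v1
    tf1.disable_eager_execution()  # sketch assumes TF1-style graph mode
    hvd.init()

    var = tf1.get_variable("w", initializer=1.0)
    loss = var * var
    (grad,) = tf1.gradients(loss, [var])

    # reduce type "grad": average the gradients across all workers on every
    # step, then apply the averaged gradient.
    avg_grad = hvd.allreduce(grad, average=True)

    # reduce type "param": workers train independently; only periodically
    # (every horovod_param_sync_step steps, or horovod_param_sync_time_diff
    # seconds) the parameters themselves are averaged across workers.
    param_sync_op = tf1.assign(var, hvd.allreduce(var, average=True))

With "param", workers avoid the per-step communication cost of gradient averaging, which is why the recommended horovod_param_sync_step is much larger than the default of 1.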

get_param_sync_time_diff()[source]
Return type:

float|None

get_param_sync_step()[source]
Return type:

int

get_dataset_distribution_type()[source]
Return type:

str

is_dataset_distribution_shard()[source]
Return type:

bool

get_dataset_shard_batch_slice()[source]
Return type:

slice
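The exact form of the slice is not documented here; presumably it is something like slice(rank(), None, size()), i.e. each worker takes every size()-th batch starting at its own rank (an assumption for illustration):

    # Hypothetical illustration of a shard batch slice, assuming the form
    # slice(rank, None, size): here rank 1 out of 4 workers.
    batches = list(range(10))
    shard_slice = slice(1, None, 4)
    print(batches[shard_slice])  # -> [1, 5, 9]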

is_dataset_distribution_random_seed_offset()[source]
Return type:

bool

rank()[source]
Return type:

int

size()[source]
Return type:

int

local_rank()[source]
Return type:

int

local_size()[source]
Return type:

int
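These wrap Horovod's own rank/size notions. For example, with 2 hosts running 4 Horovod processes each (8 in total), size() is 8 and rank() ranges over 0..7 globally, while local_size() is 4 and local_rank() ranges over 0..3 within each host.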

returnn.tf.horovod.get_ctx(config=None)[source]
Parameters:

config (Config|None)

Returns:

the global context if Horovod is enabled, or None otherwise. If the context was not set up yet, it is created automatically.

Return type:

HorovodContext|None
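A minimal usage sketch (assuming Horovod is installed and the process was started via an MPI launcher such as mpirun; Config is constructed and filled via update() here, which is one way to set it up programmatically):

    from returnn.config import Config
    from returnn.tf.horovod import get_ctx

    config = Config()
    config.update({"use_horovod": True, "horovod_reduce_type": "param"})

    ctx = get_ctx(config=config)  # returns None if use_horovod is not enabled
    if ctx:
        print("Horovod rank %i of %i" % (ctx.rank(), ctx.size()))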