returnn.tf.horovod

Here we encapsulate some common Horovod functions.

Note that this module is importable even if Horovod is not installed.

The usage of this module / global context is also considered optional at this point. Horovod is enabled if and only if use_horovod is enabled in the config.

For relevant further config options, see the code of HorovodContext below. Most importantly (see the example config after this list):

  • horovod_dataset_distribution, recommended value "random_seed_offset", default value "shard"

  • horovod_reduce_type, recommended value "param", default value "grad"

  • horovod_param_sync_step, recommended value 100, default value 1

  • horovod_param_sync_time_diff, alternative to horovod_param_sync_step, e.g. 100.0 (seconds), default None
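A minimal sketch of these settings in a RETURNN config (a config file is plain Python; the values follow the recommendations above):

    use_horovod = True
    horovod_dataset_distribution = "random_seed_offset"  # default: "shard"
    horovod_reduce_type = "param"  # default: "grad"
    horovod_param_sync_step = 100  # default: 1
    # Alternative to horovod_param_sync_step, sync by elapsed time instead:
    # horovod_param_sync_time_diff = 100.0  # seconds, default: None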

Also see multi_gpu and TFDistributed.

class returnn.tf.horovod.HorovodContext(config)[source]

This sets up some helper functions.

Parameters:

config (Config)

should_sync_every_step()[source]
Returns:

whether we should sync every step. This covers both the signal for more data and the loss/error/score reduction.

Return type:

bool

get_reduce_type()[source]
Return type:

str

is_reduce_type_grad()[source]
Return type:

bool

is_reduce_type_param()[source]
Return type:

bool
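To illustrate the difference between the two reduce types, here is a conceptual sketch (not RETURNN's actual implementation; it assumes Horovod is installed and uses TF1-style graph mode):

    import tensorflow as tf
    import horovod.tensorflow as hvd

    tf1 = tf.compat.v1
    tf1.disable_eager_execution()  # sketch assumes TF1-style graph mode
    hvd.init()

    var = tf1.get_variable("w", initializer=1.0)
    loss = var * var
    (grad,) = tf1.gradients(loss, [var])

    # reduce type "grad": average the gradients across all workers on every
    # step, then apply the averaged gradient.
    avg_grad = hvd.allreduce(grad, average=True)

    # reduce type "param": workers train independently; only periodically
    # (every horovod_param_sync_step steps, or horovod_param_sync_time_diff
    # seconds) the parameters themselves are averaged across workers.
    param_sync_op = tf1.assign(var, hvd.allreduce(var, average=True))

With "param", workers avoid the per-step communication cost of gradient averaging, which is why the recommended horovod_param_sync_step is much larger than the default of 1.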

get_param_sync_time_diff()[source]
Return type:

float|None

get_param_sync_step()[source]
Return type:

int

get_dataset_distribution_type()[source]
Return type:

str

is_dataset_distribution_shard()[source]
Return type:

bool

get_dataset_shard_batch_slice()[source]
Return type:

slice
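The exact form of the slice is not documented here; presumably it is something like slice(rank(), None, size()), i.e. each worker takes every size()-th batch starting at its own rank (an assumption for illustration):

    # Hypothetical illustration of a shard batch slice, assuming the form
    # slice(rank, None, size): here rank 1 out of 4 workers.
    batches = list(range(10))
    shard_slice = slice(1, None, 4)
    print(batches[shard_slice])  # -> [1, 5, 9]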

is_dataset_distribution_random_seed_offset()[source]
Return type:

bool

rank()[source]
Return type:

int

size()[source]
Return type:

int

local_rank()[source]
Return type:

int

local_size()[source]
Return type:

int
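These wrap Horovod's own rank/size notions. For example, with 2 hosts running 4 Horovod processes each (8 in total), size() is 8 and rank() ranges over 0..7 globally, while local_size() is 4 and local_rank() ranges over 0..3 within each host.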

returnn.tf.horovod.get_ctx(config=None)[source]
Parameters:

config (Config|None)

Returns:

the global context if Horovod is enabled, or None otherwise. If the context was not set up yet, it is created automatically.

Return type:

HorovodContext|None
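A minimal usage sketch (assuming Horovod is installed and the process was started via an MPI launcher such as mpirun; Config is constructed and filled via update() here, which is one way to set it up programmatically):

    from returnn.config import Config
    from returnn.tf.horovod import get_ctx

    config = Config()
    config.update({"use_horovod": True, "horovod_reduce_type": "param"})

    ctx = get_ctx(config=config)  # returns None if use_horovod is not enabled
    if ctx:
        print("Horovod rank %i of %i" % (ctx.rank(), ctx.size()))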