returnn.tf.horovod#

Here we encapsulate some common Horovod functions.

Note that this module is meant to be importable even if Horovod is not installed.

The usage of this module / global context is also considered optional at this point. Horovod is enabled if and only if use_horovod is enabled in the config.

For further relevant config options, see the code of HorovodContext below. Most importantly (a config sketch follows this list):

  • horovod_dataset_distribution, recommended value "random_seed_offset", default value "shard"

  • horovod_reduce_type, recommended value "param", default value "grad"

  • horovod_param_sync_step, recommended value 100, default value 1

  • horovod_param_sync_time_diff, alternative to horovod_param_sync_step, e.g. 100 (seconds), default None
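
A minimal config sketch using the recommended values above (RETURNN configs are Python files; the surrounding entries of a real config are omitted here):

    # Horovod options from this section, with the recommended values.
    use_horovod = True
    horovod_dataset_distribution = "random_seed_offset"
    horovod_reduce_type = "param"
    horovod_param_sync_step = 100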

Also see multi_gpu and TFDistributed.

class returnn.tf.horovod.HorovodContext(config)[source]#

This sets up some helper functions.

Parameters:

config (Config) –

should_sync_every_step()[source]#
Returns:

whether we should sync every step. This covers both the signal for more data and the loss/error/score reduction.

Return type:

bool

get_reduce_type()[source]#
Return type:

str

is_reduce_type_grad()[source]#
Return type:

bool

is_reduce_type_param()[source]#
Return type:

bool

get_param_sync_time_diff()[source]#
Return type:

float|None

get_param_sync_step()[source]#
Return type:

int
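
A sketch of how horovod_param_sync_step and horovod_param_sync_time_diff might drive the param-sync decision (a hypothetical helper for illustration, not part of this API; assumes the "param" reduce type):

    import time

    class ParamSyncTimer:  # hypothetical helper, not part of RETURNN
        def __init__(self, ctx):
            self.ctx = ctx  # HorovodContext
            self.last_sync_time = time.time()

        def should_sync_params(self, step):
            time_diff = self.ctx.get_param_sync_time_diff()
            if time_diff is not None:
                # Time-based sync via horovod_param_sync_time_diff (seconds).
                if time.time() - self.last_sync_time >= time_diff:
                    self.last_sync_time = time.time()
                    return True
                return False
            # Step-based sync via horovod_param_sync_step.
            return step % self.ctx.get_param_sync_step() == 0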

get_dataset_distribution_type()[source]#
Return type:

str

is_dataset_distribution_shard()[source]#
Return type:

bool

get_dataset_shard_batch_slice()[source]#
Return type:

slice
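
For illustration, the slice selects this worker's shard from a sequence of batches; a plausible form is slice(rank(), None, size()), i.e. every size()-th batch starting at the worker's rank (an assumption here, check the source):

    from returnn.tf.horovod import get_ctx

    # Sketch: keep only this worker's shard of the batches.
    ctx = get_ctx()
    assert ctx and ctx.is_dataset_distribution_shard()
    all_batches = list(range(100))  # stand-in for a list of batches
    my_batches = all_batches[ctx.get_dataset_shard_batch_slice()]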

is_dataset_distribution_random_seed_offset()[source]#
Return type:

bool

rank()[source]#
Return type:

int

size()[source]#
Return type:

int

local_rank()[source]#
Return type:

int

local_size()[source]#
Return type:

int
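
A common use of local_rank() is to pin each process on a host to one GPU. A sketch using the TF2-style device API (one way among several; not how RETURNN itself necessarily does it):

    import tensorflow as tf
    from returnn.tf.horovod import get_ctx

    ctx = get_ctx()
    if ctx:
        gpus = tf.config.experimental.list_physical_devices("GPU")
        if gpus:
            # One GPU per process on each host, indexed by local rank.
            tf.config.experimental.set_visible_devices(gpus[ctx.local_rank()], "GPU")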

returnn.tf.horovod.get_ctx(config=None)[source]#
Parameters:

config (Config|None) –

Returns:

the global context if Horovod is enabled, or None otherwise. If the context was not set up yet, it is created automatically.

Return type:

HorovodContext|None
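
Typical usage (a sketch; get_ctx() returns None when use_horovod is disabled, hence the check below):

    from returnn.tf.horovod import get_ctx

    ctx = get_ctx(config=config)  # config: your returnn.config.Config instance
    if ctx:
        print("Horovod: rank %i of %i" % (ctx.rank(), ctx.size()))
        if ctx.is_reduce_type_param():
            # Params are synced periodically instead of reducing grads each step.
            print("param sync every %i steps" % ctx.get_param_sync_step())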