returnn.datasets.multi_proc

Multi-processing dataset

class returnn.datasets.multi_proc.MultiProcDataset(dataset: Dict[str, Any], num_workers: int, buffer_size: int, sharding_method: str = 'seq_order', _meta_info_cache: Dict[str, Any] | None = None, **kwargs)

Dataset which uses multi-processing to load the data from another dataset.

To get deterministic behavior, it will use round-robin scheduling.

One process is dedicated to generating the sequence order, i.e. the list of sequences. In addition, num_workers worker processes each load the data for their shard of those sequences. As a result, one epoch (or subepoch) is exactly the same as in the original dataset.
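The following is a minimal, self-contained sketch of why round-robin sharding keeps the result deterministic. It is not the RETURNN implementation: the helper names shard_round_robin and merge_round_robin are hypothetical, and the strided split is an assumption made for illustration. It only shows that reading the shards back in round-robin order reproduces the original sequence order.

```python
from itertools import zip_longest
from typing import List


def shard_round_robin(seq_order: List[int], num_workers: int) -> List[List[int]]:
    """Split a sequence order into one strided shard per worker (illustrative assumption)."""
    return [seq_order[i::num_workers] for i in range(num_workers)]


def merge_round_robin(shards: List[List[int]]) -> List[int]:
    """Read one item from each shard in turn, skipping shards that are exhausted."""
    missing = object()  # sentinel, so that valid entries such as 0 are not dropped
    merged: List[int] = []
    for group in zip_longest(*shards, fillvalue=missing):
        merged.extend(item for item in group if item is not missing)
    return merged


# Example sequence order, as it might be produced by the sequence-order process.
seq_order = [4, 0, 3, 1, 2, 6, 5]
shards = shard_round_robin(seq_order, num_workers=3)

# Merging in round-robin order restores exactly the original order.
assert merge_round_robin(shards) == seq_order
```

Because the consumer always visits the workers in the same fixed order, the merged stream does not depend on which worker happens to finish loading first.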

Parameters:
  • dataset – the dataset to use

  • num_workers – number of workers to use

  • buffer_size – buffer size for each worker, number of seqs to prefetch

  • sharding_method – which method to use for sharding the data across the worker processes. The default is seq_order, which fetches the full list of sequence indices and then distributes shards of it to the workers. Can also be set to dedicated to enable a worker-index-based sharding method, which is compatible with more types of datasets, in particular those that do not know their total number of segments upfront.

  • _meta_info_cache – for internal use
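As an illustration of the parameters above, a dataset configuration might look like the following sketch. It assumes the usual RETURNN convention of describing a dataset as a dict whose "class" key selects the dataset class. The wrapped dataset, its file path, and the numeric values are placeholders, not recommendations.

```python
# Hypothetical wrapped dataset: the class and file path are placeholders
# and should be replaced by whatever dataset is actually being loaded.
inner_dataset = {
    "class": "HDFDataset",
    "files": ["data/train.hdf"],
}

# MultiProcDataset wraps the dataset above and loads it in worker processes.
train = {
    "class": "MultiProcDataset",
    "dataset": inner_dataset,
    "num_workers": 4,  # number of worker processes (illustrative value)
    "buffer_size": 10,  # seqs to prefetch per worker (illustrative value)
    "sharding_method": "seq_order",  # the documented default
}

# Variant using worker-index-based sharding, e.g. for a wrapped dataset
# that does not know its total number of segments upfront.
train_dedicated = {**train, "sharding_method": "dedicated"}
```

The seq_order variant relies on the full list of sequence indices being available, so dedicated is the option to try when the wrapped dataset cannot provide it.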

initialize()

init

init_seq_order(epoch=None, seq_list=None, seq_order=None)
Parameters:
  • seq_list (list[str]|None) – List of sequence tags, to set a predefined order.

  • seq_order (list[int]|None) – List of corpus sequence indices, to set a predefined order. Only possible if the dataset has such indices (see self.have_corpus_seq_idx()).

Return type:

bool

Returns:

whether the order changed (True is always safe to return)

get_total_num_seqs(*, fast: bool = False) → int

total num seqs

get_all_tags()

all tags

finish_epoch(*, free_resources: bool = False)

finish epoch

get_data_keys() → List[str]

data keys

get_data_dtype(key: str) → str
Returns:

dtype of key

is_data_sparse(key: str) → bool
Returns:

whether key is sparse