`returnn.datasets.multi_proc`

Multi-processing dataset

class returnn.datasets.multi_proc.MultiProcDataset(dataset: Dict[str, Any], num_workers: int, buffer_size: int, _meta_info_cache: Dict[str, Any] | None = None, **kwargs)

Dataset which uses multi-processing to load the data from another dataset.

To get deterministic behavior, it will use round-robin scheduling.

There is one process just for generating the sequence order, i.e. the list of sequences. Then there are num_workers processes, each of which loads the data for its shard of the sequences. As a result, one epoch (or subepoch) is exactly the same as in the original dataset.
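The following is a minimal, standalone sketch of why round-robin scheduling keeps the order deterministic. It does not use RETURNN and is not the actual implementation: the helper names `shard_round_robin` and `merge_round_robin` are hypothetical, and the assumption that worker `i` receives every `num_workers`-th sequence starting at position `i` is made for illustration only.

```python
from typing import List


def shard_round_robin(seq_order: List[int], num_workers: int) -> List[List[int]]:
    """Split a sequence order into one shard per worker (hypothetical helper).

    Assumption: worker i gets the entries at positions i, i + num_workers, ...
    """
    return [seq_order[i::num_workers] for i in range(num_workers)]


def merge_round_robin(shards: List[List[int]]) -> List[int]:
    """Read one sequence from each worker in turn until a worker runs dry."""
    iters = [iter(shard) for shard in shards]
    merged = []
    pos = 0
    while True:
        try:
            # Position pos of the merged stream comes from worker pos % num_workers.
            merged.append(next(iters[pos % len(iters)]))
        except StopIteration:
            break
        pos += 1
    return merged


seq_order = [7, 3, 5, 0, 2, 9, 1]  # order produced by the sequence-order process
shards = shard_round_robin(seq_order, num_workers=3)
restored = merge_round_robin(shards)
```

Under these assumptions, reading the workers in a fixed rotation reproduces the original sequence order regardless of how fast each worker is, which is what makes the epoch identical to that of the wrapped dataset.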

Parameters:

dataset – the dataset to wrap, given as an options dict
num_workers – number of worker processes to use
buffer_size – buffer size for each worker, i.e. the number of seqs to prefetch
_meta_info_cache – for internal use
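As a rough illustration of how these parameters might be passed, the sketch below builds an options dict of the kind used in a RETURNN config. The `"class"` key convention, the inner `HDFDataset` entry, and the file name `train.hdf` are assumptions for this example and are not stated on this page; only `dataset`, `num_workers` and `buffer_size` come from the signature above.

```python
# Hypothetical inner dataset options; the file name is a placeholder.
inner_dataset = {
    "class": "HDFDataset",
    "files": ["train.hdf"],
}

# Options for the wrapping dataset, using the parameters documented above.
train = {
    "class": "MultiProcDataset",
    "dataset": inner_dataset,  # dict describing the dataset to wrap
    "num_workers": 4,          # number of worker processes
    "buffer_size": 10,         # seqs to prefetch per worker
}
```

The values 4 and 10 are arbitrary; suitable settings depend on how expensive loading a single sequence is.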

initialize()

init

init_seq_order(epoch=None, seq_list=None, seq_order=None)

Parameters:

seq_list (list[str]|None) – List of sequence tags, to set a predefined order.
seq_order (list[int]|None) – List of corpus sequence indices, to set a predefined order. Only possible if the dataset has such indices (see self.have_corpus_seq_idx()).

Returns:

whether the order changed (True is always safe to return)

Return type:

bool

property num_seqs: int

num seqs

get_total_num_seqs() → int

total num seqs

finish_epoch(*, free_resources: bool = False)

finish epoch
