- class returnn.datasets.multi_proc.MultiProcDataset(dataset: Dict[str, Any], num_workers: int, buffer_size: int, _meta_info_cache: Dict[str, Any] | None = None, **kwargs)
Dataset which uses multi-processing to load the data from another dataset.
To get deterministic behavior, it will use round-robin scheduling.
There is one process just for generating the sequence order, i.e. the list of sequences. Then there are
num_workers processes, each of which loads the data for its shard of the sequences. This means that one epoch (or subepoch) is exactly the same as in the original dataset.
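The round-robin idea described above can be illustrated with a toy sketch. This is not the actual implementation: the helper names `shard_round_robin` and `merge_round_robin` are hypothetical, and the sketch only assumes that position i of the sequence order goes to worker i modulo num_workers and that results are read back from the workers in the same fixed rotation.

```python
from typing import List


def shard_round_robin(seq_order: List[int], num_workers: int) -> List[List[int]]:
    """Assign position i of the sequence order to worker i % num_workers (assumed scheme)."""
    return [seq_order[w::num_workers] for w in range(num_workers)]


def merge_round_robin(shards: List[List[int]]) -> List[int]:
    """Take one sequence from each worker in turn, which restores the original order."""
    merged = []
    longest = max((len(shard) for shard in shards), default=0)
    for i in range(longest):
        for shard in shards:
            if i < len(shard):
                merged.append(shard[i])
    return merged


# Sequence order as produced by the single ordering process (made-up indices).
seq_order = [7, 2, 9, 4, 0, 5, 1]
shards = shard_round_robin(seq_order, num_workers=3)
merged = merge_round_robin(shards)
assert merged == seq_order
```

Because the rotation is fixed, the merged order does not depend on which worker happens to finish first, which is what makes the result deterministic.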
dataset – the dataset to use
num_workers – number of workers to use
buffer_size – buffer size for each worker, i.e. the number of sequences to prefetch
_meta_info_cache – for internal use
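A minimal sketch of how these parameters might appear in a config, assuming the usual convention of describing a dataset as a dict with a "class" key. The wrapped dataset (HDFDataset with a placeholder file path) and the numeric values are illustrative assumptions, not recommendations.

```python
# Hypothetical wrapped dataset: the class choice and the file path are placeholders.
wrapped_dataset = {
    "class": "HDFDataset",
    "files": ["/path/to/train.hdf"],
}

# MultiProcDataset options, using the parameter names from the signature above.
train = {
    "class": "MultiProcDataset",
    "dataset": wrapped_dataset,  # the dataset to load via worker processes
    "num_workers": 4,  # number of worker processes (illustrative value)
    "buffer_size": 10,  # sequences to prefetch per worker (illustrative value)
}
```

The wrapped dataset is passed as a dict rather than as an instance, matching the `Dict[str, Any]` type of the `dataset` argument.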
- init_seq_order(epoch=None, seq_list=None, seq_order=None)
seq_list (list[str]|None) – List of sequence tags, to set a predefined order.
seq_order (list[int]|None) – List of corpus sequence indices, to set a predefined order. Only possible if the dataset has such indices (see self.have_corpus_seq_idx()).
- Return type:
bool
- Returns:
whether the order changed (True is always safe to return)
- property num_seqs: int
- get_total_num_seqs() → int
Total number of sequences.