returnn.datasets.multi_proc

Multi-processing dataset

class returnn.datasets.multi_proc.MultiProcDataset(dataset: Dict[str, Any], num_workers: int, buffer_size: int, _meta_info_cache: Dict[str, Any] | None = None, **kwargs)

Dataset which uses multi-processing to load the data from another dataset.

To get deterministic behavior, it will use round-robin scheduling.

One process is dedicated to generating the sequence order, i.e. the list of sequences. In addition, there are num_workers worker processes, each of which loads the data for its shard of the sequences. As a result, one epoch (or subepoch) is exactly the same as with the original dataset.
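The sharding idea described above can be sketched in plain Python. This is only an illustration and not RETURNN's actual implementation: the helper names `shard_round_robin` and `merge_round_robin` are hypothetical, and the assumption that worker `i` receives every `num_workers`-th sequence starting at position `i` is ours. It shows why reading from the workers in round-robin fashion can reproduce the original sequence order deterministically.

```python
from typing import List


def shard_round_robin(seq_order: List[int], num_workers: int) -> List[List[int]]:
    """Split a sequence order into one shard per worker (hypothetical helper).

    Worker i gets the entries at positions i, i + num_workers, i + 2 * num_workers, ...
    """
    return [seq_order[i::num_workers] for i in range(num_workers)]


def merge_round_robin(shards: List[List[int]]) -> List[int]:
    """Read from the shards in round-robin fashion (hypothetical helper).

    Position pos of the merged order comes from worker pos % num_workers,
    which restores the original sequence order.
    """
    num_workers = len(shards)
    total = sum(len(shard) for shard in shards)
    return [shards[pos % num_workers][pos // num_workers] for pos in range(total)]


# Example sequence order, as it might be produced by the sequence-order process.
seq_order = [7, 2, 5, 0, 3, 1, 6, 4]
shards = shard_round_robin(seq_order, num_workers=3)
merged = merge_round_robin(shards)
```

Because the consumer always knows which worker holds the next sequence, the result does not depend on which worker happens to finish loading first.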

Parameters:
  • dataset – the dataset to use

  • num_workers – number of workers to use

  • buffer_size – buffer size for each worker, i.e. the number of seqs to prefetch

  • _meta_info_cache – for internal use
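As a rough illustration of the parameters above, a dataset entry in a RETURNN config might look like the following sketch. The wrapped `HDFDataset` and its file path are placeholders chosen for this example, and the values for `num_workers` and `buffer_size` are arbitrary, not recommendations.

```python
# Sketch of a RETURNN config entry that wraps another dataset in MultiProcDataset.
# The inner dataset dict and the file path are hypothetical placeholders.
train = {
    "class": "MultiProcDataset",
    # The dataset to load from, given as a dict (see the `dataset` parameter).
    "dataset": {
        "class": "HDFDataset",
        "files": ["/path/to/train.hdf"],  # hypothetical path
    },
    # Number of worker processes that load the data.
    "num_workers": 4,
    # Number of seqs each worker prefetches.
    "buffer_size": 10,
}
```

The `_meta_info_cache` argument is omitted here because it is for internal use.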

initialize()

Initializes the dataset.

init_seq_order(epoch=None, seq_list=None, seq_order=None)

Parameters:
  • seq_list (list[str]|None) – List of sequence tags, to set a predefined order.

  • seq_order (list[int]|None) – List of corpus sequence indices, to set a predefined order. Only possible if the dataset has such indices (see self.have_corpus_seq_idx()).

Return type:

bool

Returns:

whether the order changed (True is always safe to return)

property num_seqs: int

Number of sequences in the current epoch (or subepoch).

get_total_num_seqs() → int

Total number of sequences.

finish_epoch(*, free_resources: bool = False)

Finishes the current epoch. If free_resources is True, the dataset may additionally release the resources it holds.