Datasets

All datasets in RETURNN are based on Dataset.Dataset, and most are also based on either CachedDataset.CachedDataset or CachedDataset2.CachedDataset2. The common parameters that can be used across most datasets are:

  • partition_epoch: split the data into this many parts, so that a single epoch only covers one part
  • seq_ordering: define the sequence ordering of the data.

Possible values for the sequence ordering are:

  • default: Keep the sequences as is
  • reverse: Use the default sequences in reversed order
  • random: Shuffle the data with a predefined fixed seed
  • random:<seed>: Shuffle the data with the seed given
  • sorted: Sort by length (only if available), beginning with shortest sequences
  • sorted_reverse: Sort by length, beginning with longest sequences
  • laplace:<n_buckets>: Shuffle the data, split it into n_buckets buckets and sort by sequence length within each bucket; every second bucket is sorted in reverse order.
  • laplace:.<n_sequences>: As above, but the number of buckets is chosen such that each bucket contains roughly n_sequences sequences (note the leading dot).
  • laplace:<n_buckets>:<seed>: A seed can be provided for both laplace variants, separated by an additional colon.

Note that not all sequence order modes are available for all datasets, and some datasets may provide additional modes. For details on the different sequence orderings, have a look at Dataset.Dataset.get_seq_order_for_epoch(). Also check the sequence ordering possibilities offered by the MetaDataset.
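For illustration, these options are set directly in the dataset dictionary of the config. A minimal sketch, assuming an HDFDataset; the file name and the concrete values are placeholders:

    train = {
        "class": "HDFDataset",
        "files": ["train-data.hdf"],      # hypothetical file name
        "partition_epoch": 5,             # each epoch covers one fifth of the data
        "seq_ordering": "laplace:.1000",  # buckets of roughly 1000 sequences, sorted by length
    }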

class Dataset.Dataset(name=None, window=1, context_window=None, chunking=None, seq_ordering='default', random_seed_offset=None, partition_epoch=None, repeat_epoch=None, seq_list_filter_file=None, unique_seq_tags=False, seq_order_seq_lens_file=None, shuffle_frames_of_nseqs=0, min_chunk_size=0, chunking_variance=0, estimated_num_seqs=None)[source]

Bases: object

Base class for any dataset. This defines the dataset API.

Parameters:
  • name (str) – e.g. “train” or “eval”
  • window (int) – features will be of dimension window * feature_dim, as we add a context window around each frame. Not all datasets support this option.
  • context_window (None|int|dict|NumbersDict|(dict,dict)) – will add this context for each chunk
  • chunking (None|str|int|(int,int)|dict|(dict,dict)) – “chunk_size:chunk_step”, e.g. “100:50” (see the sketch after this parameter list)
  • seq_ordering (str) – “batching”-option in config. e.g. “default”, “sorted” or “random”. See self.get_seq_order_for_epoch() for more details.
  • random_seed_offset (int|None) –
  • partition_epoch (int|None) –
  • repeat_epoch (int|None) – Repeat the sequences in an epoch this many times. Useful to scale the dataset relative to other datasets, e.g. when used in CombinedDataset. Not allowed to be used in combination with partition_epoch.
  • seq_list_filter_file (str|None) – defines a subset of sequences (by tag) to use
  • unique_seq_tags (bool) – uniquify seqs with same seq tags in seq order
  • seq_order_seq_lens_file (str|None) – for seq order, use the seq length given by this file
  • shuffle_frames_of_nseqs (int) – shuffles the frames. Not always supported.
  • estimated_num_seqs (None|int) – for progress reporting in case the real num_seqs is unknown
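As a sketch of how some of these base-class options can be passed via the dataset dictionary in the config (the dataset class, file name and values are placeholders):

    train = {
        "class": "HDFDataset",        # any dataset class accepting these base options
        "files": ["train-data.hdf"],  # hypothetical file name
        "window": 3,                  # stack 3 neighboring frames, feature dim becomes 3 * feature_dim
        "chunking": "100:50",         # "chunk_size:chunk_step": chunks of 100 frames, step 50
        "seq_ordering": "random",
    }
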
class CachedDataset.CachedDataset(cache_byte_size=0, **kwargs)[source]

Bases: returnn.datasets.basic.Dataset

Parameters:
  • cache_byte_size (int) –

class CachedDataset2.CachedDataset2(**kwargs)[source]

Bases: returnn.datasets.basic.Dataset

Somewhat like CachedDataset, but different: simpler in some sense, and more generic. Caching might be worse.

If you derive from this class:

  • you must override _collect_single_seq
  • you must set num_inputs (dense-dim of “data” key) and num_outputs (dict key -> dim, ndim-1)
  • you should set labels
  • handle seq ordering by overriding init_seq_order
  • you can set _estimated_num_seqs
  • you can set _num_seqs or _num_timesteps if you know them in advance
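A minimal sketch of such a derived dataset, generating random frames on the fly. It assumes the current module layout (returnn.datasets.cached2 and returnn.datasets.basic); the class name, shapes and default values are hypothetical:

    import numpy

    from returnn.datasets.basic import DatasetSeq
    from returnn.datasets.cached2 import CachedDataset2


    class RandomFramesDataset(CachedDataset2):
        """Hypothetical example dataset with random frames, only to illustrate the API."""

        def __init__(self, num_seqs=10, seq_len=20, feature_dim=5, num_classes=3, **kwargs):
            super().__init__(**kwargs)
            self._total_num_seqs = num_seqs
            self._seq_len = seq_len
            self.num_inputs = feature_dim  # dense dim of the "data" key
            self.num_outputs = {"data": (feature_dim, 2), "classes": (num_classes, 1)}
            self.labels = {"classes": [str(i) for i in range(num_classes)]}

        def init_seq_order(self, epoch=None, seq_list=None, seq_order=None):
            # Keep the default order here; a real dataset would apply self.seq_ordering.
            super().init_seq_order(epoch=epoch, seq_list=seq_list, seq_order=seq_order)
            self._num_seqs = self._total_num_seqs
            return True

        def _collect_single_seq(self, seq_idx):
            # Return a DatasetSeq for the given index, or None if out of range.
            if seq_idx >= self._total_num_seqs:
                return None
            rnd = numpy.random.RandomState(seq_idx)
            data = rnd.normal(size=(self._seq_len, self.num_inputs)).astype("float32")
            classes = rnd.randint(0, self.num_outputs["classes"][0], size=(self._seq_len,)).astype("int32")
            return DatasetSeq(seq_idx=seq_idx, features={"data": data, "classes": classes})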