Datasets

All datasets in RETURNN are based on returnn.datasets.Dataset, and most are also based on either returnn.datasets.cached.CachedDataset or returnn.datasets.cached2.CachedDataset2. The common parameters that can be used across most datasets are:

  • partition_epoch: split the data into smaller parts per epoch

  • seq_ordering: define the sequence ordering of the data.

Possible values for the sequence ordering are:

  • default: Keep the sequences as is

  • reverse: Use the default sequences in reversed order

  • random: Shuffle the data with a predefined fixed seed

  • random:<seed>: Shuffle the data with the seed given

  • sorted: Sort by length (only if available), beginning with shortest sequences

  • sorted_reverse: Sort by length, beginning with longest sequences

  • laplace:<n_buckets>: Shuffle the data, split it into n_buckets bins, and sort by length within each bin; every second bin is sorted in reverse.

  • laplace:.<n_sequences>: As above, but the number of bins is chosen such that each bin contains roughly n sequences.

  • laplace:<n_buckets>:<seed>: A seed can be provided for both laplace versions, separated by an additional colon.

Note that not all sequence order modes are available for all datasets, and some datasets may provide additional modes. For details on the different sequence orderings, have a look at Dataset.get_seq_order_for_epoch(). Also check the sequence ordering possibilities with the MetaDataset.
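The laplace ordering described above can be illustrated with a minimal sketch. This is not RETURNN's implementation (see Dataset.get_seq_order_for_epoch() for that); the function name laplace_order, its parameters, and the way the shuffled data is split into bins are assumptions made for illustration only.

```python
import random


def laplace_order(seq_lens, n_buckets, seed=1):
    """Illustrative sketch: return sequence indices in a laplace-like order.

    seq_lens: list of sequence lengths, one per sequence (hypothetical input).
    n_buckets: number of bins to split the shuffled data into.
    seed: seed for the shuffle, so the order is reproducible.
    """
    if n_buckets < 1:
        raise ValueError("n_buckets must be >= 1")
    rng = random.Random(seed)
    indices = list(range(len(seq_lens)))
    rng.shuffle(indices)  # shuffle first, so bins contain random sequences
    bucket_size = -(-len(indices) // n_buckets)  # ceiling division
    order = []
    for b in range(n_buckets):
        bucket = indices[b * bucket_size:(b + 1) * bucket_size]
        # Sort by length within the bin; every second bin is reversed,
        # so lengths rise and fall smoothly across bin boundaries.
        bucket.sort(key=lambda i: seq_lens[i], reverse=(b % 2 == 1))
        order.extend(bucket)
    return order


seq_lens = [5, 1, 9, 3, 7, 2, 8, 4]
order = laplace_order(seq_lens, n_buckets=2, seed=42)
print([seq_lens[i] for i in order])  # ascending in the first bin, descending in the second
```

Alternating the sort direction avoids a sharp jump from the longest sequence of one bin to the shortest sequence of the next, while the initial shuffle keeps some randomness across epochs.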

class returnn.datasets.Dataset(name=None, window=1, context_window=None, chunking=None, seq_ordering='default', fixed_random_seed=None, random_seed_offset=None, partition_epoch=None, repeat_epoch=None, seq_list_filter_file=None, unique_seq_tags=False, seq_order_seq_lens_file=None, shuffle_frames_of_nseqs=0, min_chunk_size=0, chunking_variance=0, estimated_num_seqs=None)

Bases: object

Base class for any dataset. This defines the dataset API.

Parameters:
  • name (str) – e.g. “train” or “eval”

  • window (int) – features will be of dimension window * feature_dim, as we add a context-window around. not all datasets support this option.

  • context_window (None|int|dict|NumbersDict|(dict,dict)) – will add this context for each chunk

  • chunking (None|str|int|(int,int)|dict|(dict,dict)|function) – “chunk_size:chunk_step”

  • seq_ordering (str) – “batching”-option in config. e.g. “default”, “sorted” or “random”. See self.get_seq_order_for_epoch() for more details.

  • fixed_random_seed (int|None) – seed for the shuffling, e.g. for seq_ordering='random'. If not set, the epoch is used as the seed. Useful when used as eval dataset.

  • random_seed_offset (int|None) – for shuffling, e.g. for seq_ordering='random'. Ignored when fixed_random_seed is set.

  • partition_epoch (int|None) – split the data into this many smaller parts per epoch

  • repeat_epoch (int|None) – Repeat the sequences in an epoch this many times. Useful to scale the dataset relative to other datasets, e.g. when used in CombinedDataset. Not allowed to be used in combination with partition_epoch.

  • seq_list_filter_file (str|None) – defines a subset of sequences (by tag) to use

  • unique_seq_tags (bool) – uniquify seqs with same seq tags in seq order

  • seq_order_seq_lens_file (str|None) – for seq order, use the seq length given by this file

  • shuffle_frames_of_nseqs (int) – shuffles the frames. not always supported

  • estimated_num_seqs (None|int) – for progress reporting in case the real num_seqs is unknown
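As a rough illustration of how these parameters are typically passed, the sketch below defines two dataset dicts in the style of a RETURNN config. The variable names train and dev, the "class" key, the HDFDataset "files" option, and the file names are assumptions not covered in this section; the remaining keys are parameters documented above.

```python
# Sketch of dataset definitions in a RETURNN-style config (plain Python dicts).
# Assumptions: the "class" key selects the dataset class, HDFDataset accepts
# a "files" list, and the file names are hypothetical placeholders.
train = {
    "class": "HDFDataset",
    "files": ["train.hdf"],  # hypothetical path
    "partition_epoch": 20,  # split the data into 20 parts per epoch
    "seq_ordering": "laplace:.1000",  # bins of roughly 1000 sequences each
}

dev = {
    "class": "HDFDataset",
    "files": ["dev.hdf"],  # hypothetical path
    "seq_ordering": "random",
    "fixed_random_seed": 1,  # same shuffle in every epoch, for comparable eval runs
}

# repeat_epoch must not be combined with partition_epoch (see above).
for dataset in (train, dev):
    assert not ("partition_epoch" in dataset and "repeat_epoch" in dataset)
```

Setting fixed_random_seed on the eval dataset keeps its sequence order independent of the epoch, which makes results from different epochs easier to compare.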

class returnn.datasets.cached.CachedDataset(cache_byte_size=0, **kwargs)

Bases: Dataset

Base class for datasets with caching. This is only used for the HDFDataset. Also see CachedDataset2.

Parameters:

cache_byte_size (int) –

class returnn.datasets.cached2.CachedDataset2(**kwargs)

Bases: Dataset

Similar to CachedDataset, but simpler in some sense and more generic. Caching might be worse.

If you derive from this class:

  • you must override _collect_single_seq

  • you must set num_inputs (dense-dim of “data” key) and num_outputs (dict key -> dim, ndim-1)

  • you should set labels

  • handle seq ordering by overriding init_seq_order

  • you can set _estimated_num_seqs

  • you can set _num_seqs or _num_timesteps if you know them in advance

Parameters:
  • name (str) – e.g. “train” or “eval”

  • window (int) – features will be of dimension window * feature_dim, as we add a context-window around. not all datasets support this option.

  • context_window (None|int|dict|NumbersDict|(dict,dict)) – will add this context for each chunk

  • chunking (None|str|int|(int,int)|dict|(dict,dict)|function) – “chunk_size:chunk_step”

  • seq_ordering (str) – “batching”-option in config. e.g. “default”, “sorted” or “random”. See self.get_seq_order_for_epoch() for more details.

  • fixed_random_seed (int|None) – seed for the shuffling, e.g. for seq_ordering='random'. If not set, the epoch is used as the seed. Useful when used as eval dataset.

  • random_seed_offset (int|None) – for shuffling, e.g. for seq_ordering='random'. Ignored when fixed_random_seed is set.

  • partition_epoch (int|None) – split the data into this many smaller parts per epoch

  • repeat_epoch (int|None) – Repeat the sequences in an epoch this many times. Useful to scale the dataset relative to other datasets, e.g. when used in CombinedDataset. Not allowed to be used in combination with partition_epoch.

  • seq_list_filter_file (str|None) – defines a subset of sequences (by tag) to use

  • unique_seq_tags (bool) – uniquify seqs with same seq tags in seq order

  • seq_order_seq_lens_file (str|None) – for seq order, use the seq length given by this file

  • shuffle_frames_of_nseqs (int) – shuffles the frames. not always supported

  • estimated_num_seqs (None|int) – for progress reporting in case the real num_seqs is unknown