Datasets
All datasets in RETURNN are based on returnn.datasets.Dataset, and most are also based on either returnn.datasets.cached.CachedDataset or returnn.datasets.cached2.CachedDataset2.
The common parameters that can be used across most datasets are:
- partition_epoch: split the data into smaller parts per epoch
- seq_ordering: define the sequence ordering of the data
Possible values for the sequence ordering are:
- default: Keep the sequences as is
- reverse: Use the default sequences in reversed order
- random: Shuffle the data with a predefined fixed seed
- random:<seed>: Shuffle the data with the given seed
- sorted: Sort by length (only if available), beginning with the shortest sequences
- sorted_reverse: Sort by length, beginning with the longest sequences
- laplace:<n_buckets>: Shuffle the data and sort by length within each of n buckets; every second bucket is sorted in reverse
- laplace:.<n_sequences>: As above, but the number of buckets is chosen such that each bucket contains roughly n sequences
- laplace:<n_buckets>:<seed>: A seed can be provided for both laplace variants, separated by an additional colon
Note that not all sequence order modes are available for all datasets,
and some datasets may provide additional modes.
For details on the different sequence orderings, have a look at Dataset.get_seq_order_for_epoch().
Also check the sequence ordering possibilities with the MetaDataset.
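As an illustration, these common options can be set directly in the dataset dictionary of a RETURNN config. The following is a minimal sketch, assuming an HDFDataset; the file name and the concrete values are placeholders:

```python
# Sketch of a dataset definition in a RETURNN config.
# The file name and the numbers are placeholders; adjust them to your data.
train = {
    "class": "HDFDataset",
    "files": ["my-train-data.hdf"],
    # use 1/20 of the data per (sub-)epoch, cycling through all parts over 20 sub-epochs
    "partition_epoch": 20,
    # shuffle, then sort by length within buckets of roughly 1000 sequences each;
    # every second bucket is sorted in reverse (laplace ordering)
    "seq_ordering": "laplace:.1000",
}
```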
- class returnn.datasets.Dataset(name=None, window=1, context_window=None, chunking=None, seq_ordering='default', fixed_random_seed=None, random_seed_offset=None, partition_epoch=None, repeat_epoch=None, seq_list_filter_file=None, unique_seq_tags=False, seq_order_seq_lens_file=None, shuffle_frames_of_nseqs=0, min_chunk_size=0, chunking_variance=0, estimated_num_seqs=None, _num_shards=1, _shard_index=0)
Bases: object
Base class for any dataset. This defines the dataset API.
- Parameters:
name (str) – e.g. “train” or “eval”
window (int) – features will be of dimension window * feature_dim, as we add a context-window around. not all datasets support this option.
context_window (None|int|dict|NumbersDict|(dict,dict)) – will add this context for each chunk
chunking (None|str|int|(int,int)|dict|(dict,dict)|function) – “chunk_size:chunk_step”
seq_ordering (str|function) – “batching”-option in config. e.g. “default”, “sorted” or “random”. See self.get_seq_order_for_epoch() for more details.
fixed_random_seed (int|None) – for the shuffling, e.g. for seq_ordering=’random’. otherwise epoch will be used. useful when used as eval dataset.
random_seed_offset (int|None) – for shuffling, e.g. for seq_ordering=’random’. ignored when fixed_random_seed is set.
partition_epoch (int|None) – split the data into smaller parts per epoch
repeat_epoch (int|None) – Repeat the sequences in an epoch this many times. Useful to scale the dataset relative to other datasets, e.g. when used in CombinedDataset. Not allowed to be used in combination with partition_epoch.
seq_list_filter_file (str|None) – defines a subset of sequences (by tag) to use
unique_seq_tags (bool) – uniquify seqs with same seq tags in seq order
seq_order_seq_lens_file (str|None) – for seq order, use the seq length given by this file
shuffle_frames_of_nseqs (int) – shuffles the frames. not always supported
estimated_num_seqs (None|int) – for progress reporting in case the real num_seqs is unknown
_num_shards (int) – number of shards the data is split into
_shard_index (int) – local shard index, when sharding is enabled
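To illustrate the chunking formats accepted above, here is a brief sketch; the numbers are arbitrary, and the per-key form assumes the data keys “data” and “classes”:

```python
# Equivalent ways to specify chunking (values are arbitrary examples):
chunking = "200:100"   # "chunk_size:chunk_step" as a string
chunking = (200, 100)  # the same as a tuple (chunk_size, chunk_step)
# per data key, e.g. when "classes" has a lower time resolution than "data":
chunking = ({"data": 200, "classes": 50}, {"data": 100, "classes": 25})
```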
- class returnn.datasets.cached.CachedDataset(cache_byte_size=0, **kwargs)
Bases: Dataset
Base class for datasets with caching. This is only used for the HDFDataset. Also see CachedDataset2.
- Parameters:
cache_byte_size (int)
- class returnn.datasets.cached2.CachedDataset2(**kwargs)
Bases: Dataset
Somewhat like CachedDataset, but different. Simpler in some sense. And more generic. Caching might be worse.
If you derive from this class:
- you must override _collect_single_seq
- you must set num_inputs (dense-dim of “data” key) and num_outputs (dict key -> dim, ndim-1)
- you should set labels
- handle seq ordering by overriding init_seq_order
- you can set _estimated_num_seqs
- you can set _num_seqs or _num_timesteps if you know them in advance
A minimal example of such a subclass is sketched after the parameter list below.
- Parameters:
name (str) – e.g. “train” or “eval”
window (int) – features will be of dimension window * feature_dim, as we add a context-window around. not all datasets support this option.
context_window (None|int|dict|NumbersDict|(dict,dict)) – will add this context for each chunk
chunking (None|str|int|(int,int)|dict|(dict,dict)|function) – “chunk_size:chunk_step”
seq_ordering (str|function) – “batching”-option in config. e.g. “default”, “sorted” or “random”. See self.get_seq_order_for_epoch() for more details.
fixed_random_seed (int|None) – for the shuffling, e.g. for seq_ordering=’random’. otherwise epoch will be used. useful when used as eval dataset.
random_seed_offset (int|None) – for shuffling, e.g. for seq_ordering=’random’. ignored when fixed_random_seed is set.
partition_epoch (int|None) – split the data into smaller parts per epoch
repeat_epoch (int|None) – Repeat the sequences in an epoch this many times. Useful to scale the dataset relative to other datasets, e.g. when used in CombinedDataset. Not allowed to be used in combination with partition_epoch.
seq_list_filter_file (str|None) – defines a subset of sequences (by tag) to use
unique_seq_tags (bool) – uniquify seqs with same seq tags in seq order
seq_order_seq_lens_file (str|None) – for seq order, use the seq length given by this file
shuffle_frames_of_nseqs (int) – shuffles the frames. not always supported
estimated_num_seqs (None|int) – for progress reporting in case the real num_seqs is unknown
_num_shards (int) – number of shards the data is split into
_shard_index (int) – local shard index, when sharding is enabled
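The following is a minimal sketch of a custom dataset derived from CachedDataset2, following the rules listed above. It is an illustrative assumption rather than RETURNN reference code: the class name and dimensions are made up, the data is random, and DatasetSeq is assumed to be importable from returnn.datasets.basic.

```python
# Minimal sketch (not RETURNN reference code) of a CachedDataset2 subclass.
import numpy

from returnn.datasets.cached2 import CachedDataset2
from returnn.datasets.basic import DatasetSeq  # assumed import location


class RandomToyDataset(CachedDataset2):
    """Toy dataset producing random dense features and sparse integer targets."""

    def __init__(self, num_seqs=100, feature_dim=40, num_classes=10, **kwargs):
        super().__init__(**kwargs)
        self._total_num_seqs = num_seqs
        self.num_inputs = feature_dim  # dense dim of the "data" key
        self.num_outputs = {"data": (feature_dim, 2), "classes": (num_classes, 1)}
        self.labels = {"classes": [str(i) for i in range(num_classes)]}

    def init_seq_order(self, epoch=None, seq_list=None, seq_order=None):
        """Keep the default ordering; just announce how many sequences there are."""
        super().init_seq_order(epoch=epoch, seq_list=seq_list, seq_order=seq_order)
        self._num_seqs = self._total_num_seqs
        return True

    def _collect_single_seq(self, seq_idx):
        """Produce (here: randomly generate) the data for one sequence."""
        if seq_idx >= self._num_seqs:
            return None
        seq_len = 50
        features = numpy.random.rand(seq_len, self.num_inputs).astype("float32")
        classes = numpy.random.randint(
            0, self.num_outputs["classes"][0], size=(seq_len,)
        ).astype("int32")
        return DatasetSeq(
            seq_idx=seq_idx,
            features=features,
            targets={"classes": classes},
            seq_tag="toy-seq-%i" % seq_idx,
        )
```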