All datasets in RETURNN are based on the
and most are also based on either
The common parameters that can be used across most datasets are:
- partition_epoch: Split the data into smaller parts, one part per epoch.
- seq_ordering: Define the sequence ordering of the data.
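As an illustration, the two options might be set in a dataset dict of a config as sketched below. This is only an example under assumptions: the dataset class "HDFDataset", the file name and the chosen values are placeholders and are not taken from this section.

```python
# Hypothetical dataset definition in a RETURNN config file.
# "HDFDataset" and the file name are placeholders for illustration only.
train = {
    "class": "HDFDataset",
    "files": ["train.hdf"],  # placeholder path
    "partition_epoch": 20,  # the data is split into 20 parts, one per epoch
    "seq_ordering": "laplace:.1000",  # roughly 1000 sequences per bin
}
```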
Possible values for the sequence ordering are:
- default: Keep the sequences as they are.
- reverse: Use the default sequence order, reversed.
- random: Shuffle the data with a predefined fixed seed.
- random:<seed>: Shuffle the data with the given seed.
- sorted: Sort by length (only if available), beginning with the shortest sequences.
- sorted_reverse: Sort by length, beginning with the longest sequences.
- laplace:<n_buckets>: Shuffle the data, split it into n bins and sort by length within each bin. Every second bin is sorted in reverse.
- laplace:.<n_sequences>: As above, but the number of bins is chosen such that each bin contains roughly n sequences.
- laplace:<n_buckets>:<seed>: A seed can be provided for both laplace versions, separated by an additional colon.
Note that not all sequence order modes are available for all datasets,
and some datasets may provide additional modes.
For details on the different sequence orderings, have a look at
Also check the sequence ordering possibilities with the MetaDataset.
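The ordering modes listed above can be sketched in plain Python as follows. This is a simplified stand-in written for illustration, not RETURNN's implementation: the function name order_sequences, the default seed and the exact way bins are cut are assumptions, so the concrete shuffled orders will differ from what a real dataset produces.

```python
import random


def order_sequences(seq_lens, seq_ordering, default_seed=1):
    """Return sequence indices ordered according to ``seq_ordering``.

    Simplified stand-in for illustration; not RETURNN's implementation.
    ``seq_lens[i]`` is the length of sequence ``i``.
    """
    num_seqs = len(seq_lens)
    indices = list(range(num_seqs))
    mode, _, args = seq_ordering.partition(":")

    if mode == "default":
        return indices
    if mode == "reverse":
        return indices[::-1]
    if mode == "random":
        seed = int(args) if args else default_seed
        random.Random(seed).shuffle(indices)
        return indices
    if mode == "sorted":
        return sorted(indices, key=lambda i: seq_lens[i])
    if mode == "sorted_reverse":
        return sorted(indices, key=lambda i: seq_lens[i], reverse=True)
    if mode == "laplace":
        parts = args.split(":")
        seed = int(parts[1]) if len(parts) > 1 else default_seed
        if parts[0].startswith("."):
            # "laplace:.<n_sequences>": derive the bin count from the bin size.
            bin_size = int(parts[0][1:])
            num_bins = max(num_seqs // bin_size, 1)
        else:
            num_bins = int(parts[0])
        random.Random(seed).shuffle(indices)
        ordered = []
        for b in range(num_bins):
            start = b * num_seqs // num_bins
            end = (b + 1) * num_seqs // num_bins
            # Every second bin is sorted in reverse, giving a zig-zag in length.
            ordered += sorted(
                indices[start:end],
                key=lambda i: seq_lens[i],
                reverse=(b % 2 == 1),
            )
        return ordered
    raise ValueError("unknown seq_ordering: %r" % seq_ordering)
```

For example, order_sequences([5, 3, 8, 1, 9, 2], "sorted") returns [3, 5, 1, 0, 2, 4]. Sorting alternate bins in opposite directions keeps neighbouring sequences similar in length while avoiding a jump from the longest to the shortest sequence at bin boundaries.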
Dataset(name=None, window=1, context_window=None, chunking=None, seq_ordering='default', random_seed_offset=None, partition_epoch=None, repeat_epoch=None, seq_list_filter_file=None, unique_seq_tags=False, seq_order_seq_lens_file=None, shuffle_frames_of_nseqs=0, min_chunk_size=0, chunking_variance=0, estimated_num_seqs=None)
Base class for any dataset. This defines the dataset API.
- name (str) – e.g. “train” or “eval”
- window (int) – features will be of dimension window * feature_dim, as a context window is added around each frame. Not all datasets support this option.
- context_window (None|int|dict|NumbersDict|(dict,dict)) – will add this context for each chunk
- chunking (None|str|int|(int,int)|dict|(dict,dict)) – “chunk_size:chunk_step”
- seq_ordering (str) – “batching”-option in config. e.g. “default”, “sorted” or “random”. See self.get_seq_order_for_epoch() for more details.
- random_seed_offset (int|None) –
- partition_epoch (int|None) –
- repeat_epoch (int|None) – Repeat the sequences in an epoch this many times. Useful to scale the dataset relative to other datasets, e.g. when used in CombinedDataset. Not allowed to be used in combination with partition_epoch.
- seq_list_filter_file (str|None) – defines a subset of sequences (by tag) to use
- unique_seq_tags (bool) – uniquify seqs with same seq tags in seq order
- seq_order_seq_lens_file (str|None) – for seq order, use the seq length given by this file
- shuffle_frames_of_nseqs (int) – shuffles the frames. Not always supported.
- estimated_num_seqs (None|int) – for progress reporting in case the real num_seqs is unknown
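To make the “chunk_size:chunk_step” format of the chunking parameter more concrete, here is a minimal sketch of how a single sequence could be cut into chunks. The helper names parse_chunking and chunk_bounds are hypothetical, and the sketch assumes plain integer sizes only. It ignores context_window, min_chunk_size, chunking_variance and the dict variants, whose exact behaviour is not described here.

```python
def parse_chunking(chunking):
    """Parse a "chunk_size:chunk_step" string into two ints."""
    size_str, step_str = chunking.split(":")
    return int(size_str), int(step_str)


def chunk_bounds(seq_len, chunking):
    """Return (start, end) frame ranges of the chunks of one sequence.

    Hypothetical helper for illustration; not part of the RETURNN API.
    """
    chunk_size, chunk_step = parse_chunking(chunking)
    bounds = []
    start = 0
    while start < seq_len:
        # The last chunk is cut off at the end of the sequence.
        bounds.append((start, min(start + chunk_size, seq_len)))
        start += chunk_step
    return bounds
```

With a step smaller than the size, consecutive chunks overlap: chunk_bounds(10, "4:2") yields (0, 4), (2, 6), (4, 8), (6, 10) and (8, 10).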
Parameters: cache_byte_size (int) –
Somewhat like CachedDataset, but simpler in some sense and more generic. Caching might be worse.
If you derive from this class:
- you must override _collect_single_seq
- you must set num_inputs (dense-dim of “data” key) and num_outputs (dict key -> dim, ndim-1)
- you should set labels
- handle seq ordering by overriding init_seq_order
- you can set _estimated_num_seqs
- you can set _num_seqs or _num_timesteps if you know them in advance