Datasets
All datasets in RETURNN are based on returnn.datasets.Dataset, and most are also based on either returnn.datasets.cached.CachedDataset or returnn.datasets.cached2.CachedDataset2.
The common parameters that can be used across most datasets are:
- partition_epoch: split the data into smaller parts per epoch
- seq_ordering: define the sequence ordering of the data
Possible values for the sequence ordering are:
- default: Keep the sequences as is
- reverse: Use the default sequences in reversed order
- random: Shuffle the data with a predefined fixed seed
- random:<seed>: Shuffle the data with the given seed
- sorted: Sort by length (only if available), beginning with the shortest sequences
- sorted_reverse: Sort by length, beginning with the longest sequences
- laplace:<n_buckets>: Shuffle the data and sort by length within each of n buckets; every second bucket is sorted in reverse
- laplace:.<n_sequences>: As above, but the number of buckets is chosen such that each bucket contains roughly n sequences
- laplace:<n_buckets>:<seed>: A seed can be provided for both laplace variants, separated by an additional colon
Note that not all sequence order modes are available for all datasets,
and some datasets may provide additional modes.
For details on the different sequence orderings, have a look at Dataset.get_seq_order_for_epoch().
Also check the sequence ordering possibilities with the MetaDataset.
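As an illustration, these common options can be set directly in the dataset dictionary of a RETURNN config. The following is a minimal sketch, assuming an HDFDataset; the file name and the concrete values are placeholders:

```python
# Sketch of a dataset definition in a RETURNN config.
# The file name and the numbers are placeholders; adjust them to your data.
train = {
    "class": "HDFDataset",
    "files": ["my-train-data.hdf"],
    # use 1/20 of the data per (sub-)epoch, cycling through all parts over 20 sub-epochs
    "partition_epoch": 20,
    # shuffle, then sort by length within buckets of roughly 1000 sequences each;
    # every second bucket is sorted in reverse (laplace ordering)
    "seq_ordering": "laplace:.1000",
}
```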
- class returnn.datasets.Dataset(name=None, window=1, context_window=None, chunking=None, seq_ordering='default', fixed_random_seed=None, random_seed_offset=None, partition_epoch=None, repeat_epoch=None, seq_list_filter_file=None, unique_seq_tags=False, seq_order_seq_lens_file=None, shuffle_frames_of_nseqs=0, min_chunk_size=0, chunking_variance=0, estimated_num_seqs=None, _num_shards=1, _shard_index=0)
Bases: object
Base class for any dataset. This defines the dataset API.
- Parameters:
name (str) – e.g. “train” or “eval”
window (int) – features will be of dimension window * feature_dim, as we add a context-window around. not all datasets support this option.
context_window (None|int|dict|NumbersDict|(dict,dict)) – will add this context for each chunk
chunking (None|str|int|(int,int)|dict|(dict,dict)|function) – “chunk_size:chunk_step”
seq_ordering (str|function) – “batching”-option in config. e.g. “default”, “sorted” or “random”. See self.get_seq_order_for_epoch() for more details.
fixed_random_seed (int|None) – for the shuffling, e.g. for seq_ordering=’random’. otherwise epoch will be used. useful when used as eval dataset.
random_seed_offset (int|None) – for shuffling, e.g. for seq_ordering=’random’. ignored when fixed_random_seed is set.
partition_epoch (int|None) – split the data into smaller parts per epoch
repeat_epoch (int|None) – Repeat the sequences in an epoch this many times. Useful to scale the dataset relative to other datasets, e.g. when used in CombinedDataset. Not allowed to be used in combination with partition_epoch.
seq_list_filter_file (str|None) – defines a subset of sequences (by tag) to use
unique_seq_tags (bool) – uniquify seqs with same seq tags in seq order
seq_order_seq_lens_file (str|None) – for seq order, use the seq length given by this file
shuffle_frames_of_nseqs (int) – shuffles the frames. not always supported
estimated_num_seqs (None|int) – for progress reporting in case the real num_seqs is unknown
_num_shards (int) – number of shards the data is split into
_shard_index (int) – local shard index, when sharding is enabled
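To illustrate the chunking formats accepted above, here is a brief sketch; the numbers are arbitrary, and the per-key form assumes the data keys “data” and “classes”:

```python
# Equivalent ways to specify chunking (values are arbitrary examples):
chunking = "200:100"   # "chunk_size:chunk_step" as a string
chunking = (200, 100)  # the same as a tuple (chunk_size, chunk_step)
# per data key, e.g. when "classes" has a lower time resolution than "data":
chunking = ({"data": 200, "classes": 50}, {"data": 100, "classes": 25})
```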
- class returnn.datasets.cached.CachedDataset(cache_byte_size=0, **kwargs)
Bases: Dataset
Base class for datasets with caching. This is only used for the HDFDataset. Also see CachedDataset2.
- Parameters:
cache_byte_size (int)
- class returnn.datasets.cached2.CachedDataset2(**kwargs)
Bases: Dataset
Somewhat like CachedDataset, but different. Simpler in some sense. And more generic. Caching might be worse.
If you derive from this class:
- you must override _collect_single_seq
- you must set num_inputs (dense-dim of “data” key) and num_outputs (dict key -> dim, ndim-1)
- you should set labels
- handle seq ordering by overriding init_seq_order
- you can set _estimated_num_seqs
- you can set _num_seqs or _num_timesteps if you know them in advance
A minimal example of such a subclass is sketched after the parameter list below.
- Parameters:
name (str) – e.g. “train” or “eval”
window (int) – features will be of dimension window * feature_dim, as we add a context-window around. not all datasets support this option.
context_window (None|int|dict|NumbersDict|(dict,dict)) – will add this context for each chunk
chunking (None|str|int|(int,int)|dict|(dict,dict)|function) – “chunk_size:chunk_step”
seq_ordering (str|function) – “batching”-option in config. e.g. “default”, “sorted” or “random”. See self.get_seq_order_for_epoch() for more details.
fixed_random_seed (int|None) – for the shuffling, e.g. for seq_ordering=’random’. otherwise epoch will be used. useful when used as eval dataset.
random_seed_offset (int|None) – for shuffling, e.g. for seq_ordering=’random’. ignored when fixed_random_seed is set.
partition_epoch (int|None) – split the data into smaller parts per epoch
repeat_epoch (int|None) – Repeat the sequences in an epoch this many times. Useful to scale the dataset relative to other datasets, e.g. when used in CombinedDataset. Not allowed to be used in combination with partition_epoch.
seq_list_filter_file (str|None) – defines a subset of sequences (by tag) to use
unique_seq_tags (bool) – uniquify seqs with same seq tags in seq order
seq_order_seq_lens_file (str|None) – for seq order, use the seq length given by this file
shuffle_frames_of_nseqs (int) – shuffles the frames. not always supported
estimated_num_seqs (None|int) – for progress reporting in case the real num_seqs is unknown
_num_shards (int) – number of shards the data is split into
_shard_index (int) – local shard index, when sharding is enabled
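The following is a minimal sketch of a custom dataset derived from CachedDataset2, following the rules listed above. It is an illustrative assumption rather than RETURNN reference code: the class name and dimensions are made up, the data is random, and DatasetSeq is assumed to be importable from returnn.datasets.basic.

```python
# Minimal sketch (not RETURNN reference code) of a CachedDataset2 subclass.
import numpy

from returnn.datasets.cached2 import CachedDataset2
from returnn.datasets.basic import DatasetSeq  # assumed import location


class RandomToyDataset(CachedDataset2):
    """Toy dataset producing random dense features and sparse integer targets."""

    def __init__(self, num_seqs=100, feature_dim=40, num_classes=10, **kwargs):
        super().__init__(**kwargs)
        self._total_num_seqs = num_seqs
        self.num_inputs = feature_dim  # dense dim of the "data" key
        self.num_outputs = {"data": (feature_dim, 2), "classes": (num_classes, 1)}
        self.labels = {"classes": [str(i) for i in range(num_classes)]}

    def init_seq_order(self, epoch=None, seq_list=None, seq_order=None):
        """Keep the default ordering; just announce how many sequences there are."""
        super().init_seq_order(epoch=epoch, seq_list=seq_list, seq_order=seq_order)
        self._num_seqs = self._total_num_seqs
        return True

    def _collect_single_seq(self, seq_idx):
        """Produce (here: randomly generate) the data for one sequence."""
        if seq_idx >= self._num_seqs:
            return None
        seq_len = 50
        features = numpy.random.rand(seq_len, self.num_inputs).astype("float32")
        classes = numpy.random.randint(
            0, self.num_outputs["classes"][0], size=(seq_len,)
        ).astype("int32")
        return DatasetSeq(
            seq_idx=seq_idx,
            features=features,
            targets={"classes": classes},
            seq_tag="toy-seq-%i" % seq_idx,
        )
```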