returnn.datasets.cached2
Provides CachedDataset2.
- class returnn.datasets.cached2.CachedDataset2(**kwargs)
Similar to CachedDataset, but simpler in some respects and more generic. Its caching might be less effective.
If you derive from this class:
- you must override _collect_single_seq
- you must set num_inputs (dense dim of the “data” key) and num_outputs (dict key -> dim, ndim-1)
- you should set labels
- handle seq ordering by overriding init_seq_order
- you can set _estimated_num_seqs
- you can set _num_seqs or _num_timesteps if you know them in advance
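The subclass contract above can be sketched roughly as follows. This is a minimal illustration, not RETURNN code: so that it runs without RETURNN installed, it uses a hypothetical stand-in base class (`StandInCachedDataset2`) in place of CachedDataset2, and `_collect_single_seq` returns a plain dict where the real method is expected to return a sequence object. The `(dim, 2)` / `(dim, 1)` values for dense and sparse keys are an assumption based on the "(dim, ndim-1)" note.

```python
from typing import Dict, List, Optional, Tuple


class StandInCachedDataset2:
    """Hypothetical stand-in for CachedDataset2, so this runs without RETURNN."""

    def __init__(self, **kwargs):
        self.num_inputs = 0
        self.num_outputs: Dict[str, Tuple[int, int]] = {}
        self.labels: Dict[str, List[str]] = {}
        self._num_seqs: Optional[int] = None

    def init_seq_order(self, epoch=None, seq_list=None, seq_order=None):
        return True


class ToyDataset(StandInCachedDataset2):
    """Three fixed toy sequences: 2-dim dense features plus sparse class indices."""

    _SEQS = [
        ([[0.0, 1.0], [1.0, 0.0]], [0, 1]),
        ([[0.5, 0.5]], [2]),
        ([[1.0, 1.0], [0.0, 0.0], [0.2, 0.8]], [1, 1, 0]),
    ]

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.num_inputs = 2  # dense dim of the "data" key
        # key -> (dim, ndim-1); assumed here: 2 for dense, 1 for sparse
        self.num_outputs = {"data": (2, 2), "classes": (3, 1)}
        self.labels = {"classes": ["a", "b", "c"]}
        self._order = list(range(len(self._SEQS)))
        self._num_seqs = len(self._order)  # known in advance here

    def init_seq_order(self, epoch=None, seq_list=None, seq_order=None):
        super().init_seq_order(epoch=epoch, seq_list=seq_list, seq_order=seq_order)
        # Use the predefined order if given, otherwise the natural order.
        if seq_order is not None:
            self._order = list(seq_order)
        else:
            self._order = list(range(len(self._SEQS)))
        self._num_seqs = len(self._order)
        return True  # True is always safe to return

    def _collect_single_seq(self, seq_idx):
        # Return None once the end of the current order is reached.
        if seq_idx >= len(self._order):
            return None
        features, classes = self._SEQS[self._order[seq_idx]]
        return {"seq_idx": seq_idx, "data": features, "classes": classes}


dataset = ToyDataset(name="train")
dataset.init_seq_order(epoch=1, seq_order=[2, 0])
first = dataset._collect_single_seq(0)
```

In a real subclass, the base class would be imported from returnn.datasets.cached2 and the constructor would forward the parameters listed below via `**kwargs`.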
- Parameters:
name (str) – e.g. “train” or “eval”
window (int) – features will be of dimension window * feature_dim, as we add a context window around each frame. Not all datasets support this option.
context_window (None|int|dict|NumbersDict|(dict,dict)) – will add this context for each chunk
chunking (None|str|int|(int,int)|dict|(dict,dict)|function) – chunking setting; as a string, it has the form “chunk_size:chunk_step”
seq_ordering (str|function) – “batching”-option in config. e.g. “default”, “sorted” or “random”. See self.get_seq_order_for_epoch() for more details.
fixed_random_seed (int|None) – seed for the shuffling, e.g. for seq_ordering=’random’. Otherwise the epoch will be used. Useful when used as eval dataset.
random_seed_offset (int|None) – for shuffling, e.g. for seq_ordering=’random’. ignored when fixed_random_seed is set.
partition_epoch (int|None)
repeat_epoch (int|None) – Repeat the sequences in an epoch this many times. Useful to scale the dataset relative to other datasets, e.g. when used in CombinedDataset. Not allowed to be used in combination with partition_epoch.
seq_list_filter_file (str|None) – defines a subset of sequences (by tag) to use
unique_seq_tags (bool) – uniquify seqs with same seq tags in seq order
seq_order_seq_lens_file (str|None) – for seq order, use the seq length given by this file
shuffle_frames_of_nseqs (int) – shuffles the frames. not always supported
estimated_num_seqs (None|int) – for progress reporting in case the real num_seqs is unknown
_num_shards (int) – number of shards the data is split into
_shard_index (int) – local shard index, when sharding is enabled
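As an illustration of the “chunk_size:chunk_step” string form of the chunking parameter, here is a minimal sketch of how a sequence could be cut into chunks. It assumes that chunks start at frame 0, advance by chunk_step, and that the last chunk is truncated at the end of the sequence; the helper names `parse_chunking` and `chunk_bounds` are hypothetical and not part of RETURNN.

```python
from typing import List, Tuple


def parse_chunking(chunking: str) -> Tuple[int, int]:
    """Parse a "chunk_size:chunk_step" string into two ints."""
    size_str, step_str = chunking.split(":")
    return int(size_str), int(step_str)


def chunk_bounds(seq_len: int, chunking: str) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges for one sequence, end exclusive."""
    chunk_size, chunk_step = parse_chunking(chunking)
    if chunk_size <= 0 or chunk_step <= 0:
        raise ValueError("chunk_size and chunk_step must be positive")
    bounds = []
    start = 0
    while start < seq_len:
        # The last chunk may be shorter than chunk_size.
        bounds.append((start, min(start + chunk_size, seq_len)))
        start += chunk_step
    return bounds


overlapping = chunk_bounds(10, "4:2")
non_overlapping = chunk_bounds(5, "4:4")
```

With chunk_step smaller than chunk_size, consecutive chunks overlap; with both equal, they tile the sequence without overlap.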
- init_seq_order(epoch=None, seq_list=None, seq_order=None)
- Parameters:
epoch (int|None)
seq_list (list[str]|None) – List of sequence tags, to set a predefined order.
seq_order (list[int]|None) – List of corpus sequence indices, to set a predefined order. Only possible if the dataset has such indices (see self.have_corpus_seq_idx()).
- Return type:
bool
- Returns:
whether the order changed (True is always safe to return)
This is called when we start a new epoch, or at initialization. Call this when you reset the seq list.
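Since init_seq_order is called per epoch, a random sequence order is typically derived from the epoch unless a fixed seed is given. The following is a minimal sketch of that idea only, assuming the seed is simply `epoch + random_seed_offset` when no fixed_random_seed is set; the exact seed computation in RETURNN may differ, and `random_seq_order` is a hypothetical helper.

```python
import random
from typing import List, Optional


def random_seq_order(
    num_seqs: int,
    epoch: Optional[int],
    fixed_random_seed: Optional[int] = None,
    random_seed_offset: int = 0,
) -> List[int]:
    """Return a shuffled list of sequence indices for the given epoch."""
    if fixed_random_seed is not None:
        # Same order in every epoch; random_seed_offset is ignored.
        seed = fixed_random_seed
    else:
        # Assumed here: epoch-dependent seed, shifted by the offset.
        seed = (epoch or 1) + random_seed_offset
    order = list(range(num_seqs))
    random.Random(seed).shuffle(order)
    return order


eval_order_epoch1 = random_seq_order(10, epoch=1, fixed_random_seed=42)
eval_order_epoch7 = random_seq_order(10, epoch=7, fixed_random_seed=42)
```

With a fixed seed, both calls yield the same order, which is why the option is useful for eval datasets.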
- get_target_list()
Target data keys are usually not available during inference. Override this if your dataset is more custom.
- get_complete_frac(sorted_seq_idx, **kwargs)
- Returns:
fractional completion value for the given sorted_seq_idx
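A completion fraction of this kind can be sketched as below. This is an assumption about the general idea, not the actual implementation: it divides the number of sequences seen so far by the real number of sequences, falls back to an estimate when the real count is unknown, and clips at 1.0. The function name `complete_frac` is hypothetical.

```python
from typing import Optional


def complete_frac(
    sorted_seq_idx: int,
    num_seqs: Optional[int],
    estimated_num_seqs: Optional[int] = None,
) -> float:
    """Fraction of the epoch completed after the given sorted seq index."""
    total = num_seqs if num_seqs is not None else estimated_num_seqs
    if not total:
        raise ValueError("neither num_seqs nor estimated_num_seqs is known")
    # Clip, since an estimate may be smaller than the real number of seqs.
    return min((sorted_seq_idx + 1) / total, 1.0)


halfway = complete_frac(4, num_seqs=10)
```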
- class returnn.datasets.cached2.SingleStreamPipeDataset(dim, ndim, sparse=False, dtype='float32')
Producer: Gets data from somewhere / an external source, running in some thread.
Consumer: The thread / code which calls load_seqs and get_data here.
- Parameters:
dim (int)
ndim (int)
sparse (bool)
dtype (str)
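The producer/consumer split described for this class follows a common pattern, which can be sketched with the standard library alone. The code below is a generic illustration using `queue.Queue` and `threading`, not the implementation or API of SingleStreamPipeDataset; the names `producer`, `consume_all`, and `run_pipe` are hypothetical.

```python
import queue
import threading
from typing import Any, List

_FINISHED = object()  # sentinel marking the end of the stream


def producer(q: "queue.Queue", seqs: List[Any]) -> None:
    """Producer thread: pushes sequences from some source into the queue."""
    for seq in seqs:
        q.put(seq)  # blocks while the queue is full
    q.put(_FINISHED)


def consume_all(q: "queue.Queue") -> List[Any]:
    """Consumer: pulls sequences in order until the producer is finished."""
    received = []
    while True:
        item = q.get()  # blocks until data is available
        if item is _FINISHED:
            break
        received.append(item)
    return received


def run_pipe(seqs: List[Any]) -> List[Any]:
    """Run the producer in a thread and consume in the calling thread."""
    q: "queue.Queue" = queue.Queue(maxsize=2)  # small buffer
    thread = threading.Thread(target=producer, args=(q, seqs))
    thread.start()
    received = consume_all(q)
    thread.join()
    return received


result = run_pipe([[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]])
```

The bounded queue means the producer waits whenever the consumer falls behind, so memory use stays limited.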