returnn.datasets.basic
This defines the base dataset class Dataset.
- class returnn.datasets.basic.Dataset(name=None, window=1, context_window=None, chunking=None, seq_ordering='default', fixed_random_seed=None, random_seed_offset=None, partition_epoch=None, repeat_epoch=None, seq_list_filter_file=None, unique_seq_tags=False, seq_order_seq_lens_file=None, shuffle_frames_of_nseqs=0, min_chunk_size=0, chunking_variance=0, estimated_num_seqs=None, _num_shards=1, _shard_index=0)[source]¶
Base class for any dataset. This defines the dataset API.
- Parameters:
name (str) – e.g. “train” or “eval”
window (int) – features will be of dimension window * feature_dim, as we add a context-window around. not all datasets support this option.
context_window (None|int|dict|NumbersDict|(dict,dict)) – will add this context for each chunk
chunking (None|str|int|(int,int)|dict|(dict,dict)|function) – “chunk_size:chunk_step”
seq_ordering (str|function) – “batching”-option in config. e.g. “default”, “sorted” or “random”. See self.get_seq_order_for_epoch() for more details.
fixed_random_seed (int|None) – for the shuffling, e.g. for seq_ordering=’random’. otherwise epoch will be used. useful when used as eval dataset.
random_seed_offset (int|None) – for shuffling, e.g. for seq_ordering=’random’. ignored when fixed_random_seed is set.
partition_epoch (int|None)
repeat_epoch (int|None) – Repeat the sequences in an epoch this many times. Useful to scale the dataset relative to other datasets, e.g. when used in CombinedDataset. Not allowed to be used in combination with partition_epoch.
seq_list_filter_file (str|None) – defines a subset of sequences (by tag) to use
unique_seq_tags (bool) – uniquify seqs with same seq tags in seq order
seq_order_seq_lens_file (str|None) – for seq order, use the seq length given by this file
shuffle_frames_of_nseqs (int) – shuffles the frames. not always supported
estimated_num_seqs (None|int) – for progress reporting in case the real num_seqs is unknown
_num_shards (int) – number of shards the data is split into
_shard_index (int) – local shard index, when sharding is enabled
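Many of these options are typically given via a dataset dict in the config rather than by calling the constructor directly. A minimal sketch, assuming an HDFDataset and a hypothetical file name:

```
# Sketch of a config dataset dict using some of the base-class options above.
# "train.hdf" is a hypothetical file name; any Dataset subclass takes the same base options.
train = {
    "class": "HDFDataset",
    "files": ["train.hdf"],
    "partition_epoch": 5,             # split the corpus into 5 sub-epochs
    "seq_ordering": "laplace:.1000",  # length-based shuffled ordering
    "chunking": "100:50",             # "chunk_size:chunk_step"
}
```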
- static kwargs_update_from_config(config: Config, kwargs: Dict[str, Any])[source]¶
Update kwargs inplace from config
- Parameters:
config
kwargs – updates will be done inplace
- static get_default_kwargs_eval(config: Config) Dict[str, Any] [source]¶
- Parameters:
config
- Returns:
default kwargs for an eval dataset based on the config
- classmethod from_config(config: Config, **kwargs) Dataset [source]¶
- Parameters:
config
kwargs – passed on to __init__
- Returns:
new dataset via cls(…)
- set_file_cache(cache: FileCache)[source]¶
Stores the given file cache with the dataset, so that the files within the cache can be unregistered when the dataset is deinitialized.
- is_cached(start, end)[source]¶
- Parameters:
start (int) – like in load_seqs(), sorted seq idx
end (int) – like in load_seqs(), sorted seq idx
- Return type:
bool
- Returns:
whether we have the full range (start,end) of sorted seq idx
- get_seq_length(seq_idx: int) NumbersDict [source]¶
- Parameters:
seq_idx
- Returns:
the len of the input features and the len of the target sequence
- get_estimated_seq_length(seq_idx)[source]¶
In contrast to self.get_seq_length(), this method is designed to work for sequences that have not been loaded yet via self.load_seqs(). Used by meta-datasets for sequence ordering. Currently we only provide one number, i.e. do not give different estimates for the different data keys (as in get_seq_length()). It is up to the dataset what this number represents and how it is computed.
- Parameters:
seq_idx (int) – for current epoch, not the corpus seq idx
- Return type:
int
- Returns:
sequence length estimate (for sorting)
- load_seqs(start, end)[source]¶
Load data sequences, such that self.get_data() & friends can return the data.
- Parameters:
start (int) – start sorted seq idx, inclusive
end (int) – end sorted seq idx, exclusive
- get_seq_order_for_epoch(epoch: int | None, num_seqs: int, get_seq_len: Callable[[int], int] | None = None) Sequence[int] [source]¶
Returns the order of the given epoch. This is mostly a static method, except that it depends on the configured type of ordering, such as ‘default’ (= as-is), ‘sorted’ or ‘random’. ‘sorted’ also uses the sequence length.
- Parameters:
epoch – for ‘random’, this determines the random seed
num_seqs
get_seq_len – function (originalSeqIdx: int) -> int
- Returns:
the order for the given epoch. such that seq_idx -> underlying idx
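A typical use is inside a Dataset subclass’ init_seq_order(). A rough sketch, where _num_corpus_seqs and _seq_lens are hypothetical internal attributes of the subclass (the seq_list case is omitted for brevity):

```
# Sketch: how a subclass might compute its seq order for an epoch.
# self._num_corpus_seqs and self._seq_lens are hypothetical attributes.
def init_seq_order(self, epoch=None, seq_list=None, seq_order=None):
    super().init_seq_order(epoch=epoch, seq_list=seq_list, seq_order=seq_order)
    if seq_order is not None:
        self._seq_order = seq_order
    else:
        self._seq_order = self.get_seq_order_for_epoch(
            epoch=epoch,
            num_seqs=self._num_corpus_seqs,
            get_seq_len=lambda i: self._seq_lens[i])
    return True
```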
- supports_sharding() bool [source]¶
- Returns:
whether the dataset supports sharding based on the seq_order
- init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]¶
- Parameters:
seq_list (list[str]|None) – List of sequence tags, to set a predefined order.
seq_order (list[int]|None) – List of corpus sequence indices, to set a predefined order. Only possible if the dataset has such indices (see self.have_corpus_seq_idx()).
- Return type:
bool
- Returns:
whether the order changed (True is always safe to return)
This is called when we start a new epoch, or at initialization. Call this when you reset the seq list.
- finish_epoch(*, free_resources: bool = False)[source]¶
This would get called at the end of the epoch (currently optional only). After this, further calls to get_data() or load_seqs() are invalid, until a new call to init_seq_order() follows.
- get_current_seq_order()[source]¶
- Returns:
Many datasets use self.get_seq_order_for_epoch(). This function returns the current seq order for the current epoch, after self.init_seq_order() was called. Not all datasets implement this.
- Return type:
Sequence[int]
- supports_seq_order_sorting() bool [source]¶
- Returns:
whether “sorted” or “sorted_reverse” is supported for seq_ordering
- initialize()[source]¶
Does the main initialization before it can be used. This needs to be called before self.load_seqs() can be used.
- get_data(seq_idx, key) ndarray [source]¶
- Parameters:
seq_idx (int) – sorted seq idx
key (str) – data-key, e.g. “data” or “classes”
- Returns:
features or targets: format 2d (time,feature) (float)
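The usual sequential read pattern combines init_seq_order(), is_less_than_num_seqs(), load_seqs() and get_data(). A minimal sketch; the data keys “data” and “classes” are assumptions here, check get_data_keys() for the actual keys:

```
# Sketch of the standard sequential read loop over one epoch.
dataset.init_seq_order(epoch=1)
seq_idx = 0
while dataset.is_less_than_num_seqs(seq_idx):
    dataset.load_seqs(seq_idx, seq_idx + 1)
    features = dataset.get_data(seq_idx, "data")    # e.g. shape (time, feature)
    targets = dataset.get_data(seq_idx, "classes")  # e.g. shape (time,) if sparse
    seq_idx += 1
dataset.finish_epoch()  # optional
```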
- get_input_data(sorted_seq_idx: int) ndarray [source]¶
DEPRECATED: Some older classes still use this deprecated API, but any new dataset should just implement get_data(), and users also should just use get_data(). This default implementation assumes that there is a “data” data key, which is not necessarily true in all cases.
- Parameters:
sorted_seq_idx
- Returns features:
format 2d (time,feature) (float)
- get_targets(target: str, sorted_seq_idx: int) ndarray [source]¶
DEPRECATED: Some older classes still use this deprecated API, but any new dataset should just implement get_data(), and users also should just use get_data().
- Parameters:
target – data key
sorted_seq_idx
- Returns targets:
format 1d (time) (int: idx of output-feature)
- get_data_slice(seq_idx, key, start_frame, end_frame)[source]¶
- Parameters:
seq_idx (int)
key (str)
start_frame (int)
end_frame (int)
- Returns:
x[start_frame:end_frame], with x = get_data(seq_idx, key)
- Return type:
numpy.ndarray
- get_all_tags()[source]¶
- Returns:
list of all seq tags, of the whole dataset, without partition epoch. Note that this is not possible with all datasets.
- Return type:
list[str]
- get_total_num_seqs(*, fast: bool = False) int [source]¶
- Parameters:
fast – if True, might raise an exception if not possible to get fast.
- Returns:
total number of seqs, without partition epoch. Should be the same as len(self.get_all_tags()). Note that this is not possible with all datasets.
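A small sketch of corpus-level introspection (only possible for datasets which can enumerate the whole corpus):

```
# Sketch: both calls ignore partition_epoch and refer to the whole corpus.
tags = dataset.get_all_tags()
assert len(tags) == dataset.get_total_num_seqs()
```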
- have_corpus_seq_idx()[source]¶
- Return type:
bool
- Returns:
whether you can call self.get_corpus_seq_idx()
- get_corpus_seq_idx(seq_idx)[source]¶
- Parameters:
seq_idx (int) – sorted sequence index from the current epoch, depending on seq_ordering
- Returns:
the sequence index as-is in the original corpus (as if you would have sorting=”default”). only defined if self.have_corpus_seq_idx()
- Return type:
int
- have_get_corpus_seq() bool [source]¶
- Returns:
whether you can call get_corpus_seq()
- get_corpus_seq(corpus_seq_idx: int) DatasetSeq [source]¶
This function allows random access directly into the corpus. Only implement this if such random access is possible in a reasonably efficient way. This allows writing map-style wrapper datasets around such RETURNN datasets.
- Parameters:
corpus_seq_idx – corresponds to the output of get_corpus_seq_idx()
- Returns:
data
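A sketch of map-style random access, guarded by have_get_corpus_seq():

```
# Sketch: random access into the corpus, independent of the current seq_ordering.
if dataset.have_get_corpus_seq():
    seq = dataset.get_corpus_seq(0)  # corpus seq idx 0
    print(seq.seq_idx, seq.seq_tag)
```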
- classmethod generic_complete_frac(seq_idx, num_seqs)[source]¶
- Parameters:
seq_idx (int) – idx
num_seqs (int|None) – None if not available
- Returns:
a fraction (float in [0,1], always > 0) of how far we have advanced for this seq in the dataset. This does not have to be exact. This is only for the user.
- get_complete_frac(sorted_seq_idx: int, *, allow_only_lr_suitable: bool = False) float | None [source]¶
Tries to calculate exactly how much of the current epoch is completed when having processed seq sorted_seq_idx. sorted_seq_idx cannot be less than the seq index of the previously loaded seqs.
- Parameters:
sorted_seq_idx – sorted seq idx
allow_only_lr_suitable – only return a value when that value is suitable/accurate enough to base LR scheduling on. If false, this function will return an approximate value when the exact value cannot be calculated (due to unknown num_seqs). Approximate values can be appropriate for e.g. progress bars.
- Returns:
continuous value in (0, 1] which represents how much of the current epoch is completed after sorted_seq_idx. If allow_only_lr_suitable=True, returns None if the value cannot be calculated accurately enough for LR scheduling; otherwise epoch_continuous is based on it for any dynamic learning rate scheduling. As sorted_seq_idx is monotonic, the return value is also guaranteed to be monotonic.
- property estimated_num_seqs[source]¶
- Returns:
estimated num seqs. does not have to be exact
- Return type:
int|None
- get_data_keys()[source]¶
- Returns:
all available data keys (for get_data and all other functions)
- Return type:
list[str]
- get_target_list()[source]¶
- Returns:
subset of get_data_keys(). Target keys are usually not available during inference.
- Return type:
list[str]
- get_data_dim(key)[source]¶
- Parameters:
key (str) – e.g. “data” or “classes”
- Returns:
number of classes, no matter if sparse or not
- Return type:
int
- get_data_dtype(key)[source]¶
- Parameters:
key (str) – e.g. “data” or “classes”
- Returns:
dtype as str, e.g. “int32” or “float32”
- Return type:
str
- is_data_sparse(key)[source]¶
- Parameters:
key (str) – e.g. “data” or “classes”
- Returns:
whether the data is sparse
- Return type:
bool
- get_data_shape(key: str) List[int] [source]¶
- Returns:
get_data(*, key).shape[1:], i.e. num-frames excluded
- len_info(*, fast: bool = False) str [source]¶
- Returns:
string to present the user as information about our len.
- is_less_than_num_seqs(n)[source]¶
- Return type:
bool
- Returns:
whether n < num_seqs. In case num_seqs is not known in advance, it will wait until it knows that n is behind the end or that we have the seq.
- can_serialize_data(key: str) bool [source]¶
- Parameters:
key – e.g. “classes”
- Returns:
whether serialize_data() is implemented for this key
- serialize_data(key: str, data: ndarray) str [source]¶
In case you have a Vocabulary, just use Vocabulary.get_seq_labels().
- Parameters:
key – e.g. “classes”. self.labels[key] should be set
data (numpy.ndarray) – 0D or 1D
- Returns:
serialized data
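A sketch of typical usage, assuming the dataset defines labels for the “classes” key and seq_idx comes from a read loop as for get_data() above:

```
# Sketch: turn a sparse target sequence back into a readable string, if supported.
if dataset.can_serialize_data("classes"):
    targets = dataset.get_data(seq_idx, "classes")
    print(dataset.serialize_data("classes", targets))
```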
- iterate_seqs(recurrent_net=True, used_data_keys=None)[source]¶
Takes chunking into consideration.
- Parameters:
recurrent_net (bool) – whether the order of frames matter
used_data_keys (set(str)|None)
- Returns:
generator which yields tuples (seq index, seq start, seq end)
- Return type:
list[(int,NumbersDict,NumbersDict)]
- get_start_end_frames_full_seq(seq_idx)[source]¶
- Parameters:
seq_idx (int)
- Returns:
(start,end) frame, taking context_window into account
- Return type:
(NumbersDict,NumbersDict)
- batch_set_generator_cache_whole_epoch()[source]¶
The BatchSetGenerator can cache the list of batches which we generated across epochs. See self.generate_batches() and self._generate_batches(). In many cases, the dataset does not support this, and in that case there is no need to enable this cache and waste memory. Caching it together with the option shuffle_batches could also mean that there will be self.load_seqs() calls with non-monotonic seq-idxs. The only dataset which currently enables this is CachedDataset, and thus HDFDataset.
- Returns:
whether we should enable this cache
- Return type:
bool
- class returnn.datasets.basic.DatasetSeq(seq_idx: int, features, *, targets=None, seq_tag: str | None = None, complete_frac: float | None = None)[source]¶
Encapsulates all data for one sequence.
- Parameters:
seq_idx – sorted seq idx in the Dataset
features (numpy.ndarray|dict[str,numpy.ndarray]) – format 2d (time,feature) (float)
targets (dict[str,numpy.ndarray]|numpy.ndarray|None) – name -> format 1d (time) (idx of output-feature)
seq_tag – sequence name / tag
complete_frac – continuous value in (0, 1] which represents how much of the current epoch has been consumed when this seq is processed
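A minimal construction sketch with made-up shapes (100 frames of 40-dim features, 20 sparse target labels):

```
import numpy
from returnn.datasets.basic import DatasetSeq

# Sketch: one sequence with dense features and sparse targets.
seq = DatasetSeq(
    seq_idx=0,
    features=numpy.zeros((100, 40), dtype="float32"),
    targets={"classes": numpy.zeros((20,), dtype="int32")},
    seq_tag="seq-0",
)
```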
- returnn.datasets.basic.get_dataset_class(name: str | Type[Dataset]) Type[Dataset] | None [source]¶
- Parameters:
name (str|type)
- returnn.datasets.basic.init_dataset(kwargs: Dict[str, Any] | str | Callable[[], Dict[str, Any]] | Dataset, extra_kwargs: Dict[str, Any] | None = None, default_kwargs: Dict[str, Any] | None = None, *, parent_dataset: Dataset | None = None) Dataset [source]¶
- Parameters:
kwargs
extra_kwargs
default_kwargs
parent_dataset – if given, will adapt some of the default_kwargs (when not set)
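Typical usage is to pass a dataset dict as it would appear in a config. A sketch; the HDFDataset class and the file name are only examples:

```
from returnn.datasets.basic import init_dataset

# Sketch: create and prepare a dataset from a dict.
dataset = init_dataset({"class": "HDFDataset", "files": ["train.hdf"]})
dataset.init_seq_order(epoch=1)
```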
- returnn.datasets.basic.extend_dataset_dict_from_parent_dataset(dataset_dict: Dict[str, Any], parent_dataset: Dataset | None) Dict[str, Any] [source]¶
- Parameters:
dataset_dict
parent_dataset
- Returns:
extended dataset_dict
- returnn.datasets.basic.init_dataset_via_str(config_str, config=None, cache_byte_size=None, **kwargs)[source]¶
- Parameters:
config_str (str) – hdf-files, or “LmDataset:…” or so
config (returnn.config.Config|None) – optional, only for “sprint:…”
cache_byte_size (int|None) – optional, only for HDFDataset
- Return type:
Dataset
- returnn.datasets.basic.convert_data_dims(data_dims, leave_dict_as_is=False)[source]¶
This converts what we called num_outputs originally, from the various formats which were allowed in the past (just an int, or dict[str,int]) into the format which we currently expect. In all cases, the output will be a new copy of the dict.
- Parameters:
data_dims (int|dict[str,int|(int,int)|dict]) – what we called num_outputs originally
leave_dict_as_is (bool)
- Return type:
dict[str,(int,int)|dict]
- Returns:
dict data-key -> (data-dimension, len(shape) (1 ==> sparse)) (or potentially data-key -> dict, if leave_dict_as_is is True; for TensorFlow)
- returnn.datasets.basic.shapes_for_batches(batches: Sequence[Batch], *, data_keys: Sequence[str], dataset: Dataset | None = None, extern_data: TensorDict | None, enforce_min_len1: bool = False) Dict[str, List[int]] | None [source]¶
- Parameters:
batches
data_keys
dataset
extern_data – detailed data description
enforce_min_len1
- returnn.datasets.basic.set_config_extern_data_from_dataset(config, dataset)[source]¶
- Parameters:
config (returnn.config.Config)
dataset (Dataset)