returnn.datasets.basic#
This defines the base dataset class Dataset.
- class returnn.datasets.basic.Dataset(name=None, window=1, context_window=None, chunking=None, seq_ordering='default', fixed_random_seed=None, random_seed_offset=None, partition_epoch=None, repeat_epoch=None, seq_list_filter_file=None, unique_seq_tags=False, seq_order_seq_lens_file=None, shuffle_frames_of_nseqs=0, min_chunk_size=0, chunking_variance=0, estimated_num_seqs=None)[source]#
Base class for any dataset. This defines the dataset API.
- Parameters:
name (str) – e.g. “train” or “eval”
window (int) – features will be of dimension window * feature_dim, as we add a context-window around. not all datasets support this option.
context_window (None|int|dict|NumbersDict|(dict,dict)) – will add this context for each chunk
chunking (None|str|int|(int,int)|dict|(dict,dict)|function) – “chunk_size:chunk_step”
seq_ordering (str) – “batching”-option in config. e.g. “default”, “sorted” or “random”. See self.get_seq_order_for_epoch() for more details.
fixed_random_seed (int|None) – for the shuffling, e.g. for seq_ordering=’random’. otherwise epoch will be used. useful when used as eval dataset.
random_seed_offset (int|None) – for shuffling, e.g. for seq_ordering=’random’. ignored when fixed_random_seed is set.
partition_epoch (int|None) – if set, splits the dataset into this many parts, and one epoch covers only one part, i.e. a full pass over the data takes this many epochs
repeat_epoch (int|None) – Repeat the sequences in an epoch this many times. Useful to scale the dataset relative to other datasets, e.g. when used in CombinedDataset. Not allowed to be used in combination with partition_epoch.
seq_list_filter_file (str|None) – defines a subset of sequences (by tag) to use
unique_seq_tags (bool) – uniquify seqs with same seq tags in seq order
seq_order_seq_lens_file (str|None) – for seq order, use the seq length given by this file
shuffle_frames_of_nseqs (int) – shuffles the frames. not always supported
estimated_num_seqs (None|int) – for progress reporting in case the real num_seqs is unknown
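For illustration, a dataset is typically configured as a dict of these options in a RETURNN config; a minimal sketch, where the dataset class and file name are placeholders and not part of this API:

    # Hypothetical dataset entry in a RETURNN config.
    train = {
        "class": "HDFDataset",     # placeholder: any registered dataset class
        "files": ["train.hdf"],    # placeholder file name
        "seq_ordering": "random",  # see get_seq_order_for_epoch() below
        "partition_epoch": 5,      # one epoch covers a fifth of the corpus
        "chunking": "100:50",      # "chunk_size:chunk_step"
    }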
- static get_default_kwargs_eval(config)[source]#
- Parameters:
config (returnn.config.Config) –
- Return type:
dict[str]
- classmethod from_config(config, **kwargs)[source]#
- Parameters:
config (returnn.config.Config) –
kwargs (dict[str]) – passed on to __init__
- Return type:
Dataset
- is_cached(start, end)[source]#
- Parameters:
start (int) – like in load_seqs(), sorted seq idx
end (int) – like in load_seqs(), sorted seq idx
- Returns:
whether we have the full range (start, end) of sorted seq idx
- Return type:
bool
- get_seq_length(seq_idx: int) → NumbersDict [source]#
- Parameters:
seq_idx (int) –
- Returns:
the length of the input features and the length of the target sequence
- get_estimated_seq_length(seq_idx)[source]#
In contrast to self.get_seq_length(), this method is designed to work for sequences that have not been loaded yet via self.load_seqs(). Used by meta-datasets for sequence ordering. Currently we only provide one number, i.e. do not give different estimates for the different data keys (as in get_seq_length()). It is up to the dataset what this number represents and how it is computed.
- Parameters:
seq_idx (int) – for current epoch, not the corpus seq idx
- Returns:
sequence length estimate (for sorting)
- Return type:
int
- load_seqs(start, end)[source]#
Load data sequences, such that self.get_data() & friends can return the data.
- Parameters:
start (int) – start sorted seq idx, inclusive
end (int) – end sorted seq idx, exclusive
- get_seq_order_for_epoch(epoch, num_seqs, get_seq_len=None)[source]#
Returns the order of the given epoch. This is mostly a static method, except that it depends on the configured type of ordering, such as ‘default’ (= as-is), ‘sorted’ or ‘random’. ‘sorted’ also uses the sequence length.
- Parameters:
epoch (int|None) – for ‘random’, this determines the random seed
num_seqs (int) –
get_seq_len (((int) -> int)|None) – function (originalSeqIdx: int) -> int
- Returns:
the order for the given epoch. such that seq_idx -> underlying idx
- Return type:
Sequence[int]
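A subclass would typically call this helper from its init_seq_order(); a rough sketch, where self._seq_lens is an assumed attribute of the hypothetical subclass:

    from returnn.datasets.basic import Dataset

    # Sketch: a hypothetical Dataset subclass ordering its seqs.
    class MyDataset(Dataset):
        def init_seq_order(self, epoch=None, seq_list=None, seq_order=None):
            super().init_seq_order(epoch=epoch, seq_list=seq_list, seq_order=seq_order)
            self._seq_order = self.get_seq_order_for_epoch(
                epoch=epoch,
                num_seqs=len(self._seq_lens),             # assumed attribute
                get_seq_len=lambda i: self._seq_lens[i])  # orig seq idx -> len
            return True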
- init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]#
This is called when we start a new epoch, or at initialization. Call this when you reset the seq list.
- Parameters:
epoch (int|None) –
seq_list (list[str]|None) – List of sequence tags, to set a predefined order.
seq_order (list[int]|None) – List of corpus sequence indices, to set a predefined order. Only possible if the dataset has such indices (see self.have_corpus_seq_idx()).
- Returns:
whether the order changed (True is always safe to return)
- Return type:
bool
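Putting initialize(), init_seq_order(), load_seqs() and get_data() together, one epoch can be iterated roughly like this, where dataset stands for any Dataset instance (a sketch, not a complete training loop):

    # Minimal sketch of one epoch over a Dataset instance.
    dataset.initialize()
    dataset.init_seq_order(epoch=1)
    seq_idx = 0
    while dataset.is_less_than_num_seqs(seq_idx):  # num_seqs may be unknown upfront
        dataset.load_seqs(seq_idx, seq_idx + 1)    # load sorted seq range [start, end)
        features = dataset.get_data(seq_idx, "data")
        seq_idx += 1
    dataset.finish_epoch()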
- finish_epoch()[source]#
This would get called at the end of the epoch (currently optional only). After this, further calls to get_data() or load_seqs() are invalid, until a new call to init_seq_order() follows.
- get_current_seq_order()[source]#
- Returns:
Many datasets use self.get_seq_order_for_epoch(). This function returns the current seq order for the current epoch, after self.init_seq_order() was called. Not all datasets implement this.
- Return type:
Sequence[int]
- supports_seq_order_sorting() → bool [source]#
- Returns:
whether “sorted” or “sorted_reverse” is supported for seq_ordering
- initialize()[source]#
Does the main initialization before it can be used. This needs to be called before self.load_seqs() can be used.
- get_data(seq_idx, key) → ndarray [source]#
- Parameters:
seq_idx (int) – sorted seq idx
key (str) – data-key, e.g. “data” or “classes”
- Returns:
features or targets: format 2d (time,feature) (float)
- get_input_data(sorted_seq_idx)[source]#
- Parameters:
sorted_seq_idx (int) –
- Returns:
features, format 2d (time,feature) (float)
- Return type:
numpy.ndarray
- get_targets(target, sorted_seq_idx)[source]#
- Parameters:
target (str) – data key
sorted_seq_idx (int) –
- Returns:
targets, format 1d (time) (int: idx of output-feature)
- Return type:
numpy.ndarray
- get_data_slice(seq_idx, key, start_frame, end_frame)[source]#
- Parameters:
seq_idx (int) –
key (str) –
start_frame (int) –
end_frame (int) –
- Returns:
x[start_frame:end_frame], with x = get_data(seq_idx, key)
- Return type:
numpy.ndarray
- get_all_tags()[source]#
- Returns:
list of all seq tags, of the whole dataset, without partition epoch. Note that this is not possible with all datasets.
- Return type:
list[str]
- get_total_num_seqs() → int [source]#
- Returns:
total number of seqs, without partition epoch. Should be the same as len(self.get_all_tags()). Note that this is not possible with all datasets.
- have_corpus_seq_idx()[source]#
- Return type:
bool
- Returns:
whether you can call self.get_corpus_seq_idx()
- get_corpus_seq_idx(seq_idx)[source]#
- Parameters:
seq_idx (int) – sorted sequence index from the current epoch, depending on seq_ordering
- Returns:
the sequence index as-is in the original corpus (as if you would have sorting=”default”). only defined if self.have_corpus_seq_idx()
- Return type:
int
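For datasets that also implement get_current_seq_order(), this mapping should be consistent with the returned order; a hedged sketch of that relationship (an assumption, not guaranteed for every dataset):

    # Assumed relationship between the two seq-order accessors.
    if dataset.have_corpus_seq_idx():
        order = dataset.get_current_seq_order()
        assert dataset.get_corpus_seq_idx(0) == order[0]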
- have_get_corpus_seq() → bool [source]#
- Returns:
whether you can call get_corpus_seq()
- get_corpus_seq(corpus_seq_idx: int) → DatasetSeq [source]#
This function allows random access directly into the corpus. Only implement this if such random access is possible in a reasonably efficient way. This allows writing map-style wrapper datasets around such RETURNN datasets.
- Parameters:
corpus_seq_idx – corresponds to the output of get_corpus_seq_idx()
- Returns:
data
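As an illustration of such a map-style wrapper, a minimal hypothetical sketch (only the Dataset calls are part of this API; the wrapper class itself is not):

    # Hypothetical map-style wrapper around a RETURNN Dataset.
    class MapStyleWrapper:
        def __init__(self, dataset):
            assert dataset.have_get_corpus_seq()  # random access must be supported
            self.dataset = dataset

        def __len__(self):
            # Assumes the dataset supports get_total_num_seqs().
            return self.dataset.get_total_num_seqs()

        def __getitem__(self, corpus_seq_idx):
            return self.dataset.get_corpus_seq(corpus_seq_idx)  # -> DatasetSeq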
- classmethod generic_complete_frac(seq_idx, num_seqs)[source]#
- Parameters:
seq_idx (int) – idx
num_seqs (int|None) – None if not available
- Returns:
a fraction (float in [0,1], always > 0) of how far we have advanced for this seq in the dataset. This does not have to be exact. This is only for the user.
- get_complete_frac(seq_idx)[source]#
- Parameters:
seq_idx (int) –
- Returns:
a fraction (float in [0,1], always > 0) of how far we have advanced for this seq in the dataset. This does not have to be exact. This is only for the user.
- Return type:
float
- property estimated_num_seqs[source]#
- Returns:
estimated num seqs. does not have to be exact
- Return type:
int|None
- get_data_keys()[source]#
- Returns:
all available data keys (for get_data and all other functions)
- Return type:
list[str]
- get_target_list()[source]#
- Returns:
subset of get_data_keys(). Target keys are usually not available during inference.
- Return type:
list[str]
- get_data_dim(key)[source]#
- Parameters:
key (str) – e.g. “data” or “classes”
- Returns:
number of classes, no matter if sparse or not
- Return type:
int
- get_data_dtype(key)[source]#
- Parameters:
key (str) – e.g. “data” or “classes”
- Returns:
dtype as str, e.g. “int32” or “float32”
- Return type:
str
- is_data_sparse(key)[source]#
- Parameters:
key (str) – e.g. “data” or “classes”
- Returns:
whether the data is sparse
- Return type:
bool
- get_data_shape(key: str) → List[int] [source]#
- Returns:
get_data(*, key).shape[1:], i.e. num-frames excluded
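Taken together, the introspection methods above describe each data stream; e.g. (a sketch over some Dataset instance):

    # Sketch: print a short description of every available data key.
    for key in dataset.get_data_keys():
        print(key,
              dataset.get_data_dim(key),    # number of classes / feature dim
              dataset.get_data_dtype(key),  # e.g. "int32" or "float32"
              dataset.is_data_sparse(key),  # sparse -> entries are class indices
              dataset.get_data_shape(key))  # shape without the time axis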
- len_info()[source]#
- Returns:
a string to present to the user as information about our len. Depending on the implementation, this can give more or less information.
- Return type:
str
- is_less_than_num_seqs(n)[source]#
- Parameters:
n (int) –
- Returns:
whether n < num_seqs. In case num_seqs is not known in advance, this will wait until it knows that n is beyond the end or that we have the seq.
- Return type:
bool
- serialize_data(key, data)[source]#
In case you have a Vocabulary, just use Vocabulary.get_seq_labels().
- Parameters:
key (str) – e.g. “classes”. self.labels[key] should be set
data (numpy.ndarray) – 0D or 1D
- Return type:
str
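For example, loaded target indices can be turned back into a readable string; a sketch, assuming the dataset has self.labels["classes"] set and the seq is loaded:

    # Sketch: serialize the targets of one loaded sequence.
    classes = dataset.get_data(seq_idx, "classes")  # 1D array of label indices
    print(dataset.serialize_data("classes", classes))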
- iterate_seqs(recurrent_net=True, used_data_keys=None)[source]#
Takes chunking into consideration.
- Parameters:
recurrent_net (bool) – whether the order of frames matter
used_data_keys (set(str)|None) –
- Returns:
generator which yields tuples (seq index, seq start, seq end)
- Return type:
list[(int,NumbersDict,NumbersDict)]
- get_start_end_frames_full_seq(seq_idx)[source]#
- Parameters:
seq_idx (int) –
- Returns:
(start, end) frame, taking context_window into account
- Return type:
(NumbersDict, NumbersDict)
- batch_set_generator_cache_whole_epoch()[source]#
The BatchSetGenerator can cache the list of batches which we generated across epochs. See self.generate_batches() and self._generate_batches(). In many cases, the dataset does not support this, and in that case, there is no need to enable this cache and waste memory. Caching it together with the option shuffle_batches could also mean that there will be self.load_seqs() calls with non-monotonic seq idxs. The only dataset which currently enables this is CachedDataset, and thus HDFDataset.
- Returns:
whether we should enable this cache
- Return type:
bool
- class returnn.datasets.basic.DatasetSeq(seq_idx, features, targets=None, seq_tag=None)[source]#
Encapsulates all data for one sequence.
- Parameters:
seq_idx (int) – sorted seq idx in the Dataset
features (numpy.ndarray|dict[str,numpy.ndarray]) – format 2d (time,feature) (float)
targets (dict[str,numpy.ndarray]|numpy.ndarray|None) – name -> format 1d (time) (idx of output-feature)
seq_tag (str) – sequence name / tag
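Constructing one directly follows the signature above; e.g. a sequence with 100 feature frames and 50 target labels (the shapes and tag are chosen purely for illustration):

    import numpy as np
    from returnn.datasets.basic import DatasetSeq

    seq = DatasetSeq(
        seq_idx=0,
        features=np.zeros((100, 40), dtype="float32"),        # (time, feature)
        targets={"classes": np.zeros((50,), dtype="int32")},  # name -> (time,)
        seq_tag="train-seq-0")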
- returnn.datasets.basic.get_dataset_class(name: str | Type[Dataset]) → Type[Dataset] | None [source]#
- Parameters:
name (str|type) –
- returnn.datasets.basic.init_dataset_via_str(config_str, config=None, cache_byte_size=None, **kwargs)[source]#
- Parameters:
config_str (str) – hdf-files, or “LmDataset:…” or so
config (returnn.config.Config|None) – optional, only for “sprint:…”
cache_byte_size (int|None) – optional, only for HDFDataset
- Return type:
Dataset
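For example, assuming a single HDF file path (the file name is a placeholder):

    from returnn.datasets.basic import init_dataset_via_str

    dataset = init_dataset_via_str("train.hdf")  # placeholder HDF file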
- returnn.datasets.basic.convert_data_dims(data_dims, leave_dict_as_is=False)[source]#
This converts what we called num_outputs originally, from the various formats which were allowed in the past (just an int, or dict[str,int]) into the format which we currently expect. In all cases, the output will be a new copy of the dict.
- Parameters:
data_dims (int|dict[str,int|(int,int)|dict]) – what we called num_outputs originally
leave_dict_as_is (bool) –
- Returns:
dict data-key -> (data-dimension, len(shape) (1 ==> sparse)), or potentially data-key -> dict if leave_dict_as_is is True (for TensorFlow)
- Return type:
dict[str,(int,int)|dict]
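Under the assumption that a plain int marks “data” as dense and other keys as sparse, the conversion would look roughly like this (the exact output shown is an assumption, not verified here):

    from returnn.datasets.basic import convert_data_dims

    dims = convert_data_dims({"data": 40, "classes": 5000})
    # assumed result: {"data": (40, 2), "classes": (5000, 1)}  # 1 ==> sparse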
- returnn.datasets.basic.shapes_for_batches(batches: Sequence[Batch], *, data_keys: Sequence[str], dataset: Dataset | None = None, extern_data: TensorDict | None, enforce_min_len1: bool = False) → Dict[str, List[int]] | None [source]#
- Parameters:
batches –
data_keys –
dataset –
extern_data – detailed data description
enforce_min_len1 –
- returnn.datasets.basic.set_config_extern_data_from_dataset(config, dataset)[source]#
- Parameters:
config (returnn.config.Config) –
dataset (Dataset) –