returnn.datasets.basic
This defines the base dataset class Dataset.
- class returnn.datasets.basic.Dataset(name=None, window=1, context_window=None, chunking=None, seq_ordering='default', fixed_random_seed=None, random_seed_offset=None, partition_epoch=None, repeat_epoch=None, seq_list_filter_file=None, unique_seq_tags=False, seq_order_seq_lens_file=None, shuffle_frames_of_nseqs=0, min_chunk_size=0, chunking_variance=0, estimated_num_seqs=None, _num_shards=1, _shard_index=0)[source]¶
Base class for any dataset. This defines the dataset API.
- Parameters:
name (str) – e.g. “train” or “eval”
window (int) – features will be of dimension window * feature_dim, as we add a context-window around. not all datasets support this option.
context_window (None|int|dict|NumbersDict|(dict,dict)) – will add this context for each chunk
chunking (None|str|int|(int,int)|dict|(dict,dict)|function) – “chunk_size:chunk_step”
seq_ordering (str|function) – “batching”-option in config. e.g. “default”, “sorted” or “random”. See self.get_seq_order_for_epoch() for more details.
fixed_random_seed (int|None) – for the shuffling, e.g. for seq_ordering=’random’. otherwise epoch will be used. useful when used as eval dataset.
random_seed_offset (int|None) – for shuffling, e.g. for seq_ordering=’random’. ignored when fixed_random_seed is set.
partition_epoch (int|None)
repeat_epoch (int|None) – Repeat the sequences in an epoch this many times. Useful to scale the dataset relative to other datasets, e.g. when used in CombinedDataset. Not allowed to be used in combination with partition_epoch.
seq_list_filter_file (str|None) – defines a subset of sequences (by tag) to use
unique_seq_tags (bool) – uniquify seqs with same seq tags in seq order
seq_order_seq_lens_file (str|None) – for seq order, use the seq length given by this file
shuffle_frames_of_nseqs (int) – shuffles the frames. not always supported
estimated_num_seqs (None|int) – for progress reporting in case the real num_seqs is unknown
_num_shards (int) – number of shards the data is split into
_shard_index (int) – local shard index, when sharding is enabled
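Many of these options are typically given via a dataset dict in the config rather than by calling the constructor directly. A minimal sketch, assuming an HDFDataset and a hypothetical file name:

```
# Sketch of a config dataset dict using some of the base-class options above.
# "train.hdf" is a hypothetical file name; any Dataset subclass takes the same base options.
train = {
    "class": "HDFDataset",
    "files": ["train.hdf"],
    "partition_epoch": 5,             # split the corpus into 5 sub-epochs
    "seq_ordering": "laplace:.1000",  # length-based shuffled ordering
    "chunking": "100:50",             # "chunk_size:chunk_step"
}
```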
- static kwargs_update_from_config(config: Config, kwargs: Dict[str, Any])[source]¶
Update kwargs inplace from config
- Parameters:
config
kwargs – updates will be done inplace
- static get_default_kwargs_eval(config: Config) Dict[str, Any] [source]¶
- Parameters:
config
- Returns:
default kwargs for an eval dataset based on the config
- classmethod from_config(config: Config, **kwargs) Dataset [source]¶
- Parameters:
config
kwargs – passed on to __init__
- Returns:
new dataset via cls(…)
- set_file_cache(cache: FileCache)[source]¶
Stores the given file cache with the dataset, so that the files within the cache can be unregistered when the dataset is deinitialized.
- is_cached(start, end)[source]¶
- Parameters:
start (int) – like in load_seqs(), sorted seq idx
end (int) – like in load_seqs(), sorted seq idx
- Return type:
bool
- Returns:
whether we have the full range (start,end) of sorted seq idx
- get_seq_length(seq_idx: int) NumbersDict [source]¶
- Parameters:
seq_idx
- Returns:
the len of the input features and the len of the target sequence
- get_estimated_seq_length(seq_idx)[source]¶
In contrast to self.get_seq_length(), this method is designed to work for sequences that have not been loaded yet via self.load_seqs(). Used by meta-datasets for sequence ordering. Currently we only provide one number, i.e. do not give different estimates for the different data keys (as in get_seq_length()). It is up to the dataset what this number represents and how it is computed.
- Parameters:
seq_idx (int) – for current epoch, not the corpus seq idx
- Return type:
int
- Returns:
sequence length estimate (for sorting)
- load_seqs(start, end)[source]¶
Load data sequences, such that self.get_data() & friends can return the data.
- Parameters:
start (int) – start sorted seq idx, inclusive
end (int) – end sorted seq idx, exclusive
- get_seq_order_for_epoch(epoch: int | None, num_seqs: int, get_seq_len: Callable[[int], int] | None = None) Sequence[int] [source]¶
Returns the order of the given epoch. This is mostly a static method, except that it depends on the configured type of ordering, such as ‘default’ (= as-is), ‘sorted’ or ‘random’. ‘sorted’ also uses the sequence length.
- Parameters:
epoch – for ‘random’, this determines the random seed
num_seqs
get_seq_len – function (originalSeqIdx: int) -> int
- Returns:
the order for the given epoch. such that seq_idx -> underlying idx
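A typical use is inside a Dataset subclass’ init_seq_order(). A rough sketch, where _num_corpus_seqs and _seq_lens are hypothetical internal attributes of the subclass (the seq_list case is omitted for brevity):

```
# Sketch: how a subclass might compute its seq order for an epoch.
# self._num_corpus_seqs and self._seq_lens are hypothetical attributes.
def init_seq_order(self, epoch=None, seq_list=None, seq_order=None):
    super().init_seq_order(epoch=epoch, seq_list=seq_list, seq_order=seq_order)
    if seq_order is not None:
        self._seq_order = seq_order
    else:
        self._seq_order = self.get_seq_order_for_epoch(
            epoch=epoch,
            num_seqs=self._num_corpus_seqs,
            get_seq_len=lambda i: self._seq_lens[i])
    return True
```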
- supports_sharding() bool [source]¶
- Returns:
whether the dataset supports sharding based on the seq_order
- init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]¶
- Parameters:
seq_list (list[str]|None) – List of sequence tags, to set a predefined order.
seq_order (list[int]|None) – List of corpus sequence indices, to set a predefined order. Only possible if the dataset has such indices (see self.have_corpus_seq_idx()).
- Return type:
bool
- Returns:
whether the order changed (True is always safe to return)
This is called when we start a new epoch, or at initialization. Call this when you reset the seq list.
- finish_epoch(*, free_resources: bool = False)[source]¶
This would get called at the end of the epoch (currently optional only). After this, further calls to get_data() or load_seqs() are invalid, until a new call to init_seq_order() follows.
- get_current_seq_order()[source]¶
- Returns:
Many datasets use self.get_seq_order_for_epoch(). This function returns the current seq order for the current epoch, after self.init_seq_order() was called. Not all datasets implement this.
- Return type:
Sequence[int]
- supports_seq_order_sorting() bool [source]¶
- Returns:
whether “sorted” or “sorted_reverse” is supported for seq_ordering
- initialize()[source]¶
Does the main initialization before it can be used. This needs to be called before self.load_seqs() can be used.
- get_data(seq_idx, key) ndarray [source]¶
- Parameters:
seq_idx (int) – sorted seq idx
key (str) – data-key, e.g. “data” or “classes”
- Returns:
features or targets: format 2d (time,feature) (float)
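The usual sequential read pattern combines init_seq_order(), is_less_than_num_seqs(), load_seqs() and get_data(). A minimal sketch; the data keys “data” and “classes” are assumptions here, check get_data_keys() for the actual keys:

```
# Sketch of the standard sequential read loop over one epoch.
dataset.init_seq_order(epoch=1)
seq_idx = 0
while dataset.is_less_than_num_seqs(seq_idx):
    dataset.load_seqs(seq_idx, seq_idx + 1)
    features = dataset.get_data(seq_idx, "data")    # e.g. shape (time, feature)
    targets = dataset.get_data(seq_idx, "classes")  # e.g. shape (time,) if sparse
    seq_idx += 1
dataset.finish_epoch()  # optional
```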
- get_input_data(sorted_seq_idx: int) ndarray [source]¶
DEPRECATED: Some older classes still use this deprecated API, but any new dataset should just implement get_data(), and users also should just use get_data(). This default implementation assumes that there is a “data” data key, which is not necessarily true in all cases.
- Parameters:
sorted_seq_idx
- Returns features:
format 2d (time,feature) (float)
- get_targets(target: str, sorted_seq_idx: int) ndarray [source]¶
DEPRECATED: Some older classes still use this deprecated API, but any new dataset should just implement get_data(), and users also should just use get_data().
- Parameters:
target – data key
sorted_seq_idx
- Returns targets:
format 1d (time) (int: idx of output-feature)
- get_data_slice(seq_idx, key, start_frame, end_frame)[source]¶
- Parameters:
seq_idx (int)
key (str)
start_frame (int)
end_frame (int)
- Returns:
x[start_frame:end_frame], with x = get_data(seq_idx, key)
- Return type:
numpy.ndarray
- get_all_tags()[source]¶
- Returns:
list of all seq tags, of the whole dataset, without partition epoch. Note that this is not possible with all datasets.
- Return type:
list[str]
- get_total_num_seqs(*, fast: bool = False) int [source]¶
- Parameters:
fast – if True, might raise an exception if not possible to get fast.
- Returns:
total number of seqs, without partition epoch. Should be the same as len(self.get_all_tags()). Note that this is not possible with all datasets.
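A small sketch of corpus-level introspection (only possible for datasets which can enumerate the whole corpus):

```
# Sketch: both calls ignore partition_epoch and refer to the whole corpus.
tags = dataset.get_all_tags()
assert len(tags) == dataset.get_total_num_seqs()
```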
- have_corpus_seq_idx()[source]¶
- Return type:
bool
- Returns:
whether you can call self.get_corpus_seq_idx()
- get_corpus_seq_idx(seq_idx)[source]¶
- Parameters:
seq_idx (int) – sorted sequence index from the current epoch, depending on seq_ordering
- Returns:
the sequence index as-is in the original corpus (as if you would have sorting=”default”). only defined if self.have_corpus_seq_idx()
- Return type:
int
- have_get_corpus_seq() bool [source]¶
- Returns:
whether you can call get_corpus_seq()
- get_corpus_seq(corpus_seq_idx: int) DatasetSeq [source]¶
This function allows random access directly into the corpus. Only implement this if such random access is possible in a reasonably efficient way. This allows writing map-style wrapper datasets around such RETURNN datasets.
- Parameters:
corpus_seq_idx – corresponds to the output of get_corpus_seq_idx()
- Returns:
data
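A sketch of map-style random access, guarded by have_get_corpus_seq():

```
# Sketch: random access into the corpus, independent of the current seq_ordering.
if dataset.have_get_corpus_seq():
    seq = dataset.get_corpus_seq(0)  # corpus seq idx 0
    print(seq.seq_idx, seq.seq_tag)
```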
- classmethod generic_complete_frac(seq_idx, num_seqs)[source]¶
- Parameters:
seq_idx (int) – idx
num_seqs (int|None) – None if not available
- Returns:
a fraction (float in [0,1], always > 0) of how far we have advanced for this seq in the dataset. This does not have to be exact. This is only for the user.
- get_complete_frac(sorted_seq_idx: int, *, allow_only_lr_suitable: bool = False) float | None [source]¶
Tries to calculate exactly how much of the current epoch is completed when having processed seq sorted_seq_idx. sorted_seq_idx cannot be less than the seq index of the previously loaded seqs.
- Parameters:
sorted_seq_idx – sorted seq idx
allow_only_lr_suitable – only return a value when that value is suitable/accurate enough to base LR scheduling on. If false, this function will return an approximate value when the exact value cannot be calculated (due to unknown num_seqs). Approximate values can be appropriate for e.g. progress bars.
- Returns:
continuous value in (0, 1] which represents how much of the current epoch is completed after sorted_seq_idx. If allow_only_lr_suitable=True, returns None if the value cannot be calculated accurately enough for LR scheduling; otherwise epoch_continuous is based on it for any dynamic learning rate scheduling. As sorted_seq_idx is monotonic, the return value is also guaranteed to be monotonic.
- property estimated_num_seqs[source]¶
- Returns:
estimated num seqs. does not have to be exact
- Return type:
int|None
- get_data_keys()[source]¶
- Returns:
all available data keys (for get_data and all other functions)
- Return type:
list[str]
- get_target_list()[source]¶
- Returns:
subset of get_data_keys(). Target keys are usually not available during inference.
- Return type:
list[str]
- get_data_dim(key)[source]¶
- Parameters:
key (str) – e.g. “data” or “classes”
- Returns:
number of classes, no matter if sparse or not
- Return type:
int
- get_data_dtype(key)[source]¶
- Parameters:
key (str) – e.g. “data” or “classes”
- Returns:
dtype as str, e.g. “int32” or “float32”
- Return type:
str
- is_data_sparse(key)[source]¶
- Parameters:
key (str) – e.g. “data” or “classes”
- Returns:
whether the data is sparse
- Return type:
bool
- get_data_shape(key: str) List[int] [source]¶
- Returns:
get_data(*, key).shape[1:], i.e. num-frames excluded
- len_info(*, fast: bool = False) str [source]¶
- Returns:
string to present the user as information about our len.
- is_less_than_num_seqs(n)[source]¶
- Return type:
bool
- Returns:
whether n < num_seqs. In case num_seqs is not known in advance, it will wait until it knows that n is behind the end or that we have the seq.
- can_serialize_data(key: str) bool [source]¶
- Parameters:
key – e.g. “classes”
- Returns:
whether serialize_data() is implemented for this key
- serialize_data(key: str, data: ndarray) str [source]¶
In case you have a Vocabulary, just use Vocabulary.get_seq_labels().
- Parameters:
key – e.g. “classes”. self.labels[key] should be set
data (numpy.ndarray) – 0D or 1D
- Returns:
serialized data
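A sketch of typical usage, assuming the dataset defines labels for the “classes” key and seq_idx comes from a read loop as for get_data() above:

```
# Sketch: turn a sparse target sequence back into a readable string, if supported.
if dataset.can_serialize_data("classes"):
    targets = dataset.get_data(seq_idx, "classes")
    print(dataset.serialize_data("classes", targets))
```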
- iterate_seqs(recurrent_net=True, used_data_keys=None)[source]¶
Takes chunking into consideration.
- Parameters:
recurrent_net (bool) – whether the order of frames matter
used_data_keys (set(str)|None)
- Returns:
generator which yields tuples (seq index, seq start, seq end)
- Return type:
list[(int,NumbersDict,NumbersDict)]
- get_start_end_frames_full_seq(seq_idx)[source]¶
- Parameters:
seq_idx (int)
- Returns:
(start,end) frame, taking context_window into account
- Return type:
(NumbersDict,NumbersDict)
- batch_set_generator_cache_whole_epoch()[source]¶
The BatchSetGenerator can cache the list of batches which we generated across epochs. See self.generate_batches() and self._generate_batches(). In many cases, the dataset does not support this, and in that case there is no need to enable this cache and waste memory. Caching it together with the option shuffle_batches could also mean that there will be self.load_seqs() calls with non-monotonic seq-idxs. The only dataset which currently enables this is CachedDataset, and thus HDFDataset.
- Returns:
whether we should enable this cache
- Return type:
bool
- class returnn.datasets.basic.DatasetSeq(seq_idx: int, features, *, targets=None, seq_tag: str | None = None, complete_frac: float | None = None)[source]¶
Encapsulates all data for one sequence.
- Parameters:
seq_idx – sorted seq idx in the Dataset
features (numpy.ndarray|dict[str,numpy.ndarray]) – format 2d (time,feature) (float)
targets (dict[str,numpy.ndarray]|numpy.ndarray|None) – name -> format 1d (time) (idx of output-feature)
seq_tag – sequence name / tag
complete_frac – continuous value in (0, 1] which represents how much of the current epoch has been consumed when this seq is processed
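A minimal construction sketch with made-up shapes (100 frames of 40-dim features, 20 sparse target labels):

```
import numpy
from returnn.datasets.basic import DatasetSeq

# Sketch: one sequence with dense features and sparse targets.
seq = DatasetSeq(
    seq_idx=0,
    features=numpy.zeros((100, 40), dtype="float32"),
    targets={"classes": numpy.zeros((20,), dtype="int32")},
    seq_tag="seq-0",
)
```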
- returnn.datasets.basic.get_dataset_class(name: str | Type[Dataset]) Type[Dataset] | None [source]¶
- Parameters:
name (str|type)
- returnn.datasets.basic.init_dataset(kwargs: Dict[str, Any] | str | Callable[[], Dict[str, Any]] | Dataset, extra_kwargs: Dict[str, Any] | None = None, default_kwargs: Dict[str, Any] | None = None, *, parent_dataset: Dataset | None = None) Dataset [source]¶
- Parameters:
kwargs
extra_kwargs
default_kwargs
parent_dataset – if given, will adapt some of the default_kwargs (when not set)
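Typical usage is to pass a dataset dict as it would appear in a config. A sketch; the HDFDataset class and the file name are only examples:

```
from returnn.datasets.basic import init_dataset

# Sketch: create and prepare a dataset from a dict.
dataset = init_dataset({"class": "HDFDataset", "files": ["train.hdf"]})
dataset.init_seq_order(epoch=1)
```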
- returnn.datasets.basic.extend_dataset_dict_from_parent_dataset(dataset_dict: Dict[str, Any], parent_dataset: Dataset | None) Dict[str, Any] [source]¶
- Parameters:
dataset_dict
parent_dataset
- Returns:
extended dataset_dict
- returnn.datasets.basic.init_dataset_via_str(config_str, config=None, cache_byte_size=None, **kwargs)[source]¶
- Parameters:
config_str (str) – hdf-files, or “LmDataset:…” or so
config (returnn.config.Config|None) – optional, only for “sprint:…”
cache_byte_size (int|None) – optional, only for HDFDataset
- Return type:
Dataset
- returnn.datasets.basic.convert_data_dims(data_dims, leave_dict_as_is=False)[source]¶
This converts what we called num_outputs originally, from the various formats which were allowed in the past (just an int, or dict[str,int]) into the format which we currently expect. In all cases, the output will be a new copy of the dict.
- Parameters:
data_dims (int|dict[str,int|(int,int)|dict]) – what we called num_outputs originally
leave_dict_as_is (bool)
- Return type:
dict[str,(int,int)|dict]
- Returns:
dict data-key -> (data-dimension, len(shape) (1 ==> sparse)) (or potentially data-key -> dict, if leave_dict_as_is is True; for TensorFlow)
- returnn.datasets.basic.shapes_for_batches(batches: Sequence[Batch], *, data_keys: Sequence[str], dataset: Dataset | None = None, extern_data: TensorDict | None, enforce_min_len1: bool = False) Dict[str, List[int]] | None [source]¶
- Parameters:
batches
data_keys
dataset
extern_data – detailed data description
enforce_min_len1
- returnn.datasets.basic.set_config_extern_data_from_dataset(config, dataset)[source]¶
- Parameters:
config (returnn.config.Config)
dataset (Dataset)