returnn.datasets.basic

This defines the base dataset class Dataset.

returnn.datasets.basic.random() → x in the interval [0, 1).[source]
class returnn.datasets.basic.Dataset(name=None, window=1, context_window=None, chunking=None, seq_ordering='default', fixed_random_seed=None, random_seed_offset=None, partition_epoch=None, repeat_epoch=None, seq_list_filter_file=None, unique_seq_tags=False, seq_order_seq_lens_file=None, shuffle_frames_of_nseqs=0, min_chunk_size=0, chunking_variance=0, estimated_num_seqs=None)[source]

Base class for any dataset. This defines the dataset API.

Parameters:
  • name (str) – e.g. “train” or “eval”

  • window (int) – features will have dimension window * feature_dim, because a context window is added around each frame. Not all datasets support this option.

  • context_window (None|int|dict|NumbersDict|(dict,dict)) – will add this context for each chunk

  • chunking (None|str|int|(int,int)|dict|(dict,dict)|function) – “chunk_size:chunk_step”

  • seq_ordering (str) – “batching”-option in config. e.g. “default”, “sorted” or “random”. See self.get_seq_order_for_epoch() for more details.

  • fixed_random_seed (int|None) – seed for the shuffling, e.g. for seq_ordering=“random”. Otherwise the epoch will be used as the seed. Useful when used as eval dataset.

  • random_seed_offset (int|None) – for shuffling, e.g. for seq_ordering=“random”. Ignored when fixed_random_seed is set.

  • partition_epoch (int|None)

  • repeat_epoch (int|None) – Repeat the sequences in an epoch this many times. Useful to scale the dataset relative to other datasets, e.g. when used in CombinedDataset. Not allowed to be used in combination with partition_epoch.

  • seq_list_filter_file (str|None) – defines a subset of sequences (by tag) to use

  • unique_seq_tags (bool) – uniquify seqs with same seq tags in seq order

  • seq_order_seq_lens_file (str|None) – for seq order, use the seq length given by this file

  • shuffle_frames_of_nseqs (int) – shuffles the frames. not always supported

  • estimated_num_seqs (None|int) – for progress reporting in case the real num_seqs is unknown
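The effect of window on the feature dimension can be pictured with a small sketch. The helper apply_window is hypothetical and not part of RETURNN; centering the window and zero-padding at the sequence boundaries are assumptions made for illustration, and the actual behavior depends on the dataset.

```python
from typing import List


def apply_window(frames: List[List[float]], window: int) -> List[List[float]]:
    """Concatenate each frame with its neighbors (hypothetical helper).

    Assumption: the window is centered and out-of-range frames are zeros.
    """
    if window == 1:
        return [list(f) for f in frames]
    feature_dim = len(frames[0]) if frames else 0
    left = window // 2
    out = []
    for t in range(len(frames)):
        row: List[float] = []
        for offset in range(-left, window - left):
            i = t + offset
            if 0 <= i < len(frames):
                row.extend(frames[i])
            else:
                row.extend([0.0] * feature_dim)  # zero padding at the edges
        out.append(row)
    return out


frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
windowed = apply_window(frames, window=3)
# Every output frame has dimension window * feature_dim = 3 * 2 = 6.
```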

static kwargs_update_from_config(config: Config, kwargs: Dict[str, Any])[source]

Updates kwargs in place from config.

Parameters:
  • config

  • kwargs – updates will be done in place

static get_default_kwargs_eval(config: Config) Dict[str, Any][source]
Parameters:

config

Returns:

default kwargs for an eval dataset based on the config

classmethod from_config(config: Config, **kwargs) Dataset[source]
Parameters:
  • config

  • kwargs – passed on to __init__

Returns:

new dataset via cls(…)

is_cached(start, end)[source]
Parameters:
  • start (int) – like in load_seqs(), sorted seq idx

  • end (int) – like in load_seqs(), sorted seq idx

Return type:

bool

Returns:

whether we have the full range (start,end) of sorted seq idx.

get_seq_length(seq_idx: int) NumbersDict[source]
Parameters:

seq_idx

Returns:

the len of the input features and the len of the target sequence.

get_estimated_seq_length(seq_idx)[source]

In contrast to self.get_seq_length(), this method is designed to work for sequences that have not been loaded yet via self.load_seqs(). Used by meta-datasets for sequence ordering. Currently we only provide one number, i.e. do not give different estimates for the different data keys (as in get_seq_length()). It is up to the dataset what this number represents and how it is computed.

Parameters:

seq_idx (int) – for current epoch, not the corpus seq idx

Return type:

int

Returns:

sequence length estimate (for sorting)

get_num_timesteps()[source]
Return type:

int

load_seqs(start, end)[source]

Load data sequences, such that self.get_data() & friends can return the data.

Parameters:
  • start (int) – start sorted seq idx, inclusive

  • end (int) – end sorted seq idx, exclusive

get_seq_order_for_epoch(epoch: int | None, num_seqs: int, get_seq_len: Callable[[int], int] | None = None) Sequence[int][source]

Returns the order of the given epoch. This is mostly a static method, except that it depends on the configured type of ordering, such as ‘default’ (= as-is), ‘sorted’ or ‘random’. ‘sorted’ also uses the sequence length.

Parameters:
  • epoch – for ‘random’, this determines the random seed

  • num_seqs

  • get_seq_len – function (originalSeqIdx: int) -> int

Returns:

the order for the given epoch. such that seq_idx -> underlying idx
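The three orderings named above can be sketched as follows. This is a simplified illustration, not RETURNN's implementation: the function toy_seq_order is hypothetical, takes the ordering as an explicit argument instead of reading it from the dataset, and assumes that the epoch alone is used as the random seed.

```python
import random
from typing import Callable, List, Optional


def toy_seq_order(
    seq_ordering: str,
    epoch: Optional[int],
    num_seqs: int,
    get_seq_len: Optional[Callable[[int], int]] = None,
) -> List[int]:
    """Return a list mapping seq_idx -> underlying idx (hypothetical sketch)."""
    order = list(range(num_seqs))
    if seq_ordering == "default":
        return order  # as-is
    if seq_ordering == "sorted":
        assert get_seq_len is not None, "'sorted' needs the sequence lengths"
        return sorted(order, key=get_seq_len)
    if seq_ordering == "random":
        # Assumption: the epoch alone is used as the seed.
        rnd = random.Random(epoch or 1)
        rnd.shuffle(order)
        return order
    raise ValueError(f"unsupported seq_ordering {seq_ordering!r}")


seq_lens = [7, 3, 5, 1]
sorted_order = toy_seq_order("sorted", epoch=1, num_seqs=4, get_seq_len=seq_lens.__getitem__)
random_order = toy_seq_order("random", epoch=2, num_seqs=4)
```

Seeding from the epoch makes the shuffled order reproducible: asking for the same epoch twice yields the same permutation.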

init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]
Parameters:
  • seq_list (list[str]|None) – List of sequence tags, to set a predefined order.

  • seq_order (list[int]|None) – List of corpus sequence indices, to set a predefined order. Only possible if the dataset has such indices (see self.have_corpus_seq_idx()).

Return type:

bool

Returns:

whether the order changed (True is always safe to return)

This is called when we start a new epoch, or at initialization. Call this when you reset the seq list.

finish_epoch(*, free_resources: bool = False)[source]

This is called at the end of the epoch (calling it is currently optional). After this, further calls to get_data() or load_seqs() are invalid until init_seq_order() is called again.
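The call order implied by init_seq_order(), load_seqs(), get_data() and finish_epoch() can be sketched with a toy stand-in. ToyDataset is hypothetical and does not subclass the real Dataset; it keeps everything in memory and only mimics the protocol (exclusive end index, invalid calls after finish_epoch()).

```python
from typing import Dict, List, Optional


class ToyDataset:
    """Hypothetical in-memory stand-in that mimics the call order only."""

    def __init__(self, corpus: List[Dict[str, List[int]]]):
        self._corpus = corpus
        self._seq_order: Optional[List[int]] = None
        self._loaded: Dict[int, Dict[str, List[int]]] = {}

    def init_seq_order(self, epoch: Optional[int] = None) -> bool:
        self._seq_order = list(range(len(self._corpus)))  # "default" ordering
        self._loaded.clear()
        return True  # True is always safe to return

    def load_seqs(self, start: int, end: int) -> None:
        assert self._seq_order is not None, "call init_seq_order() first"
        # start is inclusive, end is exclusive
        for seq_idx in range(start, min(end, len(self._seq_order))):
            self._loaded[seq_idx] = self._corpus[self._seq_order[seq_idx]]

    def get_data(self, seq_idx: int, key: str) -> List[int]:
        return self._loaded[seq_idx][key]

    def finish_epoch(self) -> None:
        self._seq_order = None  # further load_seqs() calls are invalid now
        self._loaded.clear()


dataset = ToyDataset([{"data": [1, 2, 3]}, {"data": [4, 5]}])
dataset.init_seq_order(epoch=1)
dataset.load_seqs(0, 2)
first = dataset.get_data(0, "data")
dataset.finish_epoch()
```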

get_current_seq_order()[source]
Returns:

the current seq order for the current epoch, after self.init_seq_order was called. Many datasets use self.get_seq_order_for_epoch to compute it. Not all datasets implement this.

Return type:

Sequence[int]

supports_seq_order_sorting() bool[source]
Returns:

whether “sorted” or “sorted_reverse” is supported for seq_ordering

initialize()[source]

Does the main initialization before it can be used. This needs to be called before self.load_seqs() can be used.

get_times(sorted_seq_idx)[source]
Parameters:

sorted_seq_idx (int)

get_data(seq_idx, key) ndarray[source]
Parameters:
  • seq_idx (int) – sorted seq idx

  • key (str) – data-key, e.g. “data” or “classes”

Returns:

features or targets: format 2d (time,feature) (float)

get_input_data(sorted_seq_idx: int) ndarray[source]

DEPRECATED: Some older classes still use this deprecated API, but any new dataset should just implement get_data(), and users also should just use get_data().

This default implementation assumes that there is a “data” data key, which is not necessarily true in all cases.

Parameters:

sorted_seq_idx

Returns:

features, format 2d (time,feature) (float)

get_targets(target: str, sorted_seq_idx: int) ndarray[source]

DEPRECATED: Some older classes still use this deprecated API, but any new dataset should just implement get_data(), and users also should just use get_data().

Parameters:
  • target – data key

  • sorted_seq_idx

Returns:

targets, format 1d (time) (int: idx of output-feature)

get_data_slice(seq_idx, key, start_frame, end_frame)[source]
Parameters:
  • seq_idx (int)

  • key (str)

  • start_frame (int)

  • end_frame (int)

Returns:

x[start_frame:end_frame], with x = get_data(seq_idx, key)

Return type:

numpy.ndarray
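The relation between get_data_slice() and get_data() stated above amounts to plain slicing along the time axis. A minimal sketch, using a hypothetical in-memory store and nested lists in place of numpy arrays:

```python
from typing import Dict, List

# Hypothetical in-memory store: seq_idx -> data key -> list of frames.
_store: Dict[int, Dict[str, List[List[float]]]] = {
    0: {"data": [[0.1], [0.2], [0.3], [0.4]]},
}


def get_data(seq_idx: int, key: str) -> List[List[float]]:
    return _store[seq_idx][key]


def get_data_slice(seq_idx: int, key: str, start_frame: int, end_frame: int) -> List[List[float]]:
    # x[start_frame:end_frame], with x = get_data(seq_idx, key)
    return get_data(seq_idx, key)[start_frame:end_frame]


sliced = get_data_slice(0, "data", 1, 3)
```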

get_tag(sorted_seq_idx)[source]
Parameters:

sorted_seq_idx (int)

Return type:

str

get_all_tags()[source]
Returns:

list of all seq tags, of the whole dataset, without partition epoch. Note that this is not possible with all datasets.

Return type:

list[str]

get_total_num_seqs() int[source]
Returns:

total number of seqs, without partition epoch. Should be the same as len(self.get_all_tags()). Note that this is not possible with all datasets.

have_corpus_seq_idx()[source]
Return type:

bool

Returns:

whether you can call self.get_corpus_seq_idx()

get_corpus_seq_idx(seq_idx)[source]
Parameters:

seq_idx (int) – sorted sequence index from the current epoch, depending on seq_ordering

Returns:

the sequence index as-is in the original corpus (as if you had seq_ordering=“default”). Only defined if self.have_corpus_seq_idx() returns True.

Return type:

int

have_get_corpus_seq() bool[source]
Returns:

whether you can call get_corpus_seq()

get_corpus_seq(corpus_seq_idx: int) DatasetSeq[source]

This function allows random access directly into the corpus. Only implement this if such random access is possible in a reasonably efficient way. This makes it possible to write map-style wrapper datasets around such RETURNN datasets.

Parameters:

corpus_seq_idx – corresponds to output of get_corpus_seq_idx()

Returns:

data

classmethod generic_complete_frac(seq_idx, num_seqs)[source]
Parameters:
  • seq_idx (int) – idx

  • num_seqs (int|None) – None if not available

Returns:

a fraction (float in [0,1], always > 0) of how far we have advanced for this seq in the dataset. This does not have to be exact. This is only for the user.
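One way to produce such a fraction is sketched below. The function toy_complete_frac is hypothetical; in particular, the fallback used when num_seqs is None is only an illustrative choice of a value that is positive, grows with seq_idx and stays below 1.

```python
import math
from typing import Optional


def toy_complete_frac(seq_idx: int, num_seqs: Optional[int]) -> float:
    """Progress estimate in (0, 1] (hypothetical sketch)."""
    if num_seqs:
        return min((seq_idx + 1) / num_seqs, 1.0)
    # num_seqs unknown: illustrative monotonically increasing value below 1.
    return max(1e-10, 1.0 - math.exp(-seq_idx / 1000.0))


quarter = toy_complete_frac(0, 4)
unknown = toy_complete_frac(10, None)
```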

get_complete_frac(seq_idx)[source]
Parameters:

seq_idx (int)

Returns:

a fraction (float in [0,1], always > 0) of how far we have advanced for this seq in the dataset. This does not have to be exact. This is only for the user.

Return type:

float

property num_seqs: int[source]
Returns:

num seqs for current epoch

property estimated_num_seqs[source]
Returns:

estimated num seqs. does not have to be exact

Return type:

int|None

get_data_keys()[source]
Returns:

all available data keys (for get_data and all other functions)

Return type:

list[str]

get_target_list()[source]
Returns:

subset of get_data_keys(). target keys are usually not available during inference

Return type:

list[str]

get_data_dim(key)[source]
Parameters:

key (str) – e.g. “data” or “classes”

Returns:

number of classes, no matter if sparse or not

Return type:

int

get_data_dtype(key)[source]
Parameters:

key (str) – e.g. “data” or “classes”

Returns:

dtype as str, e.g. “int32” or “float32”

Return type:

str

is_data_sparse(key)[source]
Parameters:

key (str) – e.g. “data” or “classes”

Returns:

whether the data is sparse

Return type:

bool

get_data_shape(key: str) List[int][source]

Returns:

get_data(*, key).shape[1:], i.e. num-frames excluded

have_seqs() bool[source]
Returns:

whether num_seqs > 0

len_info()[source]
Return type:

str

Returns:

a string to present to the user as information about our len. Depending on our implementation, we can give more or less information.

is_less_than_num_seqs(n)[source]
Return type:

bool

Returns:

whether n < num_seqs. In case num_seqs is not known in advance, it will wait until it knows that n is behind the end or that we have the seq.

can_serialize_data(key)[source]
Parameters:

key (str) – e.g. “classes”

Return type:

bool

serialize_data(key, data)[source]

In case you have a Vocabulary, just use Vocabulary.get_seq_labels().

Parameters:
  • key (str) – e.g. “classes”. self.labels[key] should be set

  • data (numpy.ndarray) – 0D or 1D

Return type:

str

iterate_seqs(recurrent_net=True, used_data_keys=None)[source]

Takes chunking into consideration.

Parameters:
  • recurrent_net (bool) – whether the order of frames matters

  • used_data_keys (set(str)|None)

Returns:

generator which yields tuples (seq index, seq start, seq end)

Return type:

list[(int,NumbersDict,NumbersDict)]
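How chunking turns sequences into (seq index, seq start, seq end) tuples can be sketched as follows. This is a simplified illustration with plain ints instead of NumbersDict and a single data key; toy_iterate_chunks is hypothetical and ignores options such as min_chunk_size, chunking_variance and context_window.

```python
from typing import Iterator, List, Tuple


def toy_iterate_chunks(
    seq_lens: List[int], chunk_size: int, chunk_step: int
) -> Iterator[Tuple[int, int, int]]:
    """Yield (seq_idx, start_frame, end_frame) per chunk (hypothetical sketch)."""
    for seq_idx, seq_len in enumerate(seq_lens):
        if chunk_size == 0:
            yield seq_idx, 0, seq_len  # no chunking: the full sequence
            continue
        assert chunk_step > 0, "chunk_step must be positive"
        start = 0
        while start < seq_len:
            # The last chunk may be shorter than chunk_size.
            yield seq_idx, start, min(start + chunk_size, seq_len)
            start += chunk_step


# Parse the "chunk_size:chunk_step" string form of the chunking option.
chunk_size, chunk_step = (int(part) for part in "4:2".split(":"))
chunks = list(toy_iterate_chunks([5], chunk_size, chunk_step))
```

With chunk_step smaller than chunk_size, consecutive chunks overlap, so a frame can appear in more than one chunk.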

get_start_end_frames_full_seq(seq_idx)[source]
Parameters:

seq_idx (int)

Returns:

(start,end) frame, taking context_window into account

Return type:

(NumbersDict,NumbersDict)

sample(seq_idx)[source]
Parameters:

seq_idx (int)

Return type:

bool

batch_set_generator_cache_whole_epoch()[source]

The BatchSetGenerator can cache the list of batches which we generated across epochs. See self.generate_batches() and self._generate_batches(). In many cases, the dataset does not support this, and then there is no need to enable this cache and waste memory. Caching it together with the option shuffle_batches could also mean that there will be self.load_seqs() calls with non-monotonic seq-idxs. Currently, the only dataset which enables this is CachedDataset, and thus also HDFDataset.

Returns:

whether we should enable this cache

Return type:

bool

generate_batches(shuffle_batches=False, **kwargs)[source]
Parameters:
  • shuffle_batches (bool)

  • kwargs – will be passed to _generate_batches()

Return type:

BatchSetGenerator

classmethod index_shape_for_batches(batches, data_key='data')[source]
Parameters:
  • batches (list[EngineBatch.Batch])

  • data_key (str)

Returns:

shape as (time, batch)

Return type:

(int, int)

class returnn.datasets.basic.DatasetSeq(seq_idx, features, targets=None, seq_tag=None)[source]

Encapsulates all data for one sequence.

Parameters:
  • seq_idx (int) – sorted seq idx in the Dataset

  • features (numpy.ndarray|dict[str,numpy.ndarray]) – format 2d (time,feature) (float)

  • targets (dict[str,numpy.ndarray]|numpy.ndarray|None) – name -> format 1d (time) (idx of output-feature)

  • seq_tag (str) – sequence name / tag

property num_frames[source]
Return type:

NumbersDict

get_data(key)[source]
Parameters:

key (str)

Return type:

numpy.ndarray

get_data_keys()[source]
Return type:

set[str]

returnn.datasets.basic.get_dataset_class(name: str | Type[Dataset]) Type[Dataset] | None[source]
Parameters:

name (str|type)

returnn.datasets.basic.init_dataset(kwargs: Dict[str, Any] | str | Callable[[], Dict[str, Any]] | Dataset, extra_kwargs: Dict[str, Any] | None = None, default_kwargs: Dict[str, Any] | None = None, *, parent_dataset: Dataset | None = None) Dataset[source]
Parameters:
  • kwargs

  • extra_kwargs

  • default_kwargs

  • parent_dataset – if given, will adapt some of the default_kwargs (when not set)

returnn.datasets.basic.extend_dataset_dict_from_parent_dataset(dataset_dict: Dict[str, Any], parent_dataset: Dataset | None) Dict[str, Any][source]
Parameters:
  • dataset_dict

  • parent_dataset

Returns:

extended dataset_dict

returnn.datasets.basic.init_dataset_via_str(config_str, config=None, cache_byte_size=None, **kwargs)[source]
Parameters:
  • config_str (str) – hdf-files, or “LmDataset:…” or so

  • config (returnn.config.Config|None) – optional, only for “sprint:…”

  • cache_byte_size (int|None) – optional, only for HDFDataset

Return type:

Dataset

returnn.datasets.basic.convert_data_dims(data_dims, leave_dict_as_is=False)[source]

This converts what we called num_outputs originally, from the various formats which were allowed in the past (just an int, or dict[str,int]) into the format which we currently expect. In all cases, the output will be a new copy of the dict.

Parameters:
  • data_dims (int|dict[str,int|(int,int)|dict]) – what we called num_outputs originally

  • leave_dict_as_is (bool)

Return type:

dict[str,(int,int)|dict]

Returns:

dict data-key -> (data-dimension, len(shape) (1 ==> sparse)) (or potentially data-key -> dict, if leave_dict_as_is is True; for TensorFlow)
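The conversion can be pictured with a small sketch. toy_convert_data_dims is hypothetical and covers only the int and tuple cases; treating a bare int as the “classes” key, and treating “data” as dense while all other keys are sparse, are assumptions made for illustration.

```python
from typing import Any, Dict, Union


def toy_convert_data_dims(data_dims: Union[int, Dict[str, Any]]) -> Dict[str, Any]:
    """Normalize to data-key -> (dimension, len(shape)) (hypothetical sketch)."""
    if isinstance(data_dims, int):
        # Assumption: a bare int describes the "classes" target.
        data_dims = {"classes": data_dims}
    out = dict(data_dims)  # always a new copy
    for key, value in list(out.items()):
        if isinstance(value, int):
            # Assumption: "data" is dense (2 axes), everything else sparse (1 axis).
            out[key] = (value, 2 if key == "data" else 1)
        elif isinstance(value, (tuple, list)):
            out[key] = tuple(value)
    return out


from_int = toy_convert_data_dims(10)
from_dict = toy_convert_data_dims({"data": 40, "classes": 10})
```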

returnn.datasets.basic.shapes_for_batches(batches: Sequence[Batch], *, data_keys: Sequence[str], dataset: Dataset | None = None, extern_data: TensorDict | None, enforce_min_len1: bool = False) Dict[str, List[int]] | None[source]
Parameters:
  • batches

  • data_keys

  • dataset

  • extern_data – detailed data description

  • enforce_min_len1

returnn.datasets.basic.set_config_extern_data_from_dataset(config, dataset)[source]
Parameters: