returnn.datasets.meta

There are use cases in which we want to combine several datasets:

  • Multimodality: features from several datasets should be provided at the same time

    • Examples: multi-source translation, speech translation with source CTC loss for stability (needs both source audio and transcription)

  • Multi-Task Learning: several datasets should be used alternatingly, such that at each time the dataset of the corresponding task is selected

    • Examples: multi-task speech translation (either from audio or from text)

  • Combination of Corpora: the training data should be split into different datasets. This allows creating a combined corpus dynamically and avoids manual concatenation/shuffling.

    • Examples: multi-lingual translation systems (datasets can be reused from corresponding bilingual systems)

The dataset classes MetaDataset and CombinedDataset, which cover these use cases, are implemented in returnn/datasets/meta.py.

class returnn.datasets.meta.EpochWiseFilter(epochs_opts: Dict[Tuple[int, int | None], Dict[str, Any]], debug_msg_prefix: str = 'EpochWiseFilter')[source]

Applies some filter to the sequences (e.g. by seq length) for some epoch.

Parameters:
  • epochs_opts – (ep_start, ep_end) -> epoch opts

  • debug_msg_prefix
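
A minimal sketch of an epochs_opts dict, as it could be passed via the epoch_wise_filter option of a wrapping dataset (see e.g. ConcatSeqsDataset below). The option key "max_mean_len" is an assumption here for illustration; check filter_epoch for the options actually supported:

epoch_wise_filter = {
    # (ep_start, ep_end) -> opts; ep_end=None means "until the last epoch"
    (1, 5): {"max_mean_len": 200},    # early epochs: prefer shorter sequences (assumed option key)
    (6, 10): {"max_mean_len": 500},   # later epochs: relax the limit
    (11, None): {},                   # from epoch 11 on: no filtering
}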

classmethod filter_epoch(opts: Dict[str, Any] | CollectionReadCheckCovered, seq_order: Sequence[int], get_seq_len: Callable[[int], int], debug_msg_prefix: str) List[int][source]
Parameters:
  • opts

  • seq_order – list of seq idxs

  • get_seq_len – seq idx -> len

  • debug_msg_prefix

Returns:

new seq_order

filter(epoch, seq_order, get_seq_len)[source]
Parameters:
  • epoch (int|None)

  • seq_order (Sequence[int]) – list of seq idxs

  • get_seq_len (((int)->int)) – seq idx -> len

Returns:

new seq_order

class returnn.datasets.meta.MetaDataset(datasets: Dict[str, Dict[str, Any]], data_map: Dict[str, Tuple[str, str]], seq_list_file: str | Dict[str, str] | None = None, seq_order_control_dataset: str | None = None, seq_lens_file: str | None = None, data_dims: Dict[str, Tuple[int, int]] | None = None, data_dtypes: Dict[str, str] | None = None, window: int = 1, **kwargs)[source]

The MetaDataset is to be used in the case of Multimodality. Here, the datasets are expected to describe different features of the same training sequences. These features will all be available to the network at the same time.

The datasets to be combined are given via the input parameter "datasets". To define which training examples from the different datasets belong together, a "seq_list_file" in pickle format has to be created. It contains a list of sequence tags for each dataset (see example below). Note that, in general, each dataset type has its own tag format, e.g. line-<n> for the TranslationDataset and <corpusname>/<recording>/<segment id> for the SprintDataset. The sequence list can be omitted if the set of sequence tags is the same for all datasets. When using multiple ExternSprintDataset instances, the Sprint segment file can be provided as the sequence list. In this case, the MetaDataset assumes that sequences with equal tags correspond to each other. This works, e.g., when combining TranslationDatasets if all text files are sentence-aligned.

Example of Sequence List:

{ 'sprint': [
    'corpus/ted_1/1',
    'corpus/ted_1/2',
    'corpus/ted_1/3',
    'corpus/ted_1/4'],
'translation': [
    'line-0',
    'line-1',
    'line-2',
    'line-3']
}

Python dict stored in pickle file. E.g. the sequence tagged with ‘corpus/ted_1/3’ in the ‘sprint’ dataset corresponds to the sequence tagged ‘line-2’ in the ‘translation’ dataset.
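
A minimal sketch of how such a sequence list could be written to disk; the tag values are placeholders matching the example above:

import pickle

# Hypothetical tags; in practice collect them from the respective datasets.
seq_list = {
    "sprint": ["corpus/ted_1/1", "corpus/ted_1/2", "corpus/ted_1/3", "corpus/ted_1/4"],
    "translation": ["line-0", "line-1", "line-2", "line-3"],
}
with open("seq_list.pkl", "wb") as f:
    pickle.dump(seq_list, f)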

Example of MetaDataset config:

train = {"class": "MetaDataset", "seq_list_file": "seq_list.pkl",
         "datasets": {"sprint": train_sprint, "translation": train_translation},
         "data_map": {"data": ("sprint", "data"),
                      "target_text_sprint": ("sprint", "orth_classes"),
                      "source_text": ("translation", "data"),
                      "target_text": ("translation", "classes")},
         "seq_ordering": "random",
         "partition_epoch": 2,
}

This combines a SprintDataset and a TranslationDataset. These are defined as "train_sprint" and "train_translation" separately. Note that the current implementation expects one input feature to be called “data”.

Sequence Sorting:

If the selected sequence order uses the length of the data (e.g. when using “sorted” or any kind of “laplace”), a sub-dataset has to be specified via seq_order_control_dataset. The desired sorting needs to be set as a parameter of this sub-dataset; setting seq_ordering on the MetaDataset itself will be ignored.
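
A sketch of what this could look like, assuming the same sub-datasets as in the config example above (whether a given sub-dataset supports length-based sorting depends on its type):

train_sprint = dict(train_sprint, seq_ordering="laplace:.1000")  # sorting is configured on the sub-dataset

train = {"class": "MetaDataset", "seq_list_file": "seq_list.pkl",
         "datasets": {"sprint": train_sprint, "translation": train_translation},
         "data_map": {"data": ("sprint", "data"),
                      "source_text": ("translation", "data"),
                      "target_text": ("translation", "classes")},
         "seq_order_control_dataset": "sprint"}  # this sub-dataset defines the order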

Parameters:
  • datasets – dataset-key -> dataset-kwargs. including keyword ‘class’ and maybe ‘files’

  • data_map – self-data-key -> (dataset-key, dataset-data-key). Should contain ‘data’ as key. Also defines the target-list, which is all except ‘data’.

  • seq_list_file

    filename. pickle (.pkl) or txt (line-based seq tags). optionally gzipped (.gz). If a single file, and pickled, it can directly contain the dict:

    dict[str,list[str]]: dataset-key -> list of sequence tags.

    If a dict, expect dataset-key -> filename. Can be None if tag format is the same for all datasets.

    If None, the sequence list will be the default sequence order of the default dataset (data_map["data"][0]), or of seq_order_control_dataset if set. You only need it if the tag names are not the same across datasets. It will currently not act as a filter, as the sub-dataset controls the sequence order (and thus which seqs are used).

  • seq_order_control_dataset – if set, this dataset will define the order for each epoch.

  • seq_lens_file – filename. json. dict[str,dict[str,int]], seq-tag -> data-key -> len. Use if getting sequence length from loading data is too costly.

  • data_dims – self-data-key -> data-dimension, len(shape) (1 ==> sparse repr). Deprecated/Only to double-check. Read from data if not specified.

  • data_dtypes – self-data-key -> dtype. Read from data if not specified. Deprecated, not used.

init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]
Parameters:
  • epoch (int|None)

  • seq_list (list[str]|None) – List of sequence tags, to set a predefined order.

  • seq_order (list[int]|None) – List of corpus sequence indices, to set a predefined order.

Return type:

bool

supports_seq_order_sorting() bool[source]

supports sorting

supports_sharding() bool[source]
Returns:

whether this dataset supports sharding

get_current_seq_order()[source]
Returns:

current seq order for the current epoch, after self.init_seq_order was called.

Return type:

list[int]

get_all_tags()[source]
Returns:

list of all seq tags, of the whole dataset, without partition epoch

Return type:

list[str]

get_total_num_seqs(*, fast: bool = False) int[source]
Returns:

total number of seqs, without partition epoch

finish_epoch(*, free_resources: bool = False)[source]

This would get called at the end of the epoch.

get_seq_length(sorted_seq_idx)[source]
Parameters:

sorted_seq_idx (int)

Return type:

NumbersDict

get_tag(sorted_seq_idx)[source]
Parameters:

sorted_seq_idx (int)

Return type:

str

get_complete_frac(sorted_seq_idx: int, **kwargs) float | None[source]
Parameters:

sorted_seq_idx

get_data_keys() List[str][source]

data keys

get_target_list()[source]
Return type:

list[str]

get_data_shape(data_key)[source]
Parameters:

data_key (str)

Return type:

list[int]

get_data_dtype(key)[source]
Parameters:

key (str)

Return type:

str

is_data_sparse(key)[source]
Parameters:

key (str)

Return type:

bool

class returnn.datasets.meta.ClusteringDataset(dataset: Dict[str, Any], cluster_map_file: str, n_clusters: int, single_cluster: bool = False, **kwargs)[source]

This is a special case of MetaDataset, with one main subdataset, and we add a cluster-idx for each seq. We will read the cluster-map (seq-name -> cluster-idx) here directly.

Parameters:
  • dataset

  • cluster_map_file

  • n_clusters

  • single_cluster

init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]
Parameters:
  • epoch (int)

  • seq_list (list[str]|None) – List of sequence tags, to set a predefined order.

  • seq_order (list[int]|None) – List of corpus sequence indices, to set a predefined order.

Return type:

bool

get_data_keys()[source]
Return type:

list[str]

get_data_dtype(key)[source]
Parameters:

key (str)

Return type:

str

property num_seqs[source]
Return type:

int

is_less_than_num_seqs(n)[source]
Parameters:

n (int)

Return type:

bool

get_tag(seq_idx)[source]
Parameters:

seq_idx (int)

Return type:

str

class returnn.datasets.meta.ConcatDataset(datasets: Sequence[Dict[str, Any]], **kwargs)[source]

This concatenates multiple datasets. They are expected to provide the same data-keys and data-dimensions. It will go through the datasets always in order.

Parameters:

datasets – list of kwargs for init_dataset
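
A minimal config sketch; train_part1 and train_part2 are placeholders for kwargs dicts of datasets that provide the same data-keys and data-dimensions:

train = {"class": "ConcatDataset",
         "datasets": [train_part1, train_part2]}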

init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]
Parameters:
  • seq_list (list[str]|None) – List of sequence tags, to set a predefined order.

  • seq_order (list[int]|None) – List of corpus sequence indices, to set a predefined order.

property num_seqs[source]
Return type:

int

get_target_list()[source]
Return type:

list[str]

class returnn.datasets.meta.CombinedDataset(datasets: Dict[str, Dict[str, Any]], data_map: Dict[Tuple[str, str], str], sampling_sizes: None | int | Dict[str, int] = None, data_dims: Dict[str, Tuple[int, int]] | None = None, data_dtypes: Dict[str, str] | None = None, window: int = 1, **kwargs)[source]

The CombinedDataset is to be used in the cases of Multi-Task Learning and Combination of Corpora. Here, in general, the datasets describe different training sequences. For each sequence, only the features of the corresponding dataset will be available. Features of the other datasets are set to empty arrays. The input parameter "datasets" is the same as for the MetaDataset. The "data_map" is reversed to allow for several datasets mapping to the same feature. The "default" "seq_ordering" is to first go through all sequences of the first dataset, then the second and so on. All other sequence orderings ("random", "sorted", "laplace", …) are supported and based on this “default” ordering. There is a special sequence ordering "random_dataset", where we pick datasets at random, while keeping the sequence order within the datasets as is. To adjust the ratio of number of training examples from the different datasets in an epoch, one can use "repeat_epoch" in some of the datasets to increase their size relative to the others. Also, "partition_epoch" in some of the datasets can be used to shrink them relative to the others.

Example of CombinedDataset config:

train = {"class": "CombinedDataset",
         "datasets": {"sprint": train_sprint, "translation": train_translation},
         "data_map": {("sprint", "data"): "data",
                      ("sprint", "orth_classes"): "orth_classes",
                      ("translation", "data"): "source_text",
                      ("translation", "classes"): "orth_classes"},
         "seq_ordering": "default",
         "partition_epoch": 2,
 }

This combines a SprintDataset and a TranslationDataset. These are defined as "train_sprint" and "train_translation" separately. Note that the current implementation expects one input feature to be called “data”.

Note: The mapping is inverted compared to MetaDataset. We now expect (dataset-key, dataset-data-key) -> self-data-key, e.g. am-dataset:data -> am-data, am-dataset:classes -> am-classes, lm-dataset:data -> lm-data. For each sequence idx, it will select one of the given datasets, fill in the data-keys of this dataset and return empty sequences for the remaining datasets. The default sequence ordering is to first go through all sequences of dataset 1, then dataset 2, and so on. If seq_ordering is set to ‘random_dataset’, we always pick one of the datasets at random (with probability proportional to their number of seqs), but still go through the sequences of a particular dataset in the order defined for it in the config (in order if not defined). For ‘sorted’ or ‘laplace’, the sequence lengths as provided by the datasets are used to sort all sequences jointly. Note that this overrides the sequence order of the sub-datasets (this is also the case for ‘random’). ‘partition_epoch’ of the CombinedDataset is applied to the joint sequence order over all sequences. ‘partition_epoch’ of the sub-datasets is still applied. This can be used to adjust the relative size of the datasets. (However, do not combine ‘partition_epoch’ on both levels, as this leads to an unexpected selection of sequences.) To upscale a dataset, rather than downscaling the others via ‘partition_epoch’, use the ‘repeat_epoch’ option.
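
A sketch of rebalancing the sub-datasets via repeat_epoch, assuming the same sub-datasets as in the config example above; the factor 3 is arbitrary:

# Hypothetical: the translation corpus is much smaller than the speech corpus,
# so it is repeated 3 times per epoch to rebalance the mix.
train_translation = dict(train_translation, repeat_epoch=3)

train = {"class": "CombinedDataset",
         "datasets": {"sprint": train_sprint, "translation": train_translation},
         "data_map": {("sprint", "data"): "data",
                      ("sprint", "orth_classes"): "orth_classes",
                      ("translation", "data"): "source_text",
                      ("translation", "classes"): "orth_classes"},
         "seq_ordering": "random"}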

Also see MetaDataset.

Parameters:
  • datasets – dataset-key -> dataset-kwargs. including keyword ‘class’ and maybe ‘files’

  • data_map – (dataset-key, dataset-data-key) -> self-data-key. Should contain ‘data’ as key. Also defines the target-list, which is all except ‘data’.

  • sampling_sizes – dataset-key -> number-of-sequences. If set, the given fixed amount of sequences is taken from each dataset in every epoch (instead of using all). If an int is given, this number is used for all datasets. The sequences will be taken in the order provided by the sub-datasets and we will loop back to the beginning of the dataset each time we reach the end. Sequence ordering will be applied after the sampling. Partition and repeat epoch are not supported when sampling.

  • data_dims – self-data-key -> data-dimension, len(shape) (1 ==> sparse repr). Deprecated/Only to double check. Read from data if not specified.

  • data_dtypes – self-data-key -> dtype. Read from data if not specified.

init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]
Parameters:
  • epoch (int)

  • seq_list (list[str]|None)

  • seq_order (list[int]|None)

Return type:

bool

is_less_than_num_seqs(n)[source]
Parameters:

n (int)

Return type:

bool

get_target_list()[source]
Return type:

list[str]

get_data_dtype(key)[source]
Parameters:

key (str)

Return type:

str

get_data_dim(key)[source]
Parameters:

key (str)

Return type:

int

class returnn.datasets.meta.ConcatSeqsDataset(dataset, seq_list_file, seq_len_file, seq_tag_delim=';', remove_in_between_postfix=None, repeat_in_between_last_frame_up_to_multiple_of=None, pad_narrow_data_to_multiple_of_target_len=None, use_cache_manager=False, epoch_wise_filter=None, **kwargs)[source]

This takes another dataset, and concatenates one or multiple seqs.

Parameters:
  • dataset (dict[str]|str|Dataset) – kwargs for init_dataset

  • seq_list_file (str) – filename. line-separated. seq_tag_delim.join(seq_tags) for concatenated seqs

  • seq_len_file (str) – file with Python dict, (single) seg_name -> len, which is used for sorting

  • seq_tag_delim (str)

  • remove_in_between_postfix (dict[str,int]|None) – data_key -> expected postfix label. e.g. {“targets”: 0}

  • repeat_in_between_last_frame_up_to_multiple_of (dict[str,int]|None) – data_key -> multiple of. Example: you have downsampling factor 6, i.e. ceildiv(data_len, 6) == align_len. Now it could happen that ceildiv(data_len1 + data_len2, 6) < align_len1 + align_len2. This option would repeat intermediate ending frames such that data_len1 % 6 == 0, by setting it to {“data”: 6}.

  • pad_narrow_data_to_multiple_of_target_len (dict[str,(str,int)]|None) – data_key -> (target_key, multiple). Similar to repeat_in_between_last_frame_up_to_multiple_of, but works for more padding/alignment schemes. Example: align_len == ceildiv(data_len - P, F) for all your sub-sequences, where P is a custom number. repeat_in_between_last_frame_up_to_multiple_of would not work here because align_len != ceildiv(data_len, F). This option would pad/narrow so that align_len * F == data_len for all but the last sub-sequence; set it to {“data”: (“classes”, F)} to ensure concat_align_len == ceildiv(concat_data_len - P, F).

  • use_cache_manager (bool)

  • epoch_wise_filter (dict[(int,int),dict]) – see EpochWiseFilter
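
A minimal config sketch with placeholder file names; each line of concat_seq_list.txt would contain one or more original seq tags joined by the delimiter (e.g. "tagA;tagB"):

train = {"class": "ConcatSeqsDataset",
         "dataset": base_train_dataset,           # placeholder: kwargs dict (or name) of the wrapped dataset
         "seq_list_file": "concat_seq_list.txt",  # one entry per new seq: seq_tag_delim.join(original_tags)
         "seq_len_file": "seq_lens.txt",          # Python dict: single seq tag -> length, used for sorting
         "seq_tag_delim": ";",
         "seq_ordering": "laplace:.1000"}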

init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]
Parameters:
  • epoch (int)

  • seq_list (list[str]|None)

  • seq_order (list[int]|None)

Return type:

bool

supports_seq_order_sorting() bool[source]

supports sorting

have_corpus_seq_idx()[source]
Return type:

bool

get_corpus_seq_idx(seq_idx)[source]
Parameters:

seq_idx (int)

Return type:

int

get_data_keys()[source]
Return type:

list[str]

get_target_list()[source]
Return type:

list[str]

get_data_dtype(key)[source]
Parameters:

key (str)

Return type:

str

get_data_dim(key)[source]
Parameters:

key (str)

Return type:

int

is_data_sparse(key)[source]
Parameters:

key (str)

Return type:

bool

get_data_shape(key)[source]
Parameters:

key (str)

Return type:

list[int]

get_total_num_seqs(*, fast: bool = False) int[source]

total num seqs

class returnn.datasets.meta.ChunkShuffleDataset(dataset: Dict[str, Any], chunk_shuffle_cache: int = 1000, batch_gen_batch_size: int = 5000, batch_gen_max_seqs: int = 1, batch_gen_recurrent_net: bool = True, **kwargs)[source]

This goes through a dataset, caches a number of recent chunks (chunk_shuffle_cache), and provides them in shuffled order.

Parameters:

dataset (dict[str]) – kwargs for init_dataset
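
A minimal config sketch; inner_train is a placeholder kwargs dict for the dataset to be wrapped:

train = {"class": "ChunkShuffleDataset",
         "dataset": inner_train,       # placeholder: kwargs dict for the wrapped dataset
         "chunk_shuffle_cache": 1000}  # number of recent chunks kept for shuffling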

init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]
Parameters:
  • seq_list (list[str]|None)

  • seq_order (list[int]|None)

is_less_than_num_seqs(seq_idx)[source]
Returns:

whether seq_idx < num_seqs. In case num_seqs is not known in advance, it will wait until it knows that seq_idx is behind the end or that we have the seq.

Return type:

bool

get_target_list()[source]
Return type:

list[str]

class returnn.datasets.meta.VariableDataset(*, get_dataset, dataset_lru_cache_size: int = 1, **kwargs)[source]

For every (sub)epoch, it generates a new sub-dataset, based on a user-provided function.

Parameters:
  • get_dataset – function (*, epoch: int, **_) -> Dict[str,Any], called for every sub-epoch. The dataset(s) from previous calls are cached (dataset_lru_cache_size), and if the returned dict is the same as one of those, the dataset is not recreated. See the sketch below.

  • dataset_lru_cache_size
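
A minimal sketch of a get_dataset function; the file naming scheme is hypothetical:

def get_dataset(*, epoch: int, **_) -> dict:
    """Return the kwargs dict of the sub-dataset for this (sub)epoch."""
    # Hypothetical scheme: cycle through 10 HDF shards, one per sub-epoch.
    return {"class": "HDFDataset", "files": ["train.shard%02i.hdf" % (epoch % 10)]}

train = {"class": "VariableDataset", "get_dataset": get_dataset}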

init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]

init seq order

finish_epoch(*, free_resources: bool = False)[source]

finish epoch

supports_seq_order_sorting() bool[source]

supports sorting

get_current_seq_order() Sequence[int][source]

current seq order

get_seq_length(sorted_seq_idx: int) NumbersDict[source]

seq len

get_tag(sorted_seq_idx: int) str[source]

tag

get_data_keys() List[str][source]

data keys

get_target_list() List[str][source]

target list

is_cached(start: int, end: int) bool[source]

is cached

property num_seqs: int[source]

num seqs

is_less_than_num_seqs(n: int) bool[source]

n < num_seqs

get_num_timesteps() int[source]

num timesteps

load_seqs(start: int, end: int)[source]

load seqs

get_data(seq_idx: int, key: str) ndarray[source]

data

get_input_data(seq_idx: int) ndarray[source]

input data

get_targets(target: str, seq_idx: int) ndarray[source]

target data

get_data_dim(key: str) int[source]

data dim

get_data_shape(data_key: str) List[int][source]

data shape

get_data_dtype(key: str) str[source]

data dtype

is_data_sparse(key: str) bool[source]

is data sparse

class returnn.datasets.meta.MultiEpochDataset(*, dataset: Dict[str, Any], multi_epoch: int, **kwargs)[source]

It wraps some dataset, where one outer epoch corresponds to multiple epochs in the inner wrapped dataset.

This can be useful when the inner dataset uses partition_epoch, and we want to cover the whole full epoch.

One specific example is when the data is distributed over multiple files, and for reasonable performance you want the data copied to the local disk, but all data together is too large to fit there. Then DistributeFilesDataset is the logical choice, which solves these issues. However, you must use some partition_epoch in DistributeFilesDataset so that it does not load all data at once. To cover all the data, you can use this MultiEpochDataset and set multi_epoch to the partition_epoch of the inner dataset.
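
A config sketch for that example, assuming an inner DistributeFilesDataset with partition_epoch 10 (its other options are omitted here):

inner_train = {"class": "DistributeFilesDataset",
               # ... other DistributeFilesDataset options ...
               "partition_epoch": 10}

train = {"class": "MultiEpochDataset",
         "dataset": inner_train,
         "multi_epoch": 10}  # one outer epoch covers all 10 inner sub-epochs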

Parameters:
  • dataset – the inner wrapped dataset

  • multi_epoch – how many inner epochs correspond to one outer epoch

init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]

init seq order

finish_epoch(*, free_resources: bool = False)[source]

finish epoch

get_all_tags() List[str][source]

all tags

get_total_num_seqs(*, fast: bool = False) int[source]

total num seqs

get_data_keys() List[str][source]

data keys

get_target_list() List[str][source]

target list

get_data_dim(key: str) int[source]

data dim

get_data_shape(data_key: str) List[int][source]

data shape

get_data_dtype(key: str) str[source]

data dtype

is_data_sparse(key: str) bool[source]

is data sparse

class returnn.datasets.meta.AnythingDataset(*, data_keys: Dict[str, Dict[str, Any]], **kwargs)[source]

An infinite dataset, creating dummy (zero) data on the fly, given the data-keys and their shapes.

When this is used inside a MetaDataset, controlled by the seq list from another dataset, it will just take over whatever seq list is given.

Parameters:

data_keys – similar to extern_data; defines shape, dtype, sparse, dim, etc.
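
A minimal sketch of a data_keys definition in the extern_data-like format described above; the shapes and dims are hypothetical:

dummy = {"class": "AnythingDataset",
         "data_keys": {"data": {"shape": (None, 80), "dtype": "float32"},
                       "classes": {"shape": (None,), "dim": 10025, "sparse": True}}}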

init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]

init seq order

supports_seq_order_sorting() bool[source]

supports sorting

get_current_seq_order() Sequence[int][source]

current seq order

have_seqs() bool[source]

whether there are any seqs

get_seq_length(sorted_seq_idx: int) NumbersDict[source]

seq len

get_tag(sorted_seq_idx: int) str[source]

tag

get_data_keys() List[str][source]

data keys

get_target_list() List[str][source]

target list

is_cached(start: int, end: int) bool[source]

is cached

property num_seqs: int[source]

num seqs

is_less_than_num_seqs(n: int) bool[source]

n < num_seqs

get_data(seq_idx: int, key: str) ndarray[source]

data

get_input_data(seq_idx: int) ndarray[source]

input data

get_targets(target: str, seq_idx: int) ndarray[source]

target data

get_data_dim(key: str) int[source]

data dim

get_data_shape(data_key: str) List[int][source]

data shape

get_data_dtype(key: str) str[source]

data dtype

is_data_sparse(key: str) bool[source]

is data sparse