returnn.datasets.meta#

There are use cases in which we want to combine several datasets:

  • Multimodality: features from several datasets should be provided at the same time

    • Examples: multi-source translation, speech translation with source CTC loss for stability (needs both source audio and transcription)

  • Multi-Task Learning: several datasets should be used in alternation, such that at each step the dataset of the corresponding task is selected

    • Examples: multi-task speech translation (either from audio or from text)

  • Combination of Corpora: the training data should be split into different datasets. This allows creating a combined corpus dynamically and avoids manual concatenation/shuffling.

    • Examples: multi-lingual translation systems (datasets can be reused from corresponding bilingual systems)

The dataset classes MetaDataset and CombinedDataset, which perform these tasks, are implemented in returnn/datasets/meta.py.

class returnn.datasets.meta.EpochWiseFilter(epochs_opts, debug_msg_prefix='EpochWiseFilter')[source]#

Applies some filter to the sequences (e.g. by seq length) for some epoch.
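
A rough sketch of what the epochs_opts dict could look like (the "max_mean_len" option is an assumption for illustration and not documented on this page; check the filter implementation for the options it actually supports):

epoch_wise_filter = {
    # (ep_start, ep_end) -> epoch opts
    (1, 5): {"max_mean_len": 200},   # assumed option: restrict to shorter seqs in early epochs
    (6, 10): {"max_mean_len": 500},  # relax the restriction in later epochs
}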

Parameters:
  • epochs_opts (dict[(int,int|None),dict[str]]) – (ep_start, ep_end) -> epoch opts

  • debug_msg_prefix (str) –

classmethod filter_epoch(opts, seq_order, get_seq_len, debug_msg_prefix)[source]#
Parameters:
  • opts (dict[str]) – filter options for this epoch range

  • seq_order (Sequence[int]) – list of seq idxs

  • get_seq_len ((int)->int) – seq idx -> len

  • debug_msg_prefix (str) –

Returns:

new seq_order

Return type:

list[int]

filter(epoch, seq_order, get_seq_len)[source]#
Parameters:
  • epoch (int|None) –

  • seq_order (Sequence[int]) – list of seq idxs

  • get_seq_len ((int)->int) – seq idx -> len

Returns:

new seq_order

class returnn.datasets.meta.MetaDataset(datasets, data_map, seq_list_file=None, seq_order_control_dataset=None, seq_lens_file=None, data_dims=None, data_dtypes=None, window=1, **kwargs)[source]#

The MetaDataset is to be used in the case of Multimodality. Here, the datasets are expected to describe different features of the same training sequences. These features will all be available to the network at the same time.

The datasets to be combined are given via the input parameter "datasets". To define which training examples from the different datasets belong together, a "seq_list_file" in pickle format has to be created. It contains a list of sequence tags for each dataset (see the example below). Note that, in general, each dataset type has its own tag format, e.g. for the TranslationDataset it is line-<n>, and for the SprintDataset it is <corpusname>/<recording>/<segment id>. Providing a sequence list can be omitted if the set of sequence tags is the same for all datasets; in that case the MetaDataset assumes that sequences with equal tags correspond to each other. This e.g. works when combining TranslationDatasets if all the text files are sentence-aligned. When using multiple ExternSprintDataset instances, the sprint segment file can be provided as the sequence list.

Example of Sequence List:

{ 'sprint': [
    'corpus/ted_1/1',
    'corpus/ted_1/2',
    'corpus/ted_1/3',
    'corpus/ted_1/4'],
  'translation': [
    'line-0',
    'line-1',
    'line-2',
    'line-3']
}

This is a Python dict stored in a pickle file. E.g. the sequence tagged ‘corpus/ted_1/3’ in the ‘sprint’ dataset corresponds to the sequence tagged ‘line-2’ in the ‘translation’ dataset.
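
A minimal sketch of how such a file could be written (it just reproduces the dict from the example above; the file name "seq_list.pkl" matches the config below):

import pickle

# Sequence list from the example above: the i-th tag of each list refers to
# the same underlying training sequence.
seq_list = {
    'sprint': ['corpus/ted_1/%i' % i for i in range(1, 5)],
    'translation': ['line-%i' % i for i in range(4)],
}
with open('seq_list.pkl', 'wb') as f:
    pickle.dump(seq_list, f)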

Example of MetaDataset config:

train = {"class": "MetaDataset", "seq_list_file": "seq_list.pkl",
         "datasets": {"sprint": train_sprint, "translation": train_translation},
         "data_map": {"data": ("sprint", "data"),
                      "target_text_sprint": ("sprint", "orth_classes"),
                      "source_text": ("translation", "data"),
                      "target_text": ("translation", "classes")},
         "seq_ordering": "random",
         "partition_epoch": 2,
}

This combines a SprintDataset and a TranslationDataset. These are defined as "train_sprint" and "train_translation" separately. Note that the current implementation expects one input feature to be called “data”.

Sequence Sorting:

If the selected sequence order uses the length of the data (e.g. when using "sorted" or any kind of "laplace"), a sub-dataset has to be specified via seq_order_control_dataset. The desired sorting needs to be set as a parameter of this sub-dataset; setting seq_ordering for the MetaDataset itself will be ignored.
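
A sketch of how this could look, based on the config above ("laplace:.1000" is just one possible ordering string; adjust it to your setup):

# Sorting is defined in the controlling sub-dataset, not in the MetaDataset itself.
train_sprint["seq_ordering"] = "laplace:.1000"
train = {"class": "MetaDataset", "seq_list_file": "seq_list.pkl",
         "datasets": {"sprint": train_sprint, "translation": train_translation},
         "data_map": {"data": ("sprint", "data"),
                      "source_text": ("translation", "data"),
                      "target_text": ("translation", "classes")},
         "seq_order_control_dataset": "sprint",  # this sub-dataset defines the order
}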

Parameters:
  • datasets (dict[str,dict[str]]) – dataset-key -> dataset-kwargs. including keyword ‘class’ and maybe ‘files’

  • data_map (dict[str,(str,str)]) – self-data-key -> (dataset-key, dataset-data-key). Should contain ‘data’ as key. Also defines the target-list, which is all except ‘data’.

  • seq_list_file (str|None) –

    filename. pickle. dict[str,list[str]], dataset-key -> list of sequence tags. Can be None if tag format is the same for all datasets.

    Then the sequence list will be the default sequence order of the default dataset (data_map["data"][0]), or of seq_order_control_dataset if set. You only need it if the tag names are not the same for all datasets. It will currently not act as a filter, as the sub-dataset controls the sequence order (and thus which seqs to use).

  • seq_order_control_dataset (str|None) – if set, this dataset will define the order for each epoch.

  • seq_lens_file (str|None) – filename. json. dict[str,dict[str,int]], seq-tag -> data-key -> len. Use if getting sequence length from loading data is too costly (see the sketch after this parameter list).

  • data_dims (dict[str,(int,int)]) – self-data-key -> data-dimension, len(shape) (1 ==> sparse repr). Deprecated/Only to double check. Read from data if not specified.

  • data_dtypes (dict[str,str]) – self-data-key -> dtype. Read from data if not specified.
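
As a rough illustration of the seq_lens_file format mentioned above, such a file could be written like this (tags, data keys and lengths are placeholders):

import json

# Pre-computed sequence lengths: seq-tag -> data-key -> len.
seq_lens = {
    'corpus/ted_1/1': {'data': 913, 'target_text': 21},
    'corpus/ted_1/2': {'data': 540, 'target_text': 12},
}
with open('seq_lens.json', 'w') as f:
    json.dump(seq_lens, f)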

init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]#
Parameters:
  • epoch (int|None) –

  • seq_list (list[str]|None) – List of sequence tags, to set a predefined order.

  • seq_order (list[int]|None) – List of corpus sequence indices, to set a predefined order.

Return type:

bool

supports_seq_order_sorting() bool[source]#

supports sorting

get_current_seq_order()[source]#
Returns:

current seq order for the current epoch, after self.init_seq_order was called.

Return type:

list[int]

get_all_tags()[source]#
Returns:

list of all seq tags, of the whole dataset, without partition epoch

Return type:

list[str]

get_total_num_seqs() int[source]#
Returns:

total number of seqs, without partition epoch

finish_epoch(*, free_resources: bool = False)[source]#

This would get called at the end of the epoch.

get_seq_length(sorted_seq_idx)[source]#
Parameters:

sorted_seq_idx (int) –

Return type:

NumbersDict

get_tag(sorted_seq_idx)[source]#
Parameters:

sorted_seq_idx (int) –

Return type:

str

get_data_keys() List[str][source]#

data keys

get_target_list()[source]#
Return type:

list[str]

get_data_shape(data_key)[source]#
Parameters:

data_key (str) –

Return type:

list[int]

get_data_dtype(key)[source]#
Parameters:

key (str) –

Return type:

str

is_data_sparse(key)[source]#
Parameters:

key (str) –

Return type:

bool

class returnn.datasets.meta.ClusteringDataset(dataset, cluster_map_file, n_clusters, single_cluster=False, **kwargs)[source]#

This is a special case of MetaDataset, with one main subdataset, and we add a cluster-idx for each seq. We will read the cluster-map (seq-name -> cluster-idx) here directly.
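
A minimal config sketch (train_base and the file name are placeholders; the expected format of the cluster map file is defined by the implementation, not by this page):

train = {
    "class": "ClusteringDataset",
    "dataset": train_base,              # kwargs of the main sub-dataset
    "cluster_map_file": "cluster.map",  # provides seq-name -> cluster-idx
    "n_clusters": 100,
}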

Parameters:
  • dataset (dict[str]) –

  • cluster_map_file

  • n_clusters (int) –

  • single_cluster

init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]#
Parameters:
  • epoch (int) –

  • seq_list (list[str]|None) – List of sequence tags, to set a predefined order.

  • seq_order (list[int]|None) – List of corpus sequence indices, to set a predefined order.

Return type:

bool

get_data_keys()[source]#
Return type:

list[str]

get_data_dtype(key)[source]#
Parameters:

key (str) –

Return type:

str

property num_seqs[source]#
Return type:

int

is_less_than_num_seqs(n)[source]#
Parameters:

n (int) –

Return type:

bool

get_tag(seq_idx)[source]#
Parameters:

seq_idx (int) –

Return type:

str

class returnn.datasets.meta.ConcatDataset(datasets, **kwargs)[source]#

This concatenates multiple datasets. They are expected to provide the same data-keys and data-dimensions. It will go through the datasets always in order.
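
A minimal config sketch (train_part1 and train_part2 are placeholders for sub-dataset kwargs dicts providing the same data-keys):

train = {
    "class": "ConcatDataset",
    # the sub-datasets are gone through strictly in this order
    "datasets": [train_part1, train_part2],
}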

Parameters:

datasets (list[dict[str]]) – list of kwargs for init_dataset

init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]#
Parameters:
  • seq_list (list[str]|None) – List of sequence tags, to set a predefined order.

  • seq_order (list[int]|None) – List of corpus sequence indices, to set a predefined order.

property num_seqs[source]#
Return type:

int

get_target_list()[source]#
Return type:

list[str]

class returnn.datasets.meta.CombinedDataset(datasets, data_map, data_dims=None, data_dtypes=None, sampling_sizes=None, window=1, **kwargs)[source]#

The CombinedDataset is to be used in the cases of Multi-Task Learning and Combination of Corpora. Here, in general, the datasets describe different training sequences. For each sequence, only the features of the corresponding dataset will be available. Features of the other datasets are set to empty arrays. The input parameter "datasets" is the same as for the MetaDataset. The "data_map" is reversed to allow for several datasets mapping to the same feature.

The "default" "seq_ordering" is to first go through all sequences of the first dataset, then the second and so on. All other sequence orderings ("random", "sorted", "laplace", …) are supported and based on this "default" ordering. There is a special sequence ordering "random_dataset", where we pick datasets at random, while keeping the sequence order within the datasets as is.

To adjust the ratio of the number of training examples from the different datasets in an epoch, one can use "repeat_epoch" in some of the datasets to increase their size relative to the others. Also, "partition_epoch" in some of the datasets can be used to shrink them relative to the others.

Example of CombinedDataset config:

train = {"class": "CombinedDataset",
         "datasets": {"sprint": train_sprint, "translation": train_translation},
         "data_map": {("sprint", "data"): "data",
                      ("sprint", "orth_classes"): "orth_classes",
                      ("translation", "data"): "source_text",
                      ("translation", "classes"): "orth_classes"},
         "seq_ordering": "default",
         "partition_epoch": 2,
 }

This combines a SprintDataset and a TranslationDataset. These are defined as "train_sprint" and "train_translation" separately. Note that the current implementation expects one input feature to be called “data”.

Note: The mapping has been inverted compared to the MetaDataset. We now expect (dataset-key, dataset-data-key) -> self-data-key, e.g. am-dataset:data -> am-data, am-dataset:classes -> am-classes, lm-dataset:data -> lm-data. For each sequence idx, it will select one of the given datasets, fill in the data-keys of this dataset and return empty sequences for the remaining datasets.

The default sequence ordering is to first go through all sequences of dataset 1, then dataset 2 and so on. If seq_ordering is set to ‘random_dataset’, we always pick one of the datasets at random (equally distributed over the sum of num-seqs), but still go through the sequences of a particular dataset in the order defined for it in the config (in order if not defined). For ‘sorted’ or ‘laplace’ the sequence length as provided by the datasets is used to sort all sequences jointly. Note that this overrides the sequence order of the sub-datasets (this is also the case for ‘random’).

‘partition_epoch’ of the CombinedDataset is applied to the joint sequence order for all sequences. ‘partition_epoch’ of the sub-datasets is still applied. This can be used to adjust the relative size of the datasets. (However, do not combine ‘partition_epoch’ on both levels, as this leads to an unexpected selection of sequences.) To upscale a dataset, rather than downscaling the others via ‘partition_epoch’, use the ‘repeat_epoch’ option.

Also see MetaDataset.
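
A sketch of the size adjustment mentioned above, based on the config example (the factor 3 is arbitrary; repeat_epoch is set inside the sub-dataset dict, not on the CombinedDataset itself):

# Upscale the smaller translation dataset so that roughly as many of its
# sequences as sprint sequences are seen per epoch.
train_translation["repeat_epoch"] = 3
train = {"class": "CombinedDataset",
         "datasets": {"sprint": train_sprint, "translation": train_translation},
         "data_map": {("sprint", "data"): "data",
                      ("sprint", "orth_classes"): "orth_classes",
                      ("translation", "data"): "source_text",
                      ("translation", "classes"): "orth_classes"},
         "seq_ordering": "random",
}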

Parameters:
  • datasets (dict[str,dict[str]]) – dataset-key -> dataset-kwargs. including keyword ‘class’ and maybe ‘files’

  • data_map (dict[(str,str),str]) – (dataset-key, dataset-data-key) -> self-data-key. Should contain ‘data’ as key. Also defines the target-list, which is all except ‘data’.

  • sampling_sizes (dict[str,int]|int) – dataset-key -> number-of-sequences. If set, the given fixed amount of sequences is taken from each dataset in every epoch (instead of using all). If an int is given, this number is used for all datasets. The sequences will be taken in the order provided by the sub-datasets and we will loop back to the beginning of the dataset each time we reach the end. Sequence ordering will be applied after the sampling. Partition and repeat epoch are not supported when sampling.

  • data_dims (dict[str,(int,int)]) – self-data-key -> data-dimension, len(shape) (1 ==> sparse repr). Deprecated/Only to double check. Read from data if not specified.

  • data_dtypes (dict[str,str]) – self-data-key -> dtype. Read from data if not specified.

init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]#
Parameters:
  • epoch (int) –

  • seq_list (list[str]|None) –

  • seq_order (list[int]|None) –

Return type:

bool

is_less_than_num_seqs(n)[source]#
Parameters:

n (int) –

Return type:

bool

get_target_list()[source]#
Return type:

list[str]

get_data_dtype(key)[source]#
Parameters:

key (str) –

Return type:

str

get_data_dim(key)[source]#
Parameters:

key (str) –

Return type:

int

class returnn.datasets.meta.ConcatSeqsDataset(dataset, seq_list_file, seq_len_file, seq_tag_delim=';', remove_in_between_postfix=None, repeat_in_between_last_frame_up_to_multiple_of=None, use_cache_manager=False, epoch_wise_filter=None, **kwargs)[source]#

This wraps another dataset and concatenates one or multiple of its sequences into single sequences, as defined by seq_list_file.
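
A minimal config sketch (train_base and the file names are placeholders; the file formats follow the parameter descriptions below):

train = {
    "class": "ConcatSeqsDataset",
    "dataset": train_base,              # the underlying dataset (kwargs dict)
    # one line per concatenated sequence, e.g. "corpus/rec1/1;corpus/rec1/2"
    "seq_list_file": "concat_seqs.txt",
    # file with a Python dict: single seq tag -> length, used for sorting
    "seq_len_file": "seq_lens.py",
    "seq_tag_delim": ";",
}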

Parameters:
  • dataset (dict[str]|str|Dataset) – kwargs for init_dataset

  • seq_list_file (str) – filename. line-separated. seq_tag_delim.join(seq_tags) for concatenated seqs

  • seq_len_file (str) – file with Python dict, (single) seg_name -> len, which is used for sorting

  • seq_tag_delim (str) –

  • remove_in_between_postfix (dict[str,int]|None) – data_key -> expected postfix label. e.g. {“targets”: 0}

  • repeat_in_between_last_frame_up_to_multiple_of (dict[str,int]|None) – data_key -> multiple of. Example: you have downsampling factor 6, i.e. ceildiv(data_len, 6) == align_len. Now it could happen that ceildiv(data_len1 + data_len2, 6) < align_len1 + align_len2. This option would repeat intermediate ending frames such that data_len1 % 6 == 0, by setting it to {“data”: 6}.

  • use_cache_manager (bool) –

  • epoch_wise_filter (dict[(int,int),dict]) – see EpochWiseFilter

init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]#
Parameters:
  • epoch (int) –

  • seq_list (list[str]|None) –

  • seq_order (list[int]|None) –

Return type:

bool

supports_seq_order_sorting() bool[source]#

supports sorting

have_corpus_seq_idx()[source]#
Return type:

bool

get_corpus_seq_idx(seq_idx)[source]#
Parameters:

seq_idx (int) –

Return type:

int

get_data_keys()[source]#
Return type:

list[str]

get_target_list()[source]#
Return type:

list[str]

get_data_dtype(key)[source]#
Parameters:

key (str) –

Return type:

str

get_data_dim(key)[source]#
Parameters:

key (str) –

Return type:

int

is_data_sparse(key)[source]#
Parameters:

key (str) –

Return type:

bool

get_data_shape(key)[source]#
Parameters:

key (str) –

Return type:

list[int]

class returnn.datasets.meta.ChunkShuffleDataset(dataset, chunk_shuffle_cache=1000, batch_gen_batch_size=5000, batch_gen_max_seqs=1, batch_gen_recurrent_net=True, **kwargs)[source]#

This goes through a dataset, caches a number of recent chunks, and provides the chunks from this cache in shuffled order.
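
A minimal config sketch (train_base is a placeholder for the wrapped dataset's kwargs; the remaining value just restates the default from the signature above):

train = {
    "class": "ChunkShuffleDataset",
    "dataset": train_base,        # kwargs of the wrapped dataset
    "chunk_shuffle_cache": 1000,  # number of chunks kept in the shuffle cache
}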

Parameters:

dataset (dict[str]) – kwargs for init_dataset

init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]#
Parameters:
  • seq_list (list[str]|None) –

  • seq_order (list[int]|None) –

is_less_than_num_seqs(seq_idx)[source]#
Returns:

whether seq_idx < num_seqs. In case num_seqs is not known in advance, it will wait until it knows that seq_idx is behind the end or that we have the seq.

Return type:

bool

get_target_list()[source]#
Return type:

list[str]