returnn.datasets.meta
There are use cases in which we want to combine several datasets:
- Multimodality: features from several datasets should be provided at the same time.
Examples: multi-source translation, speech translation with source CTC loss for stability (needs both source audio and transcription).
- Multi-Task Learning: several datasets should be used alternatingly, such that at each time the dataset of the corresponding task is selected.
Examples: multi-task speech translation (either from audio or from text).
- Combination of Corpora: the training data should be split into different datasets. This allows creating a combined corpus dynamically and avoids manual concatenation/shuffling.
Examples: multi-lingual translation systems (datasets can be reused from corresponding bilingual systems).
The dataset classes MetaDataset and CombinedDataset, which perform these tasks, are implemented in returnn.datasets.meta.
- class returnn.datasets.meta.EpochWiseFilter(epochs_opts: Dict[Tuple[int, int | None], Dict[str, Any]], debug_msg_prefix: str = 'EpochWiseFilter')[source]¶
Applies some filter to the sequences (e.g. by seq length) for some epoch.
- Parameters:
epochs_opts – (ep_start, ep_end) -> epoch opts
debug_msg_prefix
- classmethod filter_epoch(opts: Dict[str, Any] | CollectionReadCheckCovered, seq_order: Sequence[int], get_seq_len: Callable[[int], int], debug_msg_prefix: str) List[int] [source]¶
- Parameters:
opts
seq_order – list of seq idxs
get_seq_len – seq idx -> len
debug_msg_prefix
- Returns:
new seq_order
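A minimal usage sketch; note that the filter option key "max_mean_len" used here is an assumption for illustration and should be checked against the actual implementation:
# Hedged sketch: restrict sequences in early epochs.
# "max_mean_len" is an assumed option key, not part of the signature above.
epoch_wise_filter = EpochWiseFilter(
    epochs_opts={
        (1, 5): {"max_mean_len": 200},  # epochs 1..5: keep only shorter sequences on average
        (6, None): {},                  # from epoch 6 on: no filtering
    },
)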
- class returnn.datasets.meta.MetaDataset(datasets: Dict[str, Dict[str, Any]], data_map: Dict[str, Tuple[str, str]], seq_list_file: str | Dict[str, str] | None = None, seq_order_control_dataset: str | None = None, seq_lens_file: str | None = None, data_dims: Dict[str, Tuple[int, int]] | None = None, data_dtypes: Dict[str, str] | None = None, window: int = 1, **kwargs)[source]¶
The MetaDataset is to be used in the case of Multimodality. Here, the datasets are expected to describe different features of the same training sequences. These features will all be available to the network at the same time.
The datasets to be combined are given via the input parameter "datasets". To define which training examples from the different datasets belong together, a "seq_list_file" in pickle format has to be created. It contains a list of sequence tags for each dataset (see example below). Note that in general each dataset type has its own tag format, e.g. for the TranslationDataset it is line-<n>, for the SprintDataset it is <corpusname>/<recording>/<segment id>. Providing a sequence list can be omitted if the set of sequence tags is the same for all datasets. When using multiple ExternSprintDataset instances, the sprint segment file can be provided as sequence list. In this case the MetaDataset assumes that the sequences with equal tag correspond to each other. This e.g. works when combining TranslationDatasets if all the text files are sentence aligned.

Example of Sequence List:
{ 'sprint': [ 'corpus/ted_1/1', 'corpus/ted_1/2', 'corpus/ted_1/3', 'corpus/ted_1/4'], 'translation': [ 'line-0', 'line-1', 'line-2', 'line-3'] }
Python dict stored in pickle file. E.g. the sequence tagged with ‘corpus/ted_1/3’ in the ‘sprint’ dataset corresponds to the sequence tagged ‘line-2’ in the ‘translation’ dataset.
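For illustration, a minimal sketch of how such a pickle file could be written (using the tag lists from the example above):
import pickle

# Entry i of each list refers to the same training example across datasets.
seq_list = {
    "sprint": ["corpus/ted_1/1", "corpus/ted_1/2", "corpus/ted_1/3", "corpus/ted_1/4"],
    "translation": ["line-0", "line-1", "line-2", "line-3"],
}
with open("seq_list.pkl", "wb") as f:
    pickle.dump(seq_list, f)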
Example of MetaDataset config:
train = {"class": "MetaDataset", "seq_list_file": "seq_list.pkl", "datasets": {"sprint": train_sprint, "translation": train_translation}, "data_map": {"data": ("sprint", "data"), "target_text_sprint": ("sprint", "orth_classes"), "source_text": ("translation", "data"), "target_text": ("translation", "classes")}, "seq_ordering": "random", "partition_epoch": 2, }
This combines a SprintDataset and a TranslationDataset. These are defined as "train_sprint" and "train_translation" separately. Note that the current implementation expects one input feature to be called “data”.

Sequence Sorting:

If the selected sequence order uses the length of the data (e.g. when using “sorted” or any kind of “laplace”), a sub-dataset has to be specified via seq_order_control_dataset. The desired sorting needs to be set as a parameter in this sub-dataset; setting seq_ordering for the MetaDataset itself will be ignored (see the sketch after the parameter list below).

- Parameters:
datasets – dataset-key -> dataset-kwargs. including keyword ‘class’ and maybe ‘files’
data_map – self-data-key -> (dataset-key, dataset-data-key). Should contain ‘data’ as key. Also defines the target-list, which is all except ‘data’.
seq_list_file – filename, pickle (.pkl) or txt (line-based seq tags), optionally gzipped (.gz). If a single file, and pickled, it can directly contain the dict dict[str,list[str]]: dataset-key -> list of sequence tags. If a dict, expect dataset-key -> filename. Can be None if the tag format is the same for all datasets; then the sequence list will be the default sequence order of the default dataset (data_map["data"][0]), or of seq_order_control_dataset. You only need it if the tag name is not the same for all datasets. It will currently not act as a filter, as the sub-dataset controls the sequence order (and thus which seqs to use).
seq_order_control_dataset – if set, this dataset will define the order for each epoch.
seq_lens_file – filename. json. dict[str,dict[str,int]], seq-tag -> data-key -> len. Use if getting sequence length from loading data is too costly.
data_dims – self-data-key -> data-dimension, len(shape) (1 ==> sparse repr). Deprecated/Only to double-check. Read from data if not specified.
data_dtypes – self-data-key -> dtype. Read from data if not specified. Deprecated, not used.
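Following the note on sequence sorting above, a hedged sketch of a config where the sorting is set in the controlling sub-dataset; "train_sprint" is assumed to be defined as before, and the remaining TranslationDataset options are elided:
# The length-based sorting is configured in the sub-dataset selected via
# "seq_order_control_dataset"; "seq_ordering" on the MetaDataset itself
# would be ignored in this case.
train_translation = {
    "class": "TranslationDataset",
    # ... further TranslationDataset options ...
    "seq_ordering": "laplace:.1000",  # example value
}
train = {
    "class": "MetaDataset",
    "datasets": {"sprint": train_sprint, "translation": train_translation},
    "data_map": {
        "data": ("sprint", "data"),
        "source_text": ("translation", "data"),
        "target_text": ("translation", "classes"),
    },
    "seq_order_control_dataset": "translation",
}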
- init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]¶
- Parameters:
epoch (int|None)
seq_list (list[str]|None) – List of sequence tags, to set a predefined order.
seq_order (list[int]|None) – List of corpus sequence indices, to set a predefined order.
- Return type:
bool
- get_current_seq_order()[source]¶
- Returns:
current seq order for the current epoch, after self.init_seq_order was called.
- Return type:
list[int]
- get_all_tags()[source]¶
- Returns:
list of all seq tags, of the whole dataset, without partition epoch
- Return type:
list[str]
- get_total_num_seqs(*, fast: bool = False) int [source]¶
- Returns:
total number of seqs, without partition epoch
- finish_epoch(*, free_resources: bool = False)[source]¶
This would get called at the end of the epoch.
- class returnn.datasets.meta.ClusteringDataset(dataset: Dict[str, Any], cluster_map_file: str, n_clusters: int, single_cluster: bool = False, **kwargs)[source]¶
This is a special case of MetaDataset, with one main subdataset, and we add a cluster-idx for each seq. We will read the cluster-map (seq-name -> cluster-idx) here directly.
- Parameters:
dataset
cluster_map_file
n_clusters
single_cluster
- class returnn.datasets.meta.ConcatDataset(datasets: Sequence[Dict[str, Any]], **kwargs)[source]¶
This concatenates multiple datasets. They are expected to provide the same data-keys and data-dimensions. It will go through the datasets always in order.
- Parameters:
datasets – list of kwargs for init_dataset
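A hedged config sketch, assuming two sub-dataset definitions "train_part1" and "train_part2" that provide the same data keys and dimensions:
train = {
    "class": "ConcatDataset",
    # The sub-datasets are always traversed in the order given here.
    "datasets": [train_part1, train_part2],
}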
- class returnn.datasets.meta.CombinedDataset(datasets: Dict[str, Dict[str, Any]], data_map: Dict[Tuple[str, str], str], sampling_sizes: None | int | Dict[str, int] = None, data_dims: Dict[str, Tuple[int, int]] | None = None, data_dtypes: Dict[str, str] | None = None, window: int = 1, **kwargs)[source]¶
The CombinedDataset is to be used in the cases of Multi-Task Learning and Combination of Corpora. Here, in general, the datasets describe different training sequences. For each sequence, only the features of the corresponding dataset will be available. Features of the other datasets are set to empty arrays. The input parameter "datasets" is the same as for the MetaDataset. The "data_map" is reversed to allow for several datasets mapping to the same feature. The "default" "seq_ordering" is to first go through all sequences of the first dataset, then the second and so on. All other sequence orderings ("random", "sorted", "laplace", …) are supported and based on this “default” ordering. There is a special sequence ordering "random_dataset", where we pick datasets at random, while keeping the sequence order within the datasets as is. To adjust the ratio of number of training examples from the different datasets in an epoch, one can use "repeat_epoch" in some of the datasets to increase their size relative to the others. Also, "partition_epoch" in some of the datasets can be used to shrink them relative to the others.

Example of CombinedDataset config:
train = {"class": "CombinedDataset", "datasets": {"sprint": train_sprint, "translation": train_translation}, "data_map": {("sprint", "data"): "data", ("sprint", "orth_classes"): "orth_classes", ("translation", "data"): "source_text", ("translation", "classes"): "orth_classes"}, "seq_ordering": "default", "partition_epoch": 2, }
This combines a SprintDataset and a TranslationDataset. These are defined as "train_sprint" and "train_translation" separately. Note that the current implementation expects one input feature to be called “data”.

Note: compared to the MetaDataset, the mapping has been inverted. We now expect (dataset-key, dataset-data-key) -> self-data-key, e.g. am-dataset:data -> am-data, am-dataset:classes -> am-classes, lm-dataset:data -> lm-data. For each sequence idx, it will select one of the given datasets, fill in the data-keys of this dataset and return empty sequences for the remaining datasets.

The default sequence ordering is to first go through all sequences of dataset 1, then dataset 2 and so on. If seq_ordering is set to ‘random_dataset’, we always pick one of the datasets at random (equally distributed over the sum of num-seqs), but still go through the sequences of a particular dataset in the order defined for it in the config (in order if not defined). For ‘sorted’ or ‘laplace’, the sequence length as provided by the datasets is used to sort all sequences jointly. Note that this overrides the sequence order of the sub-datasets (this is also the case for ‘random’).

‘partition_epoch’ of the CombinedDataset is applied to the joint sequence order for all sequences. ‘partition_epoch’ of the sub-datasets is still applied. This can be used to adjust the relative size of the datasets. (However, do not combine ‘partition_epoch’ on both levels, as this leads to an unexpected selection of sequences.) To upscale a dataset, rather than downscaling the others via ‘partition_epoch’, use the ‘repeat_epoch’ option.

Also see MetaDataset.

- Parameters:
datasets – dataset-key -> dataset-kwargs. including keyword ‘class’ and maybe ‘files’
data_map – (dataset-key, dataset-data-key) -> self-data-key. Should contain ‘data’ as key. Also defines the target-list, which is all except ‘data’.
sampling_sizes – dataset-key -> number-of-sequences. If set, the given fixed amount of sequences is taken from each dataset in every epoch (instead of using all). If an int is given, this number is used for all datasets. The sequences will be taken in the order provided by the sub-datasets, and we will loop back to the beginning of the dataset each time we reach the end. Sequence ordering will be applied after the sampling. Partition and repeat epoch are not supported when sampling. (See the sketch after this parameter list.)
data_dims – self-data-key -> data-dimension, len(shape) (1 ==> sparse repr). Deprecated/Only to double check. Read from data if not specified.
data_dtypes – self-data-key -> dtype. Read from data if not specified.
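As mentioned for "sampling_sizes" above, a hedged sketch of a config that draws a fixed number of sequences from each sub-dataset per epoch (sub-dataset definitions assumed as in the example config):
train = {
    "class": "CombinedDataset",
    "datasets": {"sprint": train_sprint, "translation": train_translation},
    "data_map": {
        ("sprint", "data"): "data",
        ("sprint", "orth_classes"): "orth_classes",
        ("translation", "data"): "source_text",
        ("translation", "classes"): "orth_classes",
    },
    # Take a fixed number of seqs from each sub-dataset in every epoch;
    # sequence ordering is applied after this sampling.
    "sampling_sizes": {"sprint": 5000, "translation": 20000},
    "seq_ordering": "random",
}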
- class returnn.datasets.meta.ConcatSeqsDataset(dataset, seq_list_file, seq_len_file, seq_tag_delim=';', remove_in_between_postfix=None, repeat_in_between_last_frame_up_to_multiple_of=None, pad_narrow_data_to_multiple_of_target_len=None, use_cache_manager=False, epoch_wise_filter=None, **kwargs)[source]¶
This takes another dataset, and concatenates one or multiple seqs.
- Parameters:
dataset (dict[str]|str|Dataset) – kwargs for init_dataset
seq_list_file (str) – filename. line-separated. seq_tag_delim.join(seq_tags) for concatenated seqs
seq_len_file (str) – file with Python dict, (single) seg_name -> len, which is used for sorting
seq_tag_delim (str)
remove_in_between_postfix (dict[str,int]|None) – data_key -> expected postfix label. e.g. {“targets”: 0}
repeat_in_between_last_frame_up_to_multiple_of (dict[str,int]|None) – data_key -> multiple of. Example: you have downsampling factor 6, i.e. ceildiv(data_len, 6) == align_len. Now it could happen that ceildiv(data_len1 + data_len2, 6) < align_len1 + align_len2. This option would repeat intermediate ending frames such that data_len1 % 6 == 0, by setting it to {“data”: 6}.
pad_narrow_data_to_multiple_of_target_len (dict[str,(str,int)]|None) – data_key -> (target_key, multiple). Similar to repeat_in_between_last_frame_up_to_multiple_of, but works for more padding/alignment schemes. Example: align_len == ceildiv(data_len - P, F) for all your sub-sequences, where P is a custom number; repeat_in_between_last_frame_up_to_multiple_of would not work because align_len != ceildiv(data_len, F). This option would pad/narrow so that align_len * F == data_len for all but the last sub-sequence, by setting it to {"data": ("classes", F)}, to ensure concat_align_len == ceildiv(concat_data_len - P, F).
use_cache_manager (bool)
epoch_wise_filter (dict[(int,int),dict]) – see EpochWiseFilter
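A hedged sketch of how the inputs could look; the file names, tags, and the sub-dataset "train_sprint" are made up for illustration:
# concat_seq_list.txt contains one line per concatenated sequence, with the
# original seq tags joined by seq_tag_delim, e.g.:
#   corpus/ted_1/1;corpus/ted_1/2
#   corpus/ted_1/3;corpus/ted_1/4
# seq_lens.py contains a Python dict with the length of every single seq, e.g.:
#   {"corpus/ted_1/1": 411, "corpus/ted_1/2": 530, ...}
train = {
    "class": "ConcatSeqsDataset",
    "dataset": train_sprint,
    "seq_list_file": "concat_seq_list.txt",
    "seq_len_file": "seq_lens.py",
    "seq_tag_delim": ";",
}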
- class returnn.datasets.meta.ChunkShuffleDataset(dataset: Dict[str, Any], chunk_shuffle_cache: int = 1000, batch_gen_batch_size: int = 5000, batch_gen_max_seqs: int = 1, batch_gen_recurrent_net: bool = True, **kwargs)[source]¶
This goes through a dataset and caches some recent chunks.
- Parameters:
dataset (dict[str]) – kwargs for init_dataset
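A hedged config sketch, assuming the wrapped sub-dataset "train_inner" is defined elsewhere and that the class can be selected by name like the other datasets here:
train = {
    "class": "ChunkShuffleDataset",
    "dataset": train_inner,
    # Number of recent chunks kept in the shuffle cache (see signature above).
    "chunk_shuffle_cache": 1000,
}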
- init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]¶
- Parameters:
seq_list (list[str]|None)
seq_order (list[int]|None)
- class returnn.datasets.meta.VariableDataset(*, get_dataset, dataset_lru_cache_size: int = 1, **kwargs)[source]¶
For every (sub)epoch, it would generate a new subdataset, based on a user-provided function.
- Parameters:
get_dataset – function (*, epoch: int, **_) -> Dict[str,Any], will be called for every sub-epoch. The dataset(s) from the previous calls are cached (dataset_lru_cache_size), and if the returned dict is the same as one of those, the dataset will not be recreated.
dataset_lru_cache_size
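A hedged sketch of such a get_dataset function; the inner HDFDataset and its file naming scheme are only an example:
# Return a (possibly different) sub-dataset definition for every sub-epoch.
def get_dataset(*, epoch: int, **_other):
    return {
        "class": "HDFDataset",
        "files": ["train.%03i.hdf" % (epoch % 10)],  # made-up naming scheme
    }

train = {
    "class": "VariableDataset",
    "get_dataset": get_dataset,
}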
- get_seq_length(sorted_seq_idx: int) NumbersDict [source]¶
seq len
- class returnn.datasets.meta.MultiEpochDataset(*, dataset: Dict[str, Any], multi_epoch: int, **kwargs)[source]¶
It wraps some dataset, where one outer epoch corresponds to multiple epochs in the inner wrapped dataset.
This can be useful when the inner dataset uses partition_epoch, and we want to cover the whole full epoch.
One specific example is when the data is distributed over multiple files, and for reasonable performance you want to have the data copied to the local disk, but all data together is too large to fit on the local disk. Then DistributeFilesDataset is the logical choice, which solves these issues. However, you must use some partition_epoch in DistributeFilesDataset such that it will not load all data at once. To cover all the data, you can use this MultiEpochDataset and set multi_epoch = partition_epoch of the inner dataset (see the sketch after the parameter list below).

- Parameters:
dataset – the inner wrapped dataset
multi_epoch – how many inner epochs correspond to one outer epoch
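Following the example above, a hedged sketch where the inner dataset uses partition_epoch and multi_epoch is set to the same value, so that one outer epoch covers the whole inner dataset ("train_inner" is an assumed inner dataset definition, e.g. a DistributeFilesDataset):
inner_partition_epoch = 20
train = {
    "class": "MultiEpochDataset",
    # One outer epoch iterates over inner_partition_epoch inner epochs,
    # i.e. over the whole inner dataset.
    "dataset": {**train_inner, "partition_epoch": inner_partition_epoch},
    "multi_epoch": inner_partition_epoch,
}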
- class returnn.datasets.meta.AnythingDataset(*, data_keys: Dict[str, Dict[str, Any]], **kwargs)[source]¶
An infinite dataset, creating dummy (zero) data on the fly, given the data-keys and their shapes.
When this is used inside a MetaDataset, controlled by the seq list from another dataset, it will just take over whatever seq list is given.

- Parameters:
data_keys – similar to extern_data; defines shape, dtype, sparse, dim, etc. (See the sketch after this parameter list.)
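A hedged sketch of a possible data_keys definition; the exact option names accepted inside data_keys are an assumption based on the extern_data analogy:
dataset = {
    "class": "AnythingDataset",
    "data_keys": {
        # dense float features of shape (time, 80)
        "data": {"shape": (None, 80), "dtype": "float32"},
        # sparse integer labels with vocabulary size 10025
        "classes": {"dim": 10025, "sparse": True, "dtype": "int32"},
    },
}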
- get_seq_length(sorted_seq_idx: int) NumbersDict [source]¶
seq len