MetaDataset

There are use cases in which we want to combine several datasets:

  • Multimodality: features from several datasets should be provided at the same time
    • Examples: multi-source translation, speech translation with source CTC loss for stability (needs both source audio and transcription)
  • Multi-Task Learning: several datasets should be used in alternation, so that at each step the dataset of the corresponding task is selected
    • Examples: multi-task speech translation (either from audio or from text)
  • Combination of Corpora: the training data should be split into different datasets. This allows creating a combined corpus dynamically and avoids manual concatenation/shuffling.
    • Examples: multi-lingual translation systems (datasets can be reused from corresponding bilingual systems)

The dataset classes MetaDataset and CombinedDataset which perform these tasks are implemented in MetaDataset.py.

class MetaDataset.EpochWiseFilter(epochs_opts, debug_msg_prefix='EpochWiseFilter')[source]

Applies some filter to the sequences (e.g. by seq length) for some epoch.

Parameters:
  • epochs_opts (dict[(int,int|None),dict[str]]) – (ep_start, ep_end) -> epoch opts
  • debug_msg_prefix (str) –
classmethod filter_epoch(opts, seq_order, get_seq_len, debug_msg_prefix)[source]
Parameters:
  • opts (dict[str]|Util.CollectionReadCheckCovered) –
  • seq_order (list[int]) – list of seq idxs
  • get_seq_len (((int)->int)) – seq idx -> len
  • debug_msg_prefix (str) –
Returns:

new seq_order

Return type:

list[int]

filter(self, epoch, seq_order, get_seq_len)[source]
Parameters:
  • epoch (int|None) –
  • seq_order (list[int]) – list of seq idxs
  • get_seq_len (((int)->int)) – seq idx -> len
Returns:

new seq_order
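
As a rough illustration of the epochs_opts layout, the following sketch looks up the options whose (ep_start, ep_end) range contains a given epoch and then drops sequences that are too long. It is not the EpochWiseFilter implementation: the helper names and the "max_len" option are hypothetical, and treating the range as inclusive with None meaning "no upper bound" is an assumption.

```python
def select_epoch_opts(epochs_opts, epoch):
    """Return the opts dicts whose (ep_start, ep_end) range contains `epoch`.

    Assumption: ranges are inclusive, and ep_end=None means "no upper bound".
    """
    selected = []
    for (ep_start, ep_end), opts in sorted(epochs_opts.items(), key=lambda kv: kv[0][0]):
        if epoch >= ep_start and (ep_end is None or epoch <= ep_end):
            selected.append(opts)
    return selected


def filter_by_max_len(seq_order, get_seq_len, max_len):
    """Keep only the seq idxs whose length does not exceed `max_len` (hypothetical option)."""
    return [seq_idx for seq_idx in seq_order if get_seq_len(seq_idx) <= max_len]


# Hypothetical example: restrict epochs 1-5 to short sequences.
epochs_opts = {(1, 5): {"max_len": 10}, (6, None): {}}
seq_lens = {0: 4, 1: 25, 2: 9, 3: 12}  # seq idx -> len

seq_order = [3, 0, 2, 1]
for opts in select_epoch_opts(epochs_opts, epoch=2):
    if "max_len" in opts:
        seq_order = filter_by_max_len(seq_order, seq_lens.__getitem__, opts["max_len"])
```

The relative order of the remaining sequences is preserved, so the filter can be applied on top of any sequence ordering.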

class MetaDataset.MetaDataset(datasets, data_map, seq_list_file=None, seq_order_control_dataset=None, seq_lens_file=None, data_dims=None, data_dtypes=None, window=1, **kwargs)[source]

The MetaDataset is to be used in the case of Multimodality. Here, the datasets are expected to describe different features of the same training sequences. These features will all be available to the network at the same time.

The datasets to be combined are given via the input parameter "datasets". To define which training examples from the different datasets belong together, a "seq_list_file" in pickle format has to be created. It contains a list of sequence tags for each dataset (see example below). Note that in general each dataset type has its own tag format, e.g. for the TranslationDataset it is line-<n>, for the SprintDataset it is <corpusname>/<recording>/<segment id>.

The sequence list can be omitted if the set of sequence tags is the same for all datasets. When using multiple ExternSprintDataset instances, the sprint segment file can be provided as sequence list. In this case the MetaDataset assumes that the sequences with equal tag correspond to each other. This works e.g. when combining TranslationDatasets if all the text files are sentence-aligned.

Example of Sequence List:

{ 'sprint': [
    'corpus/ted_1/1',
    'corpus/ted_1/2',
    'corpus/ted_1/3',
    'corpus/ted_1/4'],
'translation': [
    'line-0',
    'line-1',
    'line-2',
    'line-3']
}

This is a Python dict stored in a pickle file. For example, the sequence tagged ‘corpus/ted_1/3’ in the ‘sprint’ dataset corresponds to the sequence tagged ‘line-2’ in the ‘translation’ dataset.
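
A sequence list like the one above could be written with the standard pickle module. This is a minimal sketch: the output path is a temporary placeholder, and the check that all tag lists have the same length is an assumption made here so that the position-wise pairing is well defined.

```python
import os
import pickle
import tempfile

# Sequence tags per dataset key; entries at the same position belong together.
seq_list = {
    "sprint": ["corpus/ted_1/%i" % i for i in range(1, 5)],
    "translation": ["line-%i" % i for i in range(4)],
}

# Assumption of this sketch: every dataset lists the same number of tags.
assert len(set(len(tags) for tags in seq_list.values())) == 1

# Placeholder location; in a real setup this is the path given as "seq_list_file".
seq_list_path = os.path.join(tempfile.mkdtemp(), "seq_list.pkl")
with open(seq_list_path, "wb") as f:
    pickle.dump(seq_list, f)

# Read the file back and pair up the tags position by position.
with open(seq_list_path, "rb") as f:
    loaded = pickle.load(f)
pairs = list(zip(loaded["sprint"], loaded["translation"]))
```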

Example of MetaDataset config:

train = {"class": "MetaDataset", "seq_list_file": "seq_list.pkl",
         "datasets": {"sprint": train_sprint, "translation": train_translation},
         "data_map": {"data": ("sprint", "data"),
                      "target_text_sprint": ("sprint", "orth_classes"),
                      "source_text": ("translation", "data"),
                      "target_text": ("translation", "classes")},
         "seq_ordering": "random",
         "partition_epoch": 2,
}

This combines a SprintDataset and a TranslationDataset. These are defined as "train_sprint" and "train_translation" separately. Note that the current implementation expects one input feature to be called “data”.
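
To make the direction of the "data_map" explicit, the sketch below resolves a key of the combined dataset to its source and derives the list of targets as all keys except "data". The helper functions are hypothetical and only mirror the description of the data_map parameter; returning the targets sorted is a choice made here for reproducibility.

```python
data_map = {
    "data": ("sprint", "data"),
    "target_text_sprint": ("sprint", "orth_classes"),
    "source_text": ("translation", "data"),
    "target_text": ("translation", "classes"),
}


def resolve(data_map, self_data_key):
    """Return (dataset-key, dataset-data-key) for a key of the combined dataset."""
    return data_map[self_data_key]


def target_list(data_map):
    """All self-data-keys except 'data' (sorted here only for a stable result)."""
    return sorted(key for key in data_map if key != "data")


source_of_text = resolve(data_map, "source_text")
targets = target_list(data_map)
```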

Parameters:
  • datasets (dict[str,dict[str]]) – dataset-key -> dataset-kwargs. including keyword ‘class’ and maybe ‘files’
  • data_map (dict[str,(str,str)]) – self-data-key -> (dataset-key, dataset-data-key). Should contain ‘data’ as key. Also defines the target-list, which is all except ‘data’.
  • seq_list_file (str|None) –

    filename. pickle. dict[str,list[str]], dataset-key -> list of sequence tags. Can be None if tag format is the same for all datasets.

    In that case, the sequence list will be the default sequence order of the default dataset (data_map["data"][0]), or of seq_order_control_dataset if set.
  • seq_order_control_dataset (str|None) – if set, this dataset will define the order for each epoch.
  • seq_lens_file (str|None) – filename. json. dict[str,dict[str,int]], seq-tag -> data-key -> len. Use if getting sequence length from loading data is too costly.
  • data_dims (dict[str,(int,int)]) – self-data-key -> data-dimension, len(shape) (1 ==> sparse repr). Deprecated/Only to double check. Read from data if not specified.
  • data_dtypes (dict[str,str]) – self-data-key -> dtype. Read from data if not specified.
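
A file for "seq_lens_file" could be produced as in the following sketch, which follows the documented layout seq-tag -> data-key -> len. The tags, data-keys and lengths are made up for illustration, and which dataset's tag format the file has to use is not stated here, so treat that choice as an assumption.

```python
import json
import os
import tempfile

# seq-tag -> data-key -> len; all values here are invented for illustration.
seq_lens = {
    "corpus/ted_1/1": {"data": 312, "target_text": 14},
    "corpus/ted_1/2": {"data": 187, "target_text": 9},
}

# Placeholder location; in a real setup this is the path given as "seq_lens_file".
seq_lens_path = os.path.join(tempfile.mkdtemp(), "seq_lens.json")
with open(seq_lens_path, "w") as f:
    json.dump(seq_lens, f)

# Reading it back yields the same nested dict.
with open(seq_lens_path) as f:
    loaded_lens = json.load(f)
```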
init_seq_order(self, epoch=None, seq_list=None)[source]
Parameters:
  • epoch (int|None) –
  • seq_list (list[str]|None) –
Return type:

bool

get_seq_length(self, sorted_seq_idx)[source]
Parameters:sorted_seq_idx (int) –
Return type:NumbersDict
get_tag(self, sorted_seq_idx)[source]
Parameters:sorted_seq_idx (int) –
Return type:str
get_target_list(self)[source]
Return type:list[str]
get_data_shape(self, data_key)[source]
Parameters:data_key (str) –
Return type:list[int]
get_data_dtype(self, key)[source]
Parameters:key (str) –
Return type:str
class MetaDataset.ClusteringDataset(dataset, cluster_map_file, n_clusters, single_cluster=False, **kwargs)[source]

This is a special case of MetaDataset, with one main subdataset, and we add a cluster-idx for each seq. We will read the cluster-map (seq-name -> cluster-idx) here directly.

Parameters:
  • dataset (dict[str]) –
  • cluster_map_file
  • n_clusters (int) –
  • single_cluster
init_seq_order(self, epoch=None, seq_list=None)[source]
Parameters:
  • epoch (int) –
  • seq_list (list[str]|int) –
Return type:

bool

get_data_keys(self)[source]
Return type:list[str]
get_data_dtype(self, key)[source]
Parameters:key (str) –
Return type:str
num_seqs[source]
Return type:int
is_less_than_num_seqs(self, n)[source]
Parameters:n (int) –
Return type:bool
get_tag(self, seq_idx)[source]
Parameters:seq_idx (int) –
Return type:str
class MetaDataset.ConcatDataset(datasets, **kwargs)[source]

This concatenates multiple datasets. They are expected to provide the same data-keys and data-dimensions. It will go through the datasets always in order.
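
Going through the datasets in order means that a global sequence index can be mapped to a dataset and a local index via cumulative sizes. The function below is a hypothetical sketch of that bookkeeping, not the ConcatDataset code.

```python
import bisect
from itertools import accumulate


def locate(num_seqs_per_dataset, seq_idx):
    """Map a global seq idx to (dataset idx, seq idx within that dataset)."""
    ends = list(accumulate(num_seqs_per_dataset))  # exclusive end offset of each dataset
    if not ends or not 0 <= seq_idx < ends[-1]:
        raise IndexError("seq_idx %i out of range" % seq_idx)
    # First dataset whose end offset lies beyond seq_idx.
    dataset_idx = bisect.bisect_right(ends, seq_idx)
    start = ends[dataset_idx - 1] if dataset_idx > 0 else 0
    return dataset_idx, seq_idx - start


sizes = [3, 2, 4]  # hypothetical number of sequences per dataset
locations = [locate(sizes, i) for i in range(sum(sizes))]
```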

Parameters:datasets (list[dict[str]]) – list of kwargs for init_dataset
init_seq_order(self, epoch=None, seq_list=None)[source]
Parameters:seq_list (list[str]|None) – In case we want to set a predefined order.
num_seqs[source]
Return type:int
get_target_list(self)[source]
Return type:list[str]
class MetaDataset.CombinedDataset(datasets, data_map, data_dims=None, data_dtypes=None, window=1, **kwargs)[source]

The CombinedDataset is to be used in the cases of Multi-Task Learning and Combination of Corpora. Here, in general, the datasets describe different training sequences. For each sequence, only the features of the corresponding dataset will be available. Features of the other datasets are set to empty arrays.

The input parameter "datasets" is the same as for the MetaDataset. The "data_map" is reversed to allow for several datasets mapping to the same feature.

The "default" "seq_ordering" is to first go through all sequences of the first dataset, then the second and so on. All other sequence orderings ("random", "sorted", "laplace", …) are supported and based on this “default” ordering. There is a special sequence ordering "random_dataset", where we pick datasets at random, while keeping the sequence order within the datasets as is.

To adjust the ratio of number of training examples from the different datasets in an epoch, one can use "repeat_epoch" in some of the datasets to increase their size relative to the others. Also, "partition_epoch" in some of the datasets can be used to shrink them relative to the others.

Example of CombinedDataset config:

train = {"class": "CombinedDataset",
         "datasets": {"sprint": train_sprint, "translation": train_translation},
         "data_map": {("sprint", "data"): "data",
                      ("sprint", "orth_classes"): "orth_classes",
                      ("translation", "data"): "source_text",
                      ("translation", "classes"): "orth_classes"},
         "seq_ordering": "default",
         "partition_epoch": 2,
}

This combines a SprintDataset and a TranslationDataset. These are defined as "train_sprint" and "train_translation" separately. Note that the current implementation expects one input feature to be called “data”.
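
The effect of the reversed "data_map" can be sketched as follows: for a sequence taken from one dataset, the self-data-keys fed by that dataset receive its features, while all other keys stay empty. The combine function is hypothetical, and plain lists stand in for the arrays used in practice.

```python
data_map = {
    ("sprint", "data"): "data",
    ("sprint", "orth_classes"): "orth_classes",
    ("translation", "data"): "source_text",
    ("translation", "classes"): "orth_classes",
}


def combine(data_map, dataset_key, dataset_features):
    """Build the feature dict for one sequence taken from `dataset_key`.

    Self-data-keys fed by that dataset get its features; all others stay empty.
    """
    combined = {self_key: [] for self_key in data_map.values()}
    for (ds_key, ds_data_key), self_key in data_map.items():
        if ds_key == dataset_key:
            combined[self_key] = dataset_features[ds_data_key]
    return combined


# A sequence from the "translation" dataset (feature values invented).
example = combine(data_map, "translation", {"data": [5, 8, 2], "classes": [7, 1]})
```

Because both datasets map to "orth_classes", that key is filled regardless of which dataset the sequence comes from, whereas "data" and "source_text" are each filled by only one of them.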

Note: The mapping has been inverted. We now expect (dataset-key, dataset-data-key) -> self-data-key, e.g. am-dataset:data -> am-data, am-dataset:classes -> am-classes, lm-dataset:data -> lm-data.

For each sequence idx, it will select one of the given datasets, fill in the data-keys of this dataset and return empty sequences for the remaining datasets.

The default sequence ordering is to first go through all sequences of dataset 1, then dataset 2 and so on. If seq_ordering is set to ‘random_dataset’, we always pick one of the datasets at random (equally distributed over the sum of num-seqs), but still go through the sequences of a particular dataset in the order defined for it in the config (in order if not defined). For ‘sorted’ or ‘laplace’ the sequence length as provided by the datasets is used to sort all sequences jointly. Note that this overrides the sequence order of the sub-datasets (also the case for ‘random’).

‘partition_epoch’ of the CombinedDataset is applied to the joint sequence order for all sequences. ‘partition_epoch’ of the sub-datasets is still applied. This can be used to adjust the relative size of the datasets. (However, do not combine ‘partition_epoch’ on both levels, as this leads to an unexpected selection of sequences.) To upscale a dataset, rather than downscaling the others via ‘partition_epoch’, use the ‘repeat_epoch’ option.
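
The difference between the ‘default’ and ‘random_dataset’ orderings can be illustrated with a small sketch. Both functions are hypothetical; shuffling one entry per sequence is just one way to pick datasets in proportion to their number of sequences, and the actual implementation may differ.

```python
import random


def default_order(num_seqs):
    """All sequences of dataset 0, then dataset 1, ... as (dataset idx, seq idx) pairs."""
    return [(ds, i) for ds, n in enumerate(num_seqs) for i in range(n)]


def random_dataset_order(num_seqs, rng):
    """Pick the dataset at random per position, keep the order within each dataset.

    The list of picks holds each dataset idx once per sequence, so larger
    datasets are proportionally more likely at every position.
    """
    picks = [ds for ds, n in enumerate(num_seqs) for _ in range(n)]
    rng.shuffle(picks)
    next_idx = [0] * len(num_seqs)  # next unread sequence of each dataset
    order = []
    for ds in picks:
        order.append((ds, next_idx[ds]))
        next_idx[ds] += 1
    return order


order = random_dataset_order([3, 2], random.Random(42))
```

Every sequence appears exactly once in either ordering; only the interleaving of the datasets changes.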

Also see MetaDataset.

Parameters:
  • datasets (dict[str,dict[str]]) – dataset-key -> dataset-kwargs. including keyword ‘class’ and maybe ‘files’
  • data_map (dict[(str,str),str]) – (dataset-key, dataset-data-key) -> self-data-key. Should contain ‘data’ as a self-data-key (i.e. as a value, since the mapping is inverted). Also defines the target-list, which is all except ‘data’.
  • data_dims (dict[str,(int,int)]) – self-data-key -> data-dimension, len(shape) (1 ==> sparse repr). Deprecated/Only to double check. Read from data if not specified.
  • data_dtypes (dict[str,str]) – self-data-key -> dtype. Read from data if not specified.
init_seq_order(self, epoch=None, seq_list=None)[source]
Parameters:
  • epoch (int) –
  • seq_list (list[str]|None) –
Return type:

bool

is_less_than_num_seqs(self, n)[source]
Parameters:n (int) –
Return type:bool
get_target_list(self)[source]
Return type:list[str]
get_data_dtype(self, key)[source]
Parameters:key (str) –
Return type:str
get_data_dim(self, key)[source]
Parameters:key (str) –
Return type:int
class MetaDataset.ConcatSeqsDataset(dataset, seq_list_file, seq_len_file, seq_tag_delim=';', remove_in_between_postfix=None, use_cache_manager=False, epoch_wise_filter=None, **kwargs)[source]

This takes another dataset, and concatenates one or multiple seqs.

Parameters:
  • dataset (dict[str]) – kwargs for init_dataset
  • seq_list_file (str) – filename. line-separated. seq_tag_delim.join(seq_tags) for concatenated seqs
  • seq_len_file (str) – file with Python dict, (single) seg_name -> len, which is used for sorting
  • seq_tag_delim (str) –
  • use_cache_manager (bool) –
  • epoch_wise_filter (dict[(int,int),dict]) – see EpochWiseFilter
  • remove_in_between_postfix (dict[str,int]|None) – data_key -> expected postfix label. e.g. {“targets”: 0}
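
The two input files could be prepared as in the sketch below. The tags and lengths are invented, the file names are placeholders, and storing the length dict as a Python literal via repr is an assumption about what "file with Python dict" means. Summing the single lengths also ignores any shortening caused by remove_in_between_postfix.

```python
import ast
import os
import tempfile

seq_tag_delim = ";"
# Each inner list is one concatenated sequence; tags are made up for illustration.
concatenated = [["line-0", "line-1"], ["line-2"], ["line-3", "line-4", "line-5"]]
single_seq_lens = {"line-%i" % i: 10 + i for i in range(6)}

out_dir = tempfile.mkdtemp()
seq_list_path = os.path.join(out_dir, "seq_list.txt")
seq_len_path = os.path.join(out_dir, "seq_lens.txt")

# seq_list_file: one concatenated sequence per line, tags joined by the delimiter.
with open(seq_list_path, "w") as f:
    for seq_tags in concatenated:
        f.write(seq_tag_delim.join(seq_tags) + "\n")

# seq_len_file: (single) seq tag -> len, stored as a Python literal (assumption).
with open(seq_len_path, "w") as f:
    f.write(repr(single_seq_lens) + "\n")

# Read both files back and estimate the length of each concatenated sequence.
with open(seq_list_path) as f:
    lines = f.read().splitlines()
with open(seq_len_path) as f:
    lens = ast.literal_eval(f.read())
concat_lens = [sum(lens[tag] for tag in line.split(seq_tag_delim)) for line in lines]
```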
init_seq_order(self, epoch=None, seq_list=None)[source]
Parameters:
  • epoch (int) –
  • seq_list (list[str]|None) –
Return type:

bool

get_data_keys(self)[source]
Return type:list[str]
get_target_list(self)[source]
Return type:list[str]
get_data_dtype(self, key)[source]
Parameters:key (str) –
Return type:str
get_data_dim(self, key)[source]
Parameters:key (str) –
Return type:int
is_data_sparse(self, key)[source]
Parameters:key (str) –
Return type:bool
get_data_shape(self, key)[source]
Parameters:key (str) –
Return type:list[int]
class MetaDataset.ChunkShuffleDataset(dataset, chunk_shuffle_cache=1000, batch_gen_batch_size=5000, batch_gen_max_seqs=1, batch_gen_recurrent_net=True, **kwargs)[source]

This goes through a dataset and caches some recent chunks.

Parameters:dataset (dict[str]) – kwargs for init_dataset
init_seq_order(self, epoch=None, seq_list=None)[source]
Parameters:seq_list (list[str]|None) – In case we want to set a predefined order.
is_less_than_num_seqs(self, seq_idx)[source]
Return type:bool

Returns: whether seq_idx < num_seqs. In case num_seqs is not known in advance, it will wait until it knows that seq_idx is behind the end or that we have the seq.

get_target_list(self)[source]
Return type:list[str]