MetaDataset

class MetaDataset.ChunkShuffleDataset(dataset, chunk_shuffle_cache=1000, batch_gen_batch_size=5000, batch_gen_max_seqs=1, batch_gen_recurrent_net=True, **kwargs)[source]

This goes through a dataset and caches some recent chunks.

Parameters: dataset (dict[str]) – kwargs for init_dataset
get_target_list()[source]
init_seq_order(epoch=None, seq_list=None)[source]
Parameters: seq_list (list[str] | None) – In case we want to set a predefined order.
is_less_than_num_seqs(seq_idx)[source]
Return type: bool

Returns: whether seq_idx < num_seqs. If num_seqs is not known in advance, it will wait until it knows that seq_idx is behind the end or that we have the seq.
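The waiting behaviour of is_less_than_num_seqs can be illustrated with a toy class. This is only a sketch of the idea, not the library's implementation: LazySeqs and its attributes are hypothetical names, and the "dataset" is just a Python iterator whose length is unknown until it is exhausted.

```python
class LazySeqs:
    """Toy stand-in (hypothetical): num_seqs is only known once the source is exhausted."""

    def __init__(self, source):
        self._source = iter(source)
        self._loaded = []       # sequences we already have
        self._finished = False  # True once the end of the source was reached

    def is_less_than_num_seqs(self, seq_idx):
        # Keep loading until we either have seq_idx or know it is behind the end.
        while len(self._loaded) <= seq_idx and not self._finished:
            try:
                self._loaded.append(next(self._source))
            except StopIteration:
                self._finished = True
        return seq_idx < len(self._loaded)


seqs = LazySeqs(["seq-a", "seq-b"])
has_second = seqs.is_less_than_num_seqs(1)  # loads both sequences
has_third = seqs.is_less_than_num_seqs(2)   # hits the end of the source
```

The call blocks only as long as it takes to load up to seq_idx, so callers can iterate without knowing the total number of sequences beforehand.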

class MetaDataset.ClusteringDataset(dataset, cluster_map_file, n_clusters, single_cluster=False, **kwargs)[source]

This is a special case of MetaDataset, with one main subdataset, and we add a cluster-idx for each seq. We will read the cluster-map (seq-name -> cluster-idx) here directly.
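As a rough sketch of what adding a cluster-idx per seq means, the snippet below uses an in-memory dict in place of the cluster-map file. The on-disk format of cluster_map_file is not described here, and the seq names, the helper add_cluster_idx, and the key name "cluster_idx" are all assumptions made for illustration.

```python
# Hypothetical cluster map: seq-name -> cluster-idx.
cluster_map = {"seq-0001": 0, "seq-0002": 2, "seq-0003": 1}
n_clusters = 3


def add_cluster_idx(seq_name, data):
    """Return a copy of the seq's data with an extra cluster-idx entry."""
    cluster_idx = cluster_map[seq_name]
    if not 0 <= cluster_idx < n_clusters:
        raise ValueError("cluster-idx out of range for seq %r" % seq_name)
    out = dict(data)
    out["cluster_idx"] = cluster_idx  # key name is an assumption
    return out


example = add_cluster_idx("seq-0002", {"data": [0.1, 0.2]})
```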

get_data_dtype(key)[source]
get_data_keys()[source]
get_tag(seq_idx)[source]
init_seq_order(epoch=None, seq_list=None)[source]
is_less_than_num_seqs(n)[source]
num_seqs[source]
class MetaDataset.CombinedDataset(datasets, data_map, data_dims, data_dtypes=None, window=1, **kwargs)[source]

This combines multiple different datasets, which provide different data-sources. E.g. one can provide am-dataset with data:acoustic-features -> classes:characters (acoustic model train data), and lm-dataset provides just data:characters (language model train data).

Note: The mapping has been inverted. We now expect (dataset-key, dataset-data-key) -> self-data-key, e.g. am-dataset:data -> am-data, am-dataset:classes -> am-classes, lm-dataset:data -> lm-data.

For each sequence idx, it will select one of the given datasets, fill in the data-keys of this dataset, and return empty sequences for the remaining datasets. The selection of the dataset is random and equally distributed over the sum of num-seqs.
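One possible reading of "random and equally distributed over the sum of num-seqs" is that every sequence of every dataset is equally likely, so a dataset is chosen with probability proportional to its num-seqs. The helper below sketches that reading. pick_dataset is a hypothetical name, it samples with replacement for simplicity, and it is not claimed to match the actual implementation.

```python
import random


def pick_dataset(num_seqs_per_dataset, rng):
    """Pick a dataset key with probability proportional to its num-seqs.

    Requires a positive total number of sequences.
    """
    total = sum(num_seqs_per_dataset.values())
    pos = rng.randrange(total)  # uniform over all sequences of all datasets
    for key, n in num_seqs_per_dataset.items():
        if pos < n:
            return key
        pos -= n
    raise AssertionError("unreachable: pos is always below the total")


rng = random.Random(0)
num_seqs = {"am-dataset": 900, "lm-dataset": 100}  # made-up sizes
picks = [pick_dataset(num_seqs, rng) for _ in range(1000)]
```

With these made-up sizes, about nine in ten picks should fall on am-dataset.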

Parameters:
  • datasets (dict[str,dict[str]]) – dataset-key -> dataset-kwargs, including the keyword ‘class’ and maybe ‘files’.
  • data_map (dict[(str,str),str]) – (dataset-key, dataset-data-key) -> self-data-key. Should contain ‘data’ as key. Also defines the target-list, which is all except ‘data’.
  • data_dims (dict[str,(int,int)]) – self-data-key -> (data-dimension, len(shape)), where len(shape) == 1 means sparse representation.
  • data_dtypes (dict[str,str]) – self-data-key -> dtype. Determined automatically if not specified.
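A minimal sketch of how these arguments might fit together, written as a plain Python dict. The self-data-keys mirror the am/lm example in the class description. The sub-dataset class name "HDFDataset", the file names, and the dimensions are illustrative assumptions, not taken from this page.

```python
# Hypothetical config sketch; sub-dataset classes, files and dims are assumptions.
combined_dataset = {
    "class": "CombinedDataset",
    # dataset-key -> dataset-kwargs
    "datasets": {
        "am-dataset": {"class": "HDFDataset", "files": ["am-train.hdf"]},
        "lm-dataset": {"class": "HDFDataset", "files": ["lm-train.hdf"]},
    },
    # (dataset-key, dataset-data-key) -> self-data-key
    "data_map": {
        ("am-dataset", "data"): "am-data",
        ("am-dataset", "classes"): "am-classes",
        ("lm-dataset", "data"): "lm-data",
    },
    # self-data-key -> (data-dimension, len(shape)); len(shape) == 1 means sparse
    "data_dims": {
        "am-data": (40, 2),
        "am-classes": (30, 1),
        "lm-data": (30, 1),
    },
}

# Sanity check: every self-data-key from data_map has an entry in data_dims.
self_keys = sorted(combined_dataset["data_map"].values())
missing_dims = [k for k in self_keys if k not in combined_dataset["data_dims"]]
```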
get_data_dim(key)[source]
get_data_dtype(key)[source]
get_target_list()[source]
init_seq_order(epoch=None, seq_list=None)[source]
is_less_than_num_seqs(n)[source]
class MetaDataset.ConcatDataset(datasets, **kwargs)[source]

This concatenates multiple datasets. They are expected to provide the same data-keys and data-dimensions. It always goes through the datasets in the given order.

Parameters: datasets (list[dict[str]]) – list of kwargs for init_dataset
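The following sketch shows a possible datasets argument and how a global sequence index could map onto the concatenated datasets when they are traversed in order. The sub-dataset class name "HDFDataset", the file names, and the helper concat_index are assumptions for illustration only.

```python
# Hypothetical config sketch; sub-dataset class and file names are assumptions.
concat_dataset = {
    "class": "ConcatDataset",
    "datasets": [
        {"class": "HDFDataset", "files": ["part1.hdf"]},
        {"class": "HDFDataset", "files": ["part2.hdf"]},
    ],
}


def concat_index(seq_idx, num_seqs_per_dataset):
    """Map a global seq_idx to (dataset index, local seq index), in dataset order."""
    for dataset_idx, n in enumerate(num_seqs_per_dataset):
        if seq_idx < n:
            return dataset_idx, seq_idx
        seq_idx -= n
    raise IndexError("seq_idx is behind the end")


# With 3 sequences in the first dataset and 2 in the second:
first = concat_index(0, [3, 2])
fourth = concat_index(3, [3, 2])
```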
get_target_list()[source]
init_seq_order(epoch=None, seq_list=None)[source]
Parameters: seq_list (list[str] | None) – In case we want to set a predefined order.
num_seqs[source]
class MetaDataset.MetaDataset(seq_list_file, seq_lens_file, datasets, data_map, data_dims, data_dtypes=None, window=1, **kwargs)[source]

This wraps around one or multiple datasets and might provide extra information. Every dataset is expected to provide the same sequences, where the sequence list is given by a file.

Parameters:
  • seq_list_file (str) – filename. Line-separated list of seq-tags.
  • seq_lens_file (str) – filename. JSON of type dict[str,dict[str,int]], seq-tag -> data-key -> len.
  • datasets (dict[str,dict[str]]) – dataset-key -> dataset-kwargs, including the keyword ‘class’ and maybe ‘files’.
  • data_map (dict[str,(str,str)]) – self-data-key -> (dataset-key, dataset-data-key). Should contain ‘data’ as key. Also defines the target-list, which is all except ‘data’.
  • data_dims (dict[str,(int,int)]) – self-data-key -> (data-dimension, len(shape)), where len(shape) == 1 means sparse representation.
  • data_dtypes (dict[str,str]) – self-data-key -> dtype. Determined automatically if not specified.
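A minimal sketch of the two input files and a matching set of arguments, assuming the layouts described in the parameter list. All seq-tags, lengths, file names, dimensions, and the sub-dataset class name "HDFDataset" are made up for illustration.

```python
import json

# Content of seq_lens_file: seq-tag -> data-key -> len (values are made up).
seq_lens = {
    "seq-0001": {"data": 120, "classes": 120},
    "seq-0002": {"data": 87, "classes": 87},
}
seq_lens_json = json.dumps(seq_lens)

# Content of seq_list_file: one seq-tag per line.
seq_list_text = "\n".join(seq_lens) + "\n"

# Hypothetical config sketch; sub-dataset classes, files and dims are assumptions.
meta_dataset = {
    "class": "MetaDataset",
    "seq_list_file": "seq_list.txt",
    "seq_lens_file": "seq_lens.json",
    "datasets": {
        "features": {"class": "HDFDataset", "files": ["features.hdf"]},
        "alignment": {"class": "HDFDataset", "files": ["alignment.hdf"]},
    },
    # self-data-key -> (dataset-key, dataset-data-key)
    "data_map": {
        "data": ("features", "data"),
        "classes": ("alignment", "classes"),
    },
    # self-data-key -> (data-dimension, len(shape)); len(shape) == 1 means sparse
    "data_dims": {"data": (40, 2), "classes": (30, 1)},
}

# The target-list is every self-data-key except 'data'.
target_list = sorted(k for k in meta_dataset["data_map"] if k != "data")
```

Note that the direction of data_map here is the opposite of the one used by CombinedDataset.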
get_data_dtype(key)[source]
get_seq_length(sorted_seq_idx)[source]
get_tag(sorted_seq_idx)[source]
get_target_list()[source]
init_seq_order(epoch=None, seq_list=None)[source]