returnn.datasets.generating
Some datasets for artificially generated data.
- class returnn.datasets.generating.GeneratingDataset(input_dim, output_dim, num_seqs=inf, **kwargs)[source]#
Some base class for datasets with artificially generated data.
- Parameters:
input_dim (int|None) –
output_dim (int|dict[str,int|(int,int)|dict]) – if dict, can specify all data-keys
num_seqs (int|float) –
- init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]#
- Parameters:
seq_list (list[str]|None) – predefined order via tags, doesn’t make sense here
seq_order (list[int]|None) – predefined order via indices, doesn’t make sense here
This is called when we start a new epoch, or at initialization.
- have_get_corpus_seq() → bool[source]#
- Returns:
whether we have get_corpus_seq()
- get_corpus_seq(corpus_seq_idx: int) → DatasetSeq[source]#
- Parameters:
corpus_seq_idx –
- Returns:
seq
- generate_seq(seq_idx: int) → DatasetSeq[source]#
This assumes that self.random is already initialized and seeded to something deterministic for the given seq_idx and epoch.
- Parameters:
seq_idx – corpus seq idx
- have_corpus_seq_idx()[source]#
- Returns:
whether we have get_corpus_seq_idx()
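Subclasses typically only need to implement generate_seq(). A minimal sketch of a hypothetical subclass follows; the class name, the dimensions, and the "classes"/num_outputs conventions used here are illustrative assumptions, not taken from this module:
from returnn.datasets.basic import DatasetSeq
from returnn.datasets.generating import GeneratingDataset

class MyRandomDataset(GeneratingDataset):  # hypothetical example, not part of RETURNN
    def __init__(self, **kwargs):
        # 3-dim dense input, 5 sparse output classes (arbitrary values)
        super().__init__(input_dim=3, output_dim=5, **kwargs)

    def generate_seq(self, seq_idx):
        # self.random is already seeded deterministically for this seq_idx and epoch (see generate_seq above)
        seq_len = 10
        features = self.random.uniform(size=(seq_len, self.num_inputs)).astype("float32")
        targets = self.random.randint(0, self.num_outputs["classes"][0], size=(seq_len,)).astype("int32")
        return DatasetSeq(seq_idx=seq_idx, features=features, targets=targets)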
- class returnn.datasets.generating.Task12AXDataset(**kwargs)[source]#
12AX memory task. This is a simple memory task where there is an outer loop and an inner loop. Description here: https://psych.colorado.edu/~oreilly/pubs-abstr.html#OReillyFrank06
- Parameters:
input_dim (int|None) –
output_dim (int|dict[str,int|(int,int)|dict]) – if dict, can specify all data-keys
num_seqs (int|float) –
- generate_input_seq(seq_len)[source]#
Somewhat made-up probability distribution. We try to generate the input such that at least some “R” labels occur in the output seq; otherwise, “R”s are really rare.
- Parameters:
seq_len (int) –
- Return type:
list[int]
- estimate_output_class_priors(num_trials, seq_len=10)[source]#
- Parameters:
num_trials (int) –
seq_len (int) –
- Return type:
(float, float)
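To quickly inspect the generated sequences, the same dump-dataset demo style used elsewhere in this module can be applied (the num_seqs value is arbitrary):
tools/dump-dataset.py "{'class': 'Task12AXDataset', 'num_seqs': 10}"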
- class returnn.datasets.generating.TaskEpisodicCopyDataset(**kwargs)[source]#
Episodic Copy memory task. This is a simple memory task where we need to remember a sequence. Described in: https://arxiv.org/abs/1511.06464 Also tested for Associative LSTMs. This is a variant where the lengths are random, both for the chars and for blanks.
- Parameters:
input_dim (int|None) –
output_dim (int|dict[str,int|(int,int)|dict]) – if dict, can specify all data-keys
num_seqs (int|float) –
- class returnn.datasets.generating.TaskXmlModelingDataset(limit_stack_depth=4, **kwargs)[source]#
XML modeling memory task. This is a memory task where we need to remember a stack. Defined in Jozefowicz et al. (2015). Also tested for Associative LSTMs.
- Parameters:
input_dim (int|None) –
output_dim (int|dict[str,int|(int,int)|dict]) – if dict, can specify all data-keys
num_seqs (int|float) –
- class returnn.datasets.generating.TaskVariableAssignmentDataset(**kwargs)[source]#
Variable Assignment memory task. This is a memory task to test for key-value retrieval. Defined in Associative LSTM paper.
- Parameters:
input_dim (int|None) –
output_dim (int|dict[str,int|(int,int)|dict]) – if dict, can specify all data-keys
num_seqs (int|float) –
- class returnn.datasets.generating.TaskNumberBaseConvertDataset(input_base=8, output_base=2, min_input_seq_len=1, max_input_seq_len=8, **kwargs)[source]#
Task: convert a number from one base into another base. E.g. get a number in octal and convert it to binary (e.g. “10101001”).
- Parameters:
input_base (int) –
output_base (int) –
min_input_seq_len (int) –
max_input_seq_len (int) –
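A minimal sketch of selecting this task as training data in a RETURNN config (the num_seqs value is arbitrary):
train = {"class": "TaskNumberBaseConvertDataset", "input_base": 8, "output_base": 2, "num_seqs": 1000}
As a pure base-conversion example, octal 251 and binary 10101001 denote the same number (169 decimal).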
- class returnn.datasets.generating.DummyDataset(input_dim, output_dim, num_seqs, seq_len=2, input_max_value=10.0, input_shift=None, input_scale=None, **kwargs)[source]#
Some dummy data, which does not have any meaning. If you want artificial data with some meaning, look at the other datasets here. The inputs are dense data, the outputs are sparse.
- Parameters:
input_dim (int|None) –
output_dim (int|dict[str,int|(int,int)|dict]) –
num_seqs (int|float) –
seq_len (int|dict[str,int]) –
input_max_value (float) –
input_shift (float|None) –
input_scale (float|None) –
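A minimal config sketch (all values are arbitrary illustrations): 13-dim dense input, 7 sparse output classes, 100 seqs of length 20:
train = {"class": "DummyDataset", "input_dim": 13, "output_dim": 7, "num_seqs": 100, "seq_len": 20}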
- class returnn.datasets.generating.DummyDatasetMultipleSequenceLength(input_dim, output_dim, num_seqs, seq_len=None, input_max_value=10.0, input_shift=None, input_scale=None, **kwargs)[source]#
Like DummyDataset, but provides seqs with different sequence lengths.
- Parameters:
input_dim (int) –
output_dim (int) –
num_seqs (int|float) –
seq_len (int|dict[str,int]) –
input_max_value (float) –
input_shift (float|None) –
input_scale (float|None) –
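Here seq_len can also be a dict, assumed to be keyed by the data keys, to get a different length per key; a minimal sketch with arbitrary values:
train = {
    "class": "DummyDatasetMultipleSequenceLength",
    "input_dim": 13, "output_dim": 7, "num_seqs": 100,
    "seq_len": {"data": 20, "classes": 10},
}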
- class returnn.datasets.generating.DummyDatasetMultipleDataKeys(output_dim, num_seqs, seq_len=None, input_max_value=10.0, input_shift=None, input_scale=None, data_keys=None, **kwargs)[source]#
Like DummyDataset, this class provides dummy data without any meaning. But it extends DummyDataset such that it is able to provide data for multiple data keys, not only “data” and “classes” (those are also overridable, though the current implementation expects a “data” key). Further, output_dim is expected to be a dict now, which defines the data format for each data key, which also enables the user to customize whether the data is sparse or dense. It also provides the function of DummyDatasetMultipleSequenceLength to customize the sequence length for each data point.
- Parameters:
output_dim (dict[str,int|(int,int)|dict]) – dict defining the output for each data key (e.g. {“data”: [200, 2], “classes”: [100, 1]}).
num_seqs (int|float) –
seq_len (int|dict[str,int]) – definition of the sequence length for each data key, if int the given length is used for all data keys.
input_max_value (float) –
input_shift (float|None) –
input_scale (float|None) –
data_keys (list[str]|None) – explicit declaration of the data keys, if None “data” and “classes” are used.
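A minimal sketch using the output_dim format from above, extended by one hypothetical extra data key (“speaker”); all values are arbitrary:
train = {
    "class": "DummyDatasetMultipleDataKeys",
    "num_seqs": 100,
    "output_dim": {"data": [200, 2], "classes": [100, 1], "speaker": [10, 1]},
    "data_keys": ["data", "classes", "speaker"],
    "seq_len": {"data": 20, "classes": 20, "speaker": 1},
}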
- class returnn.datasets.generating.StaticDataset(data, target_list=None, input_dim=None, output_dim=None, **kwargs)[source]#
Provide all the data as a list of dict of numpy arrays.
- Parameters:
data (list[dict[str,numpy.ndarray]]) – list of seqs, each providing the data for each data-key
target_list –
input_dim (int|None) –
output_dim (int|dict[str,(int,int)|list[int]]) –
- classmethod copy_from_dataset(dataset, start_seq_idx=0, max_seqs=None)[source]#
- Parameters:
dataset (Dataset) –
start_seq_idx (int) –
max_seqs (int|None) –
- Return type:
StaticDataset
- init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]#
- Parameters:
epoch (int|None) –
seq_list (list[str]|None) – List of sequence tags, to set a predefined order.
seq_order (list[int]|None) – List of corpus sequence indices, to set a predefined order. Only possible if the dataset has such indices (see self.have_corpus_seq_idx()).
- Return type:
bool
- Returns:
whether the order changed (True is always safe to return)
- get_all_tags()[source]#
- Returns:
list of all seq tags, of the whole dataset, without partition epoch.
- Return type:
list[str]
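A minimal sketch of building a StaticDataset directly from numpy arrays; the shapes and the [dim, ndim] form of output_dim used here are illustrative assumptions:
import numpy
from returnn.datasets.generating import StaticDataset

data = [
    {"data": numpy.zeros((7, 5), dtype="float32"), "classes": numpy.array([1, 2, 3], dtype="int32")},
    {"data": numpy.zeros((11, 5), dtype="float32"), "classes": numpy.array([0, 4], dtype="int32")},
]
# "data": 5-dim dense (ndim 2), "classes": sparse with 6 classes (ndim 1)
dataset = StaticDataset(data=data, output_dim={"data": [5, 2], "classes": [6, 1]})
dataset.init_seq_order(epoch=1)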
- class returnn.datasets.generating.CopyTaskDataset(nsymbols, minlen=0, maxlen=0, minlen_epoch_factor=0, maxlen_epoch_factor=0, **kwargs)[source]#
Copy task. Input/output is exactly the same random sequence of sparse labels.
- Parameters:
nsymbols (int) –
minlen (int) –
maxlen (int) –
minlen_epoch_factor (float) –
maxlen_epoch_factor (float) –
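In the dump-dataset demo style (argument values arbitrary):
tools/dump-dataset.py "{'class': 'CopyTaskDataset', 'nsymbols': 10, 'minlen': 5, 'maxlen': 20, 'num_seqs': 10}"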
- class returnn.datasets.generating.TimitDataset(timit_dir, train=True, preload=False, features='mfcc', num_feature_filters=40, feature_window_len=0.025, feature_step_len=0.01, with_delta=False, norm_mean=None, norm_std_dev=None, random_permute_audio=None, num_phones=61, demo_play_audio=False, **kwargs)[source]#
DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus. You must provide the data.
Demo:
tools/dump-dataset.py "{'class': 'TimitDataset', 'timit_dir': '…'}"
tools/dump-dataset.py "{'class': 'TimitDataset', 'timit_dir': '…', 'demo_play_audio': True, 'random_permute_audio': True}"
The full train data has 3696 utterances and the core test data has 192 utterances (24-speaker core test set).
The input length is not the same as the output length. The targets are not the framewise alignment.
For some references: https://github.com/ppwwyyxx/tensorpack/blob/master/examples/CTC-TIMIT/train-timit.py https://www.cs.toronto.edu/~graves/preprint.pdf https://arxiv.org/pdf/1303.5778.pdf https://arxiv.org/pdf/0804.3269.pdf
- Parameters:
timit_dir (str|None) – directory of TIMIT. should contain train/filelist.phn and test/filelist.core.phn
train (bool) – whether to use the train or core test data
preload (bool) – if True, we will wait at __init__ until all the data is loaded
features (str|function) – see ExtractAudioFeatures
num_feature_filters (int) – e.g. number of MFCCs
with_delta (bool|int) – whether to add delta features (doubles the features dim). if int, up to this degree
norm_mean (str) – file with mean values which are used for mean-normalization of the final features
norm_std_dev (str) – file with std dev values for variance-normalization of the final features
random_permute_audio (None|bool|dict[str]) – enables permutation on the audio. see _get_random_permuted_audio
num_phones (int) – 39, 48 or 61; number of labels of our classes
demo_play_audio (bool) – plays the audio. only makes sense with tools/dump-dataset.py
- PhoneMapTo39 = {'aa': 'aa', 'ae': 'ae', 'ah': 'ah', 'ao': 'aa', 'aw': 'aw', 'ax': 'ah', 'ax-h': 'ah', 'axr': 'er', 'ay': 'ay', 'b': 'b', 'bcl': 'sil', 'ch': 'ch', 'd': 'd', 'dcl': 'sil', 'dh': 'dh', 'dx': 'dx', 'eh': 'eh', 'el': 'l', 'em': 'm', 'en': 'n', 'eng': 'ng', 'epi': 'sil', 'er': 'er', 'ey': 'ey', 'f': 'f', 'g': 'g', 'gcl': 'sil', 'h#': 'sil', 'hh': 'hh', 'hv': 'hh', 'ih': 'ih', 'ix': 'ih', 'iy': 'iy', 'jh': 'jh', 'k': 'k', 'kcl': 'sil', 'l': 'l', 'm': 'm', 'n': 'n', 'ng': 'ng', 'nx': 'n', 'ow': 'ow', 'oy': 'oy', 'p': 'p', 'pau': 'sil', 'pcl': 'sil', 'q': None, 'r': 'r', 's': 's', 'sh': 'sh', 't': 't', 'tcl': 'sil', 'th': 'th', 'uh': 'uh', 'uw': 'uw', 'ux': 'uw', 'v': 'v', 'w': 'w', 'y': 'y', 'z': 'z', 'zh': 'sh'}[source]#
- PhoneMapTo48 = {'aa': 'aa', 'ae': 'ae', 'ah': 'ah', 'ao': 'ao', 'aw': 'aw', 'ax': 'ax', 'ax-h': 'ax', 'axr': 'er', 'ay': 'ay', 'b': 'b', 'bcl': 'vcl', 'ch': 'ch', 'd': 'd', 'dcl': 'vcl', 'dh': 'dh', 'dx': 'dx', 'eh': 'eh', 'el': 'el', 'em': 'm', 'en': 'en', 'eng': 'ng', 'epi': 'epi', 'er': 'er', 'ey': 'ey', 'f': 'f', 'g': 'g', 'gcl': 'vcl', 'h#': 'sil', 'hh': 'hh', 'hv': 'hh', 'ih': 'ih', 'ix': 'ix', 'iy': 'iy', 'jh': 'jh', 'k': 'k', 'kcl': 'cl', 'l': 'l', 'm': 'm', 'n': 'n', 'ng': 'ng', 'nx': 'n', 'ow': 'ow', 'oy': 'oy', 'p': 'p', 'pau': 'sil', 'pcl': 'cl', 'q': None, 'r': 'r', 's': 's', 'sh': 'sh', 't': 't', 'tcl': 'cl', 'th': 'th', 'uh': 'uh', 'uw': 'uw', 'ux': 'uw', 'v': 'v', 'w': 'w', 'y': 'y', 'z': 'z', 'zh': 'zh'}[source]#
- Phones61 = dict_keys(['aa', 'ae', 'ah', 'ao', 'aw', 'ax', 'ax-h', 'axr', 'ay', 'b', 'bcl', 'ch', 'd', 'dcl', 'dh', 'dx', 'eh', 'el', 'em', 'en', 'eng', 'epi', 'er', 'ey', 'f', 'g', 'gcl', 'h#', 'hh', 'hv', 'ih', 'ix', 'iy', 'jh', 'k', 'kcl', 'l', 'm', 'n', 'ng', 'nx', 'ow', 'oy', 'p', 'pau', 'pcl', 'q', 'r', 's', 'sh', 't', 'tcl', 'th', 'uh', 'uw', 'ux', 'v', 'w', 'y', 'z', 'zh'])[source]#
- PhoneMapTo61 = {'aa': 'aa', 'ae': 'ae', 'ah': 'ah', 'ao': 'ao', 'aw': 'aw', 'ax': 'ax', 'ax-h': 'ax-h', 'axr': 'axr', 'ay': 'ay', 'b': 'b', 'bcl': 'bcl', 'ch': 'ch', 'd': 'd', 'dcl': 'dcl', 'dh': 'dh', 'dx': 'dx', 'eh': 'eh', 'el': 'el', 'em': 'em', 'en': 'en', 'eng': 'eng', 'epi': 'epi', 'er': 'er', 'ey': 'ey', 'f': 'f', 'g': 'g', 'gcl': 'gcl', 'h#': 'h#', 'hh': 'hh', 'hv': 'hv', 'ih': 'ih', 'ix': 'ix', 'iy': 'iy', 'jh': 'jh', 'k': 'k', 'kcl': 'kcl', 'l': 'l', 'm': 'm', 'n': 'n', 'ng': 'ng', 'nx': 'nx', 'ow': 'ow', 'oy': 'oy', 'p': 'p', 'pau': 'pau', 'pcl': 'pcl', 'q': 'q', 'r': 'r', 's': 's', 'sh': 'sh', 't': 't', 'tcl': 'tcl', 'th': 'th', 'uh': 'uh', 'uw': 'uw', 'ux': 'ux', 'v': 'v', 'w': 'w', 'y': 'y', 'z': 'z', 'zh': 'zh'}[source]#
- classmethod get_labels(num_phones=61)[source]#
- Parameters:
num_phones (int) –
- Return type:
list[str]
- classmethod get_label_map(source_num_phones=61, target_num_phones=39)[source]#
- Parameters:
source_num_phones (int) –
target_num_phones (int) –
- Return type:
dict[int,int|None]
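A minimal usage sketch of the two classmethods above (the actual mapping values come from PhoneMapTo39):
from returnn.datasets.generating import TimitDataset

labels61 = TimitDataset.get_labels(num_phones=61)
label_map = TimitDataset.get_label_map(source_num_phones=61, target_num_phones=39)  # dict[int,int|None]
# Entries mapped to None (e.g. "q") have no counterpart in the 39-phone set.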
- class returnn.datasets.generating.NltkTimitDataset(nltk_download_dir=None, **kwargs)[source]#
DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus
This Dataset will get TIMIT via NLTK. Demo:
tools/dump-dataset.py "{'class': 'NltkTimitDataset'}"
tools/dump-dataset.py "{'class': 'NltkTimitDataset', 'demo_play_audio': True, 'random_permute_audio': True}"
Note: The NLTK data only contains a subset of the train data (160 utterances), and none of the test data. The full train data has 3696 utterances and the core test data has 192 utterances. Not sure how useful this is…
See TimitDataset for more.
- Parameters:
timit_dir (str|None) – directory of TIMIT. should contain train/filelist.phn and test/filelist.core.phn
train (bool) – whether to use the train or core test data
preload (bool) – if True, we will wait at __init__ until all the data is loaded
features (str|function) – see ExtractAudioFeatures
num_feature_filters (int) – e.g. number of MFCCs
with_delta (bool|int) – whether to add delta features (doubles the features dim). if int, up to this degree
norm_mean (str) – file with mean values which are used for mean-normalization of the final features
norm_std_dev (str) – file with std dev values for variance-normalization of the final features
random_permute_audio (None|bool|dict[str]) – enables permutation on the audio. see _get_random_permuted_audio
num_phones (int) – 39, 48 or 61; number of labels of our classes
demo_play_audio (bool) – plays the audio. only makes sense with tools/dump-dataset.py
- added_data: List[DatasetSeq][source]#
- class returnn.datasets.generating.BlissDataset(path, vocab_file, bpe_file=None, num_feature_filters=40, feature_window_len=0.025, feature_step_len=0.01, with_delta=False, norm_mean=None, norm_std_dev=None, **kwargs)[source]#
Reads in a Bliss XML corpus (similar to LmDataset), and provides the features (similar to TimitDataset) and the orthography as words, subwords or chars (similar to TranslationDataset).
- Example:
- ./tools/dump-dataset.py "{'class': 'BlissDataset', 'path': '/u/tuske/work/ASR/switchboard/corpus/xml/train.corpus.gz', 'bpe_file': '/u/zeyer/setups/switchboard/subwords/swb-bpe-codes', 'vocab_file': '/u/zeyer/setups/switchboard/subwords/swb-vocab'}"
- Parameters:
path (str) – path to XML. can also be gzipped.
vocab_file (str) – path to vocabulary file. Python-str which evals to dict[str,int]
bpe_file (str) – Byte-pair encoding file
num_feature_filters (int) – e.g. number of MFCCs
with_delta (bool|int) – whether to add delta features (doubles the features dim). if int, up to this degree
- init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]#
- Parameters:
epoch (int|None) –
seq_list (list[str]|None) – Predefined order via list of tags, not used here.
seq_order (list[int]|None) – Predefined order via list of indices, not used here.
- Return type:
bool
- Returns:
whether the order changed (True is always safe to return)
- class returnn.datasets.generating.LibriSpeechCorpus(path, prefix, audio, orth_post_process=None, targets=None, chars=None, bpe=None, use_zip=False, use_ogg=False, use_cache_manager=False, fixed_random_subset=None, epoch_wise_filter=None, name=None, **kwargs)[source]#
LibriSpeech. https://www.openslr.org/12/
- “train-*” Seq-length ‘data’ Stats (default MFCC, every 10ms):
281241 seqs Mean: 1230.94154835176 Std dev: 383.5126785278322 Min/max: 84 / 2974
- “train-*” Seq-length ‘classes’ Stats (BPE with 10k symbols):
281241 seqs Mean: 58.46585312952222 Std dev: 20.54464373013634 Min/max: 1 / 161
“train-*” mean transcription len: 177.009085 (chars), i.e. ~3 chars per BPE label
- Parameters:
path (str) – dir, should contain “train-*/*/*/{*.flac,*.trans.txt}”, or “train-*.zip”
prefix (str) – “train”, “dev”, “test”, “dev-clean”, “dev-other”, …
orth_post_process (str|list[str]|None) – get_post_processor_function(), applied on orth
targets (str|dict[str]|None) – “bpe” or “chars” or None or dict for Vocabulary.create_vocab()
audio (dict[str]|None) – options for ExtractAudioFeatures
bpe (dict[str]|None) – options for BytePairEncoding
chars (dict[str]|None) – options for CharacterTargets
use_zip (bool) – whether to use the ZIP files instead (better for NFS)
use_ogg (bool) – add .ogg postfix to all files
use_cache_manager (bool) – uses Util.cf()
fixed_random_subset (float|int|None) – Value in [0,1] to specify the fraction, or integer >=1 which specifies number of seqs. If given, will use this random subset. This will be applied initially at loading time, i.e. not dependent on the epoch. It will use an internally hardcoded fixed random seed, i.e. it’s deterministic.
epoch_wise_filter (dict|None) – see init_seq_order
- init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]#
If random_shuffle_epoch1, for epoch 1 with “random” ordering, we leave the given order as is. Otherwise, this is mostly the default behavior.
- Parameters:
epoch (int|None) –
seq_list (list[str]|None) – List of sequence tags, to set a predefined order.
seq_order (list[int]|None) – List of corpus sequence indices, to set a predefined order.
- Return type:
bool
- Returns:
whether the order changed (True is always safe to return)
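A minimal config sketch; the paths are hypothetical placeholders, and the exact option keys inside “audio” and “bpe” are assumptions (see ExtractAudioFeatures and BytePairEncoding for the actual options):
train = {
    "class": "LibriSpeechCorpus",
    "path": "/path/to/LibriSpeech",  # hypothetical path
    "prefix": "train",
    "use_zip": True,
    "audio": {"features": "mfcc", "num_feature_filters": 40},  # assumed ExtractAudioFeatures options
    "targets": "bpe",
    "bpe": {"bpe_file": "/path/to/bpe.codes", "vocab_file": "/path/to/bpe.vocab"},  # assumed BytePairEncoding options
}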
- class returnn.datasets.generating.Enwik8Corpus(path, subset, seq_len, batch_num_seqs=None, subsubset=None, **kwargs)[source]#
enwik8
- Parameters:
path (str) –
subset (str) – “training”, “validation”, “test”
seq_len (int) –
batch_num_seqs (int|None) – if given, will not shuffle the data but keep it in such an order that, with the given batch num_seqs setting, you could reuse the hidden state in an RNN
subsubset (int|(int,int)|None) – end, (start,end), or full
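A minimal config sketch (the path is a hypothetical placeholder; batch_num_seqs keeps the order suitable for reusing the RNN hidden state, as described above):
train = {
    "class": "Enwik8Corpus",
    "path": "/path/to/enwik8",  # hypothetical download/cache dir
    "subset": "training",
    "seq_len": 512,
    "batch_num_seqs": 32,
}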
- returnn.datasets.generating.demo()[source]#
Some demo for some of the GeneratingDataset.