GeneratingDataset

class GeneratingDataset.GeneratingDataset(input_dim, output_dim, num_seqs=inf, fixed_random_seed=None, **kwargs)[source]
Parameters:
  • input_dim (int|None) –
  • output_dim (int|dict[str,int|(int,int)|dict]) – if dict, can specify all data-keys
  • num_seqs (int|float) –
  • fixed_random_seed (int) –
init_seq_order(epoch=None, seq_list=None)[source]
Parameters: seq_list – predefined order; doesn't make sense here, since the data is generated

This is called when we start a new epoch, or at initialization.

is_cached(start, end)[source]
Parameters:
  • start (int) – like in load_seqs(), sorted seq idx
  • end (int) – like in load_seqs(), sorted seq idx
Return type: bool

Returns: whether we have the full range (start, end) of sorted seq idx.

generate_seq(seq_idx)[source]
Return type:DatasetSeq
get_num_timesteps()[source]
num_seqs[source]
get_seq_length(sorted_seq_idx)[source]
Return type:NumbersDict
get_data(seq_idx, key)[source]
Parameters:
  • seq_idx (int) – sorted seq idx
  • key (str) – data-key, e.g. “data” or “classes”
Return type: numpy.ndarray

Returns: features or targets, format 2d (time,feature) (float)

get_input_data(seq_idx)[source]
Return type:numpy.ndarray
Returns: features, format 2d (time,feature) (float)
get_targets(target, seq_idx)[source]
Return type:numpy.ndarray
Returns: targets, format 1d (time) (int: idx of output-feature)
get_ctc_targets(sorted_seq_idx)[source]
get_tag(sorted_seq_idx)[source]
Parameters:sorted_seq_idx (int) –
Return type:str
class GeneratingDataset.Task12AXDataset(**kwargs)[source]

12AX memory task. This is a simple memory task where there is an outer loop and an inner loop. Description here: http://psych.colorado.edu/~oreilly/pubs-abstr.html#OReillyFrank06
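
As an illustration, here is a hedged sketch of the 12AX target rule, based on the task description in O'Reilly & Frank (2006); the actual class may use different symbols, label indices and reset behavior:

# Hedged sketch of the 12AX rule, not the exact implementation of this class:
# "1"/"2" set the outer context; in context "1" an X completing A..X gives "R",
# in context "2" a Y completing B..Y gives "R"; everything else gives "L".
def make_output_seq_12ax(input_seq):
    outputs = []
    context = None  # outer-loop state: last seen "1" or "2"
    cue = None      # inner-loop state: last seen "A" or "B"
    for sym in input_seq:
        out = "L"
        if sym in ("1", "2"):
            context, cue = sym, None
        elif sym in ("A", "B"):
            cue = sym
        elif sym == "X" and context == "1" and cue == "A":
            out = "R"
        elif sym == "Y" and context == "2" and cue == "B":
            out = "R"
        outputs.append(out)
    return outputs

assert make_output_seq_12ax("1BXAX") == ["L", "L", "L", "L", "R"]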

get_random_seq_len()[source]
generate_input_seq(seq_len)[source]

Generates an input sequence from a somewhat made-up probability distribution, chosen such that at least some "R" labels occur in the output sequence; otherwise, "R"s would be really rare.

classmethod make_output_seq(input_seq)[source]
Return type:list[int]
estimate_output_class_priors(num_trials, seq_len=10)[source]
Return type:(float, float)
generate_seq(seq_idx)[source]
Return type:DatasetSeq
class GeneratingDataset.TaskEpisodicCopyDataset(**kwargs)[source]

Episodic Copy memory task. This is a simple memory task where we need to remember a sequence. Described in: http://arxiv.org/abs/1511.06464 Also tested for Associative LSTMs. This is a variant where the lengths are random, both for the chars and for blanks.
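
A hedged sketch of the fixed-length variant of the task structure (the alphabet and marker symbols here are assumptions for illustration; this class additionally randomizes both the char and blank lengths):

# Sketch of the episodic-copy structure:
# input:  chars, then blanks, then a recall marker, then blanks
# target: blanks until after the recall marker, then the chars again
def make_copy_example(chars, num_blanks):
    blank, recall = ".", "#"
    inputs = list(chars) + [blank] * num_blanks + [recall] + [blank] * len(chars)
    targets = [blank] * (len(chars) + num_blanks + 1) + list(chars)
    return inputs, targets

inputs, targets = make_copy_example("abc", num_blanks=5)
# inputs:  a b c . . . . . # . . .
# targets: . . . . . . . . . a b c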

generate_input_seq()[source]
classmethod make_output_seq(input_seq)[source]
Return type:list[int]
generate_seq(seq_idx)[source]
Return type:DatasetSeq
class GeneratingDataset.TaskXmlModelingDataset(limit_stack_depth=4, **kwargs)[source]

XML modeling memory task. This is a memory task where we need to remember a stack. Defined in Jozefowicz et al. (2015). Also tested for Associative LSTMs.

generate_input_seq()[source]
classmethod make_output_seq(input_seq)[source]
Return type:list[int]
generate_seq(seq_idx)[source]
Return type:DatasetSeq
class GeneratingDataset.TaskVariableAssignmentDataset(**kwargs)[source]

Variable Assignment memory task. This is a memory task to test for key-value retrieval. Defined in Associative LSTM paper.

generate_input_seq()[source]
classmethod make_output_seq(input_seq)[source]
Return type:list[int]
generate_seq(seq_idx)[source]
Return type:DatasetSeq
class GeneratingDataset.DummyDataset(input_dim, output_dim, num_seqs, seq_len=2, input_max_value=10.0, input_shift=None, input_scale=None, **kwargs)[source]
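
Demo (the argument values here are just an example):

tools/dump-dataset.py "{'class': 'DummyDataset', 'input_dim': 2, 'output_dim': 3, 'num_seqs': 4, 'seq_len': 5}"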
generate_seq(seq_idx)[source]
Return type:DatasetSeq
class GeneratingDataset.StaticDataset(data, target_list=None, output_dim=None, input_dim=None, **kwargs)[source]

Provide all the data as a list of dict of numpy arrays.

Parameters:
  • data (list[dict[str,numpy.ndarray]]) – list of seqs, each providing the data for each data-key
  • input_dim (int|None) –
  • output_dim (int|dict[str,(int,int)|list[int]]) –
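
A minimal construction sketch (the toy shapes, and the assumption that output_dim entries are (dim, ndim) pairs, are mine, not from the docs):

import numpy

# Two toy seqs; "data" is dense float (time, feature), "classes" is sparse int (time,).
data = [
    {"data": numpy.zeros((7, 3), dtype="float32"),
     "classes": numpy.array([1, 2, 0], dtype="int32")},
    {"data": numpy.zeros((5, 3), dtype="float32"),
     "classes": numpy.array([2, 1], dtype="int32")},
]
# Assumption: (dim, ndim), i.e. dense dim-3 inputs and 3 sparse target classes.
dataset = StaticDataset(data, output_dim={"data": (3, 2), "classes": (3, 1)})
# Or snapshot an existing dataset via the classmethod below:
# dataset = StaticDataset.copy_from_dataset(other_dataset, max_seqs=100)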
classmethod copy_from_dataset(dataset, start_seq_idx=0, max_seqs=None)[source]
Parameters:
  • dataset (Dataset) –
  • start_seq_idx (int) –
  • max_seqs (int|None) –
Return type: StaticDataset

generate_seq(seq_idx)[source]
Return type:DatasetSeq
get_data_keys()[source]
get_target_list()[source]
class GeneratingDataset.CopyTaskDataset(nsymbols, minlen=0, maxlen=0, minlen_epoch_factor=0, maxlen_epoch_factor=0, **kwargs)[source]
get_random_seq_len()[source]
generate_seq(seq_idx)[source]
Return type:DatasetSeq
class GeneratingDataset.ExtractAudioFeatures(window_len=0.025, step_len=0.01, num_feature_filters=40, with_delta=False, norm_mean=None, norm_std_dev=None, features='mfcc', random_permute=None, random_state=None)[source]

Currently uses librosa to extract MFCC features. (Alternatives would be python_speech_features or talkbox.features.mfcc.) We could also add support to directly extract log-filterbanks and similar features.

Parameters:
  • window_len (float) – in seconds
  • step_len (float) – in seconds
  • num_feature_filters (int) –
  • with_delta (bool|int) –
  • norm_mean (numpy.ndarray|str|None) – if str, will interpret as filename
  • norm_std_dev (numpy.ndarray|str|None) – if str, will interpret as filename
  • features (str) – “mfcc”, “log_mel_filterbank”, “log_log_mel_filterbank”
  • random_permute (CollectionReadCheckCovered|dict[str]|bool|None) –
  • random_state (numpy.random.RandomState|None) –
Returns: shape (audio_len // int(step_len * sample_rate), (with_delta + 1) * num_feature_filters), float32
Return type: numpy.ndarray
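
A standalone sketch of the core extraction step, assuming the plain librosa MFCC path without deltas, random permutation, or normalization (the helper name here is hypothetical):

import librosa

def extract_mfcc(audio, sample_rate, window_len=0.025, step_len=0.01, num_feature_filters=40):
    """Return MFCC features with shape (time, num_feature_filters), float32."""
    mfccs = librosa.feature.mfcc(
        y=audio, sr=sample_rate,
        n_mfcc=num_feature_filters,
        hop_length=int(step_len * sample_rate),
        n_fft=int(window_len * sample_rate))
    # librosa returns (num_feature_filters, num_frames); transpose to (time, feature).
    return mfccs.astype("float32").transpose()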

get_audio_features(audio, sample_rate)[source]
Parameters:
  • audio (numpy.ndarray) – raw audio samples, shape (audio_len,)
  • sample_rate (int) – e.g. 22050
Return type: numpy.ndarray

get_feature_dimension()[source]
class GeneratingDataset.TimitDataset(timit_dir, train=True, preload=False, num_feature_filters=40, feature_window_len=0.025, feature_step_len=0.01, with_delta=False, norm_mean=None, norm_std_dev=None, random_permute_audio=None, num_phones=61, demo_play_audio=False, fixed_random_seed=None, **kwargs)[source]

DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus. You must provide the data.

Demo:

tools/dump-dataset.py "{'class': 'TimitDataset', 'timit_dir': '...'}"
tools/dump-dataset.py "{'class': 'TimitDataset', 'timit_dir': '...', 'demo_play_audio': True, 'random_permute_audio': True}"

The full train data has 3696 utterances and the core test data has 192 utterances (24-speaker core test set).

For some references: https://github.com/ppwwyyxx/tensorpack/blob/master/examples/CTC-TIMIT/train-timit.py https://www.cs.toronto.edu/~graves/preprint.pdf https://arxiv.org/pdf/1303.5778.pdf https://arxiv.org/pdf/0804.3269.pdf

Parameters:
  • timit_dir (str) – directory of TIMIT. should contain train/filelist.phn and test/filelist.core.phn
  • train (bool) – whether to use the train or core test data
  • preload (bool) – if True, we load all the data already at __init__ and wait until it is done
  • num_feature_filters (int) – e.g. number of MFCCs
  • with_delta (bool|int) – whether to add delta features (doubles the feature dim). if int, up to this degree
  • norm_mean (str) – file with mean values which are used for mean-normalization of the final features
  • norm_std_dev (str) – file with std dev values for variance-normalization of the final features
  • random_permute_audio (None|bool|dict[str]) – enables permutation on the audio. see _get_random_permuted_audio
  • num_phones (int) – number of phone classes to use: 39, 48 or 61
  • demo_play_audio (bool) – plays the audio. only makes sense with tools/dump-dataset.py
  • fixed_random_seed (None|int) – if given, use this fixed random seed in every epoch
PhoneMapTo39 = {'aa': 'aa', 'ae': 'ae', 'ah': 'ah', 'ao': 'aa', 'aw': 'aw', 'ax': 'ah', 'ax-h': 'ah', 'axr': 'er', 'ay': 'ay', 'b': 'b', 'bcl': 'sil', 'ch': 'ch', 'd': 'd', 'dcl': 'sil', 'dh': 'dh', 'dx': 'dx', 'eh': 'eh', 'el': 'l', 'em': 'm', 'en': 'n', 'eng': 'ng', 'epi': 'sil', 'er': 'er', 'ey': 'ey', 'f': 'f', 'g': 'g', 'gcl': 'sil', 'h#': 'sil', 'hh': 'hh', 'hv': 'hh', 'ih': 'ih', 'ix': 'ih', 'iy': 'iy', 'jh': 'jh', 'k': 'k', 'kcl': 'sil', 'l': 'l', 'm': 'm', 'n': 'n', 'ng': 'ng', 'nx': 'n', 'ow': 'ow', 'oy': 'oy', 'p': 'p', 'pau': 'sil', 'pcl': 'sil', 'q': None, 'r': 'r', 's': 's', 'sh': 'sh', 't': 't', 'tcl': 'sil', 'th': 'th', 'uh': 'uh', 'uw': 'uw', 'ux': 'uw', 'v': 'v', 'w': 'w', 'y': 'y', 'z': 'z', 'zh': 'sh'}[source]
PhoneMapTo48 = {'aa': 'aa', 'ae': 'ae', 'ah': 'ah', 'ao': 'ao', 'aw': 'aw', 'ax': 'ax', 'ax-h': 'ax', 'axr': 'er', 'ay': 'ay', 'b': 'b', 'bcl': 'vcl', 'ch': 'ch', 'd': 'd', 'dcl': 'vcl', 'dh': 'dh', 'dx': 'dx', 'eh': 'eh', 'el': 'el', 'em': 'm', 'en': 'en', 'eng': 'ng', 'epi': 'epi', 'er': 'er', 'ey': 'ey', 'f': 'f', 'g': 'g', 'gcl': 'vcl', 'h#': 'sil', 'hh': 'hh', 'hv': 'hh', 'ih': 'ih', 'ix': 'ix', 'iy': 'iy', 'jh': 'jh', 'k': 'k', 'kcl': 'cl', 'l': 'l', 'm': 'm', 'n': 'n', 'ng': 'ng', 'nx': 'n', 'ow': 'ow', 'oy': 'oy', 'p': 'p', 'pau': 'sil', 'pcl': 'cl', 'q': None, 'r': 'r', 's': 's', 'sh': 'sh', 't': 't', 'tcl': 'cl', 'th': 'th', 'uh': 'uh', 'uw': 'uw', 'ux': 'uw', 'v': 'v', 'w': 'w', 'y': 'y', 'z': 'z', 'zh': 'zh'}[source]
Phones61 = dict_keys(['iy', 'hv', 'p', 'gcl', 'ao', 'uw', 'pcl', 'uh', 'l', 'eh', 'v', 'z', 'g', 'ae', 'd', 'ax', 't', 'zh', 'ih', 'nx', 'ng', 'b', 'axr', 'm', 'sh', 'k', 'f', 'oy', 'th', 'el', 'w', 'h#', 'y', 'ow', 'dh', 'r', 'q', 'aw', 'dx', 'ey', 'aa', 'en', 'em', 'n', 's', 'ay', 'ux', 'ix', 'dcl', 'epi', 'kcl', 'tcl', 'bcl', 'er', 'ah', 'ch', 'eng', 'pau', 'ax-h', 'jh', 'hh'])[source]
PhoneMapTo61 = {'aa': 'aa', 'ae': 'ae', 'ah': 'ah', 'ao': 'ao', 'aw': 'aw', 'ax': 'ax', 'ax-h': 'ax-h', 'axr': 'axr', 'ay': 'ay', 'b': 'b', 'bcl': 'bcl', 'ch': 'ch', 'd': 'd', 'dcl': 'dcl', 'dh': 'dh', 'dx': 'dx', 'eh': 'eh', 'el': 'el', 'em': 'em', 'en': 'en', 'eng': 'eng', 'epi': 'epi', 'er': 'er', 'ey': 'ey', 'f': 'f', 'g': 'g', 'gcl': 'gcl', 'h#': 'h#', 'hh': 'hh', 'hv': 'hv', 'ih': 'ih', 'ix': 'ix', 'iy': 'iy', 'jh': 'jh', 'k': 'k', 'kcl': 'kcl', 'l': 'l', 'm': 'm', 'n': 'n', 'ng': 'ng', 'nx': 'nx', 'ow': 'ow', 'oy': 'oy', 'p': 'p', 'pau': 'pau', 'pcl': 'pcl', 'q': 'q', 'r': 'r', 's': 's', 'sh': 'sh', 't': 't', 'tcl': 'tcl', 'th': 'th', 'uh': 'uh', 'uw': 'uw', 'ux': 'ux', 'v': 'v', 'w': 'w', 'y': 'y', 'z': 'z', 'zh': 'zh'}[source]
classmethod get_label_map(source_num_phones=61, target_num_phones=39)[source]
Parameters:
  • source_num_phones (int) –
  • target_num_phones (int) –
Return type: dict[int,int|None]
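
A hedged sketch of what such a map encodes for 61→39, assuming labels are indexed in sorted phone-symbol order (the actual index order used by this class may differ):

src = sorted(TimitDataset.PhoneMapTo61.keys())
tgt = sorted(set(p for p in TimitDataset.PhoneMapTo39.values() if p is not None))
# 'q' maps to None in PhoneMapTo39, i.e. it is dropped when reducing to 39 phones.
label_map = {
    i: (tgt.index(TimitDataset.PhoneMapTo39[phone])
        if TimitDataset.PhoneMapTo39[phone] is not None else None)
    for i, phone in enumerate(src)}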

init_seq_order(epoch=None, seq_list=None)[source]
Parameters:
  • epoch (int|None) –
  • seq_list (list[str]|None) – In case we want to set a predefined order.
Return type: bool

Returns: whether the order changed (True is always safe to return)

This is called when we start a new epoch, or at initialization. Call this when you reset the seq list.

class GeneratingDataset.NltkTimitDataset(nltk_download_dir=None, **kwargs)[source]

DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus

This Dataset will get TIMIT via NLTK. Demo:

tools/dump-dataset.py "{'class': 'NltkTimitDataset'}"
tools/dump-dataset.py "{'class': 'NltkTimitDataset', 'demo_play_audio': True, 'random_permute_audio': True}"

Note: The NLTK data only contains a subset of the train data (160 utterances), and none of the test data. The full train data has 3696 utterances and the core test data has 192 utterances. Not sure how useful this is…

class GeneratingDataset.Vocabulary(vocab_file, unknown_label='UNK', num_labels=None)[source]

Represents a vocabulary (set of words, and their ids). Used by BytePairEncoding.

Parameters:
  • vocab_file (str) –
  • unknown_label (str) –
  • num_labels (int) – just for verification
classmethod create_vocab(**opts)[source]
Parameters:opts – kwargs for class
Return type:Vocabulary|BytePairEncoding
classmethod create_vocab_dict_from_labels(labels)[source]

This is exactly the format which we expect when we read it in self._parse_vocab.

Parameters:labels (list[str]) –
Return type:dict[str,int]
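
Given the description, this presumably amounts to consecutive indices in the given label order, roughly:

def create_vocab_dict_from_labels(labels):
    # one consecutive id per label, in the given order
    return {label: i for i, label in enumerate(labels)}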
tf_get_init_variable_func(var)[source]
Parameters:var (tensorflow.Variable) –
Return type:(tensorflow.Session)->None
get_seq(sentence)[source]
Parameters:sentence (str) – assumed to be seq of vocab entries separated by whitespace
Return type:list[int]
get_seq_indices(seq)[source]
Parameters:seq (list[str]) –
Return type:list[int]
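
A hedged sketch of the lookup chain (attribute names here are assumptions): get_seq splits on whitespace and delegates to get_seq_indices, which maps each entry to its id, falling back to the unknown label:

def get_seq(self, sentence):
    return self.get_seq_indices(sentence.split())

def get_seq_indices(self, seq):
    unknown_id = self.vocab[self.unknown_label]  # assumption: vocab is dict[str,int]
    return [self.vocab.get(word, unknown_id) for word in seq]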
get_seq_labels(seq)[source]
Parameters:seq (list[int]) –
Return type:str
class GeneratingDataset.BytePairEncoding(vocab_file, bpe_file, seq_postfix=None, unknown_label='UNK')[source]

Code is partly taken from subword-nmt/apply_bpe.py. Author: Rico Sennrich, code under MIT license.

Use operations learned with learn_bpe.py to encode a new text. The text will not be smaller, but use only a fixed vocabulary, with rare words encoded as variable-length sequences of subword units.

Reference: Rico Sennrich, Barry Haddow and Alexandra Birch (2016). Neural Machine Translation of Rare Words with Subword Units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). Berlin, Germany.
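
For intuition, here is a simplified sketch of the greedy merge loop at the heart of apply_bpe, ignoring caching, vocabulary checks and glossaries (bpe_codes is assumed to map symbol pairs to their merge priority, lower = learned earlier):

def encode_word(word, bpe_codes):
    # start from characters, with an end-of-word marker on the last symbol
    symbols = list(word[:-1]) + [word[-1] + "</w>"]
    while len(symbols) > 1:
        pairs = {(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)}
        ranked = [(bpe_codes[p], p) for p in pairs if p in bpe_codes]
        if not ranked:
            break  # no learned merge applies anymore
        best = min(ranked)[1]
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols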

Parameters:
  • vocab_file (str) –
  • bpe_file (str) –
  • seq_postfix (list[int]|None) – labels will be added to the seq in self.get_seq
  • unknown_label (str) –
check_vocab_and_split(orig, bpe_codes, vocab, separator)[source]

Check for each segment in word if it is in-vocabulary, and segment OOV segments into smaller units by reversing the BPE merge operations

recursive_split(segment, bpe_codes, vocab, separator, final=False)[source]

Recursively split segment into smaller units (by reversing BPE merges) until all units are either in-vocabulary, or cannot be split further.

get_seq(sentence)[source]
Parameters:sentence (str) –
Return type:list[int]
class GeneratingDataset.CharacterTargets(vocab_file, seq_postfix=None, unknown_label='@')[source]

Uses characters as target labels.

Parameters:
  • vocab_file (str) –
  • seq_postfix (list[int]|None) – labels will be added to the seq in self.get_seq
  • unknown_label (str) –
get_seq(sentence)[source]
Parameters:sentence (str) –
Return type:list[int]
class GeneratingDataset.BlissDataset(path, vocab_file, bpe_file=None, num_feature_filters=40, feature_window_len=0.025, feature_step_len=0.01, with_delta=False, norm_mean=None, norm_std_dev=None, **kwargs)[source]

Reads in a Bliss XML corpus (similar as LmDataset), and provides the features (similar as TimitDataset) and the orthography as words, subwords or chars (similar as TranslationDataset).

Example:
./tools/dump-dataset.py "{'class': 'BlissDataset',
  'path': '/u/tuske/work/ASR/switchboard/corpus/xml/train.corpus.gz',
  'bpe_file': '/u/zeyer/setups/switchboard/subwords/swb-bpe-codes',
  'vocab_file': '/u/zeyer/setups/switchboard/subwords/swb-vocab'}"
Parameters:
  • path (str) – path to XML. can also be gzipped.
  • vocab_file (str) – path to vocabulary file. Python-str which evals to dict[str,int]
  • bpe_file (str) – Byte-pair encoding file
  • num_feature_filters (int) – e.g. number of MFCCs
  • with_delta (bool|int) – whether to add delta features (doubles the features dim). if int, up to this degree
class SeqInfo[source]
audio_end[source]
audio_path[source]
audio_start[source]
idx[source]
orth_raw[source]
orth_seq[source]
tag[source]
init_seq_order(epoch=None, seq_list=None)[source]
Parameters:
  • epoch (int|None) –
  • seq_list (list[str]|None) – In case we want to set a predefined order.
Return type: bool

Returns: whether the order changed (True is always safe to return)

class GeneratingDataset.LibriSpeechCorpus(path, prefix, audio, orth_post_process=None, targets=None, chars=None, bpe=None, use_zip=False, use_ogg=False, use_cache_manager=False, partition_epoch=None, fixed_random_seed=None, fixed_random_subset=None, epoch_wise_filter=None, name=None, **kwargs)[source]

LibriSpeech. http://www.openslr.org/12/

"train-*" seq-length stats for 'data' (default MFCC, every 10ms):
  281241 seqs, mean: 1230.94154835176, std dev: 383.5126785278322, min/max: 84 / 2974
"train-*" seq-length stats for 'classes' (BPE with 10k symbols):
  281241 seqs, mean: 58.46585312952222, std dev: 20.54464373013634, min/max: 1 / 161

"train-*" mean transcription length: 177.009085 chars, i.e. ~3 chars per BPE label

Parameters:
  • path (str) – dir, should contain "train-*/*/*/{*.flac,*.trans.txt}", or "train-*.zip"
  • prefix (str) – “train”, “dev”, “test”, “dev-clean”, “dev-other”, …
  • orth_post_process (str|list[str]|None) – get_post_processor_function(), applied on orth
  • targets (str|None) – “bpe” or “chars” currently, if None, then “bpe”
  • audio (dict[str]) – options for ExtractAudioFeatures
  • bpe (dict[str]) – options for BytePairEncoding
  • chars (dict[str]) – options for CharacterTargets
  • use_zip (bool) – whether to use the ZIP files instead (better for NFS)
  • use_ogg (bool) – add .ogg postfix to all files
  • use_cache_manager (bool) – uses Util.cf()
  • partition_epoch (int|None) –
  • fixed_random_seed (int|None) – for the shuffling, e.g. for seq_ordering=’random’. otherwise epoch will be used
  • fixed_random_subset (float|int|None) – Value in [0,1] to specify the fraction, or integer >=1 which specifies the number of seqs. If given, will use this random subset. This is applied once at loading time, i.e. it does not depend on the epoch. It uses an internally hardcoded fixed random seed, i.e. it's deterministic.
  • epoch_wise_filter (dict|None) – see init_seq_order
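
A hypothetical config sketch tying the options together (all paths and file names below are placeholders, not real defaults):

dataset_opts = {
    "class": "LibriSpeechCorpus",
    "path": "/data/LibriSpeech",  # placeholder; dir with the corpus, or with train-*.zip if use_zip
    "prefix": "train",
    "use_zip": True,
    "audio": {"features": "mfcc", "num_feature_filters": 40},  # passed to ExtractAudioFeatures
    "targets": "bpe",
    "bpe": {"bpe_file": "bpe.codes", "vocab_file": "bpe.vocab"},  # passed to BytePairEncoding
    "partition_epoch": 20,
}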
init_seq_order(epoch=None, seq_list=None)[source]

If random_shuffle_epoch1, for epoch 1 with “random” ordering, we leave the given order as is. Otherwise, this is mostly the default behavior.

Parameters:
  • epoch (int|None) –
  • seq_list (list[str]|None) – In case we want to set a predefined order.
Return type: bool

Returns: whether the order changed (True is always safe to return)

have_corpus_seq_idx()[source]
Return type:bool
Returns:whether you can call self.get_corpus_seq_idx()
get_corpus_seq_idx(seq_idx)[source]
Parameters:seq_idx (int) – sorted sequence index from the current epoch, depending on seq_ordering
Returns:the sequence index as-is in the original corpus. only defined if self.have_corpus_seq_idx()
Return type:int
get_tag(seq_idx)[source]
Parameters:seq_idx (int) –
Return type:str
class GeneratingDataset.Enwik8Corpus(path, subset, seq_len, fixed_random_seed=None, batch_num_seqs=None, subsubset=None, partition_epoch=None, **kwargs)[source]

enwik8

Parameters:
  • path (str) –
  • subset (str) – “training”, “validation”, “test”
  • seq_len (int) –
  • fixed_random_seed (int|None) –
  • batch_num_seqs (int|None) – if given, the data is not shuffled but kept in an order such that, with a matching batch num_seqs setting, the RNN hidden state can be reused across consecutive batches
  • subsubset (int|(int,int)|None) – end index, or (start, end) range, or None for the full subset
  • partition_epoch (int|None) –
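
A hypothetical config sketch (the path is a placeholder); batch_num_seqs keeps the data ordered so that, with a matching batching setting, RNN state can be carried over between batches:

dataset_opts = {
    "class": "Enwik8Corpus",
    "path": "/data/enwik8",  # placeholder
    "subset": "training",
    "seq_len": 512,
    "batch_num_seqs": 32,
}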
get_data_dtype(key)[source]
Parameters:key (str) – e.g. “data” or “classes”
Returns:dtype as str, e.g. “int32” or “float32”
Return type:str
init_seq_order(epoch=None, seq_list=None)[source]
Parameters:
  • epoch (int|None) –
  • seq_list (list[str]|None) – In case we want to set a predefined order.
Return type: bool

Returns: whether the order changed (True is always safe to return)

This is called when we start a new epoch, or at initialization. Call this when you reset the seq list.

GeneratingDataset.demo()[source]