GeneratingDataset

Some datasets for artificially generated data.

class GeneratingDataset.GeneratingDataset(input_dim, output_dim, num_seqs=inf, fixed_random_seed=None, **kwargs)[source]

Some base class for datasets with artificially generated data.

Parameters:
  • input_dim (int|None) –
  • output_dim (int|dict[str,int|(int,int)|dict]) – if dict, can specify all data-keys
  • num_seqs (int|float) –
  • fixed_random_seed (int) –
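
For illustration, a minimal subclass sketch (the class name RandomPairsDataset is made up; module layout assumed as in this doc, with DatasetSeq from the Dataset module; generate_seq is the essential method a subclass provides):

import numpy
from GeneratingDataset import GeneratingDataset
from Dataset import DatasetSeq

class RandomPairsDataset(GeneratingDataset):
  """Sketch: random dense inputs with random sparse targets."""

  def __init__(self, **kwargs):
    super(RandomPairsDataset, self).__init__(input_dim=3, output_dim=5, **kwargs)

  def generate_seq(self, seq_idx):
    seq_len = 10
    # self.random is the dataset's RandomState, reseeded per epoch
    # (or kept fixed via fixed_random_seed).
    features = self.random.uniform(0., 1., (seq_len, self.num_inputs)).astype("float32")
    targets = self.random.randint(0, self.num_outputs["classes"][0], (seq_len,)).astype("int32")
    return DatasetSeq(seq_idx=seq_idx, features=features, targets=targets)

dataset = RandomPairsDataset(num_seqs=10)
dataset.init_seq_order(epoch=1)
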
init_seq_order(self, epoch=None, seq_list=None)[source]

This is called when we start a new epoch, or at initialization.

Parameters:seq_list – predefined order; does not make sense here

is_cached(self, start, end)[source]
Parameters:
  • start (int) –
  • end (int) –
Return type:

bool

generate_seq(self, seq_idx)[source]
Return type:DatasetSeq
get_num_timesteps(self)[source]
Return type:int
num_seqs[source]
Return type:int
get_seq_length(self, seq_idx)[source]
Parameters:seq_idx (int) –
Return type:Util.NumbersDict
get_data(self, seq_idx, key)[source]
Parameters:
  • seq_idx (int) –
  • key (str) –
Return type:

numpy.ndarray

get_input_data(self, seq_idx)[source]
Parameters:seq_idx (int) –
Return type:numpy.ndarray
get_targets(self, target, seq_idx)[source]
Parameters:
  • seq_idx (int) –
  • target (str) –
Return type:

numpy.ndarray

get_ctc_targets(self, sorted_seq_idx)[source]
Parameters:sorted_seq_idx (int) –
Return type:typing.Optional[numpy.ndarray]
get_tag(self, seq_idx)[source]
Parameters:seq_idx (int) –
Return type:str
class GeneratingDataset.Task12AXDataset(**kwargs)[source]

12AX memory task. This is a simple memory task where there is an outer loop and an inner loop. Description here: http://psych.colorado.edu/~oreilly/pubs-abstr.html#OReillyFrank06
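
To make the rule concrete, a rough sketch of the target computation on characters (the dataset itself works on integer class indices; “L”/“R” here are illustrative, and the exact encoding may differ):

def make_output_seq_chars(input_chars):
  """Sketch of the 12AX rule: respond "R" to an X completing 1..A..X,
  or to a Y completing 2..B..Y; respond "L" otherwise."""
  outer, inner = "", ""   # last seen digit, last seen A/B
  out = []
  for c in input_chars:
    o = "L"
    if c in "12":
      outer = c
    elif c in "AB":
      inner = c
    elif c in "XY":
      if outer + inner + c in ("1AX", "2BY"):
        o = "R"
      inner = ""          # an X/Y consumes the inner-loop context
    out.append(o)
  return out

assert make_output_seq_chars(list("1BXAX2BY")) == list("LLLLRLLR")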

get_random_seq_len(self)[source]
Return type:int
generate_input_seq(self, seq_len)[source]

A somewhat made-up probability distribution. Tries to generate the input such that at least some “R” will occur in the output seq; otherwise, “R”s are really rare.

Parameters:seq_len (int) –
Return type:list[int]
classmethod make_output_seq(input_seq)[source]
Return type:list[int]
estimate_output_class_priors(self, num_trials, seq_len=10)[source]
Parameters:seq_len (int) –
Return type:(float, float)
generate_seq(self, seq_idx)[source]
Parameters:seq_idx (int) –
Return type:DatasetSeq
class GeneratingDataset.TaskEpisodicCopyDataset(**kwargs)[source]

Episodic Copy memory task. This is a simple memory task where we need to remember a sequence. Described in: http://arxiv.org/abs/1511.06464 Also tested for Associative LSTMs. This is a variant where the lengths are random, both for the chars and for blanks.
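
For intuition, an illustrative sketch of such a sequence (symbol ids, lengths and distributions are made up here; the dataset's actual alphabet and generation differ in detail):

import numpy

rnd = numpy.random.RandomState(42)
chars = rnd.randint(1, 9, size=rnd.randint(5, 11))           # symbols to remember
blanks = numpy.zeros(rnd.randint(5, 21), dtype=chars.dtype)  # random-length blank gap
delim = numpy.array([9], dtype=chars.dtype)                  # "recall now" marker
pad = numpy.zeros(len(chars), dtype=chars.dtype)
input_seq = numpy.concatenate([chars, blanks, delim, pad])
# Target: blank everywhere, except after the delimiter, where the model
# must reproduce `chars` from memory.
target_seq = numpy.concatenate(
  [numpy.zeros(len(chars) + len(blanks) + 1, dtype=chars.dtype), chars])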

generate_input_seq(self)[source]
Return type:list[int]
classmethod make_output_seq(input_seq)[source]
Return type:list[int]
generate_seq(self, seq_idx)[source]
Parameters:seq_idx (int) –
Return type:DatasetSeq
class GeneratingDataset.TaskXmlModelingDataset(limit_stack_depth=4, **kwargs)[source]

XML modeling memory task. This is a memory task where we need to remember a stack. Defined in Jozefowicz et al. (2015). Also tested for Associative LSTMs.
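
For intuition, a hedged sketch of what such data can look like (tag names and probabilities are made up; the dataset's actual generation differs in detail):

import random

def generate_xml_string(limit_stack_depth=4, num_steps=8, rng=random.Random(0)):
  """Sketch: random nested open/close tags, nesting bounded by limit_stack_depth."""
  stack, parts = [], []
  for _ in range(num_steps):
    if stack and (len(stack) >= limit_stack_depth or rng.random() < 0.5):
      parts.append("</%s>" % stack.pop())   # close the innermost open tag
    else:
      name = rng.choice("abcde")
      stack.append(name)
      parts.append("<%s>" % name)           # open a new tag
  while stack:                              # close whatever remains open
    parts.append("</%s>" % stack.pop())
  return "".join(parts)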

generate_input_seq(self)[source]
Return type:list[int]
classmethod make_output_seq(input_seq)[source]
Return type:list[int]
generate_seq(self, seq_idx)[source]
Parameters:seq_idx (int) –
Return type:DatasetSeq
class GeneratingDataset.TaskVariableAssignmentDataset(**kwargs)[source]

Variable Assignment memory task. This is a memory task to test for key-value retrieval. Defined in Associative LSTM paper.
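
The surface form below is made up purely for illustration (the dataset's actual alphabet and syntax may differ); the idea is: several key-value assignments, then a query, and the target is the queried value:

import random

def generate_example(rng=random.Random(0)):
  """Sketch: key-value assignments followed by one query (format made up)."""
  keys = rng.sample(["ab", "cd", "ef"], 2)
  store = {k: "%02d" % rng.randint(0, 99) for k in keys}
  query = rng.choice(keys)
  input_str = "".join("s(%s,%s)" % (k, v) for k, v in store.items()) + "q(%s)" % query
  target_str = store[query]   # e.g. "s(ab,12)s(cd,07)q(cd)" -> "07"
  return input_str, target_str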

generate_input_seq(self)[source]
Return type:list[int]
classmethod make_output_seq(input_seq)[source]
Return type:list[int]
generate_seq(self, seq_idx)[source]
Parameters:seq_idx (int) –
Return type:DatasetSeq
class GeneratingDataset.TaskNumberBaseConvertDataset(input_base=8, output_base=2, min_input_seq_len=1, max_input_seq_len=8, **kwargs)[source]

Task: convert some number from one base into another, e.g. get a number in octal and convert it to binary (e.g. “10101001”). See the sketch after the parameter list.

Parameters:
  • input_base (int) –
  • output_base (int) –
  • min_input_seq_len (int) –
  • max_input_seq_len (int) –
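
A pure-Python sketch of the conversion itself (not the dataset's internal code), matching the octal-to-binary example above:

def convert_base(digits, input_base=8, output_base=2):
  value = 0
  for d in digits:            # interpret the digit list in input_base
    value = value * input_base + d
  if value == 0:
    return [0]
  out = []
  while value:                # re-encode the value in output_base
    out.append(value % output_base)
    value //= output_base
  return out[::-1]

# Octal 251 is decimal 169, i.e. binary 10101001 (the example above):
assert convert_base([2, 5, 1], 8, 2) == [1, 0, 1, 0, 1, 0, 0, 1]
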
get_random_input_seq_len(self)[source]
Return type:int
generate_input_seq(self)[source]
Return type:list[int]
make_output_seq(self, input_seq)[source]
Parameters:input_seq (list[int]) –
Return type:list[int]
generate_seq(self, seq_idx)[source]
Parameters:seq_idx (int) –
Return type:DatasetSeq
class GeneratingDataset.DummyDataset(input_dim, output_dim, num_seqs, seq_len=2, input_max_value=10.0, input_shift=None, input_scale=None, **kwargs)[source]

Some dummy data which does not have any meaning. If you want artificial data with some meaning, look at the other datasets here. The inputs are dense data; the outputs are sparse. See the usage sketch after the parameter list.

Parameters:
  • input_dim (int) –
  • output_dim (int) –
  • num_seqs (int|float) –
  • seq_len (int|dict[str,int]) –
  • input_max_value (float) –
  • input_shift (float|None) –
  • input_scale (float|None) –
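
A minimal usage sketch (module layout assumed as in this doc):

from GeneratingDataset import DummyDataset

dataset = DummyDataset(input_dim=2, output_dim=3, num_seqs=4, seq_len=5)
dataset.init_seq_order(epoch=1)
dataset.load_seqs(0, 2)
print(dataset.get_data(0, "data").shape)   # (5, 2), dense float input
print(dataset.get_data(0, "classes"))      # 5 sparse int labels in [0, 3)
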
generate_seq(self, seq_idx)[source]
Parameters:seq_idx (int) –
Return type:DatasetSeq
class GeneratingDataset.DummyDatasetMultipleSequenceLength(input_dim, output_dim, num_seqs, seq_len=None, input_max_value=10.0, input_shift=None, input_scale=None, **kwargs)[source]

Like DummyDataset but provides seqs with different sequence lengths.

Parameters:
  • input_dim (int) –
  • output_dim (int) –
  • num_seqs (int|float) –
  • seq_len (int|dict[str,int]) –
  • input_max_value (float) –
  • input_shift (float|None) –
  • input_scale (float|None) –
generate_seq(self, seq_idx)[source]
Parameters:seq_idx (int) –
Return type:DatasetSeq
class GeneratingDataset.StaticDataset(data, target_list=None, output_dim=None, input_dim=None, **kwargs)[source]

Provide all the data as a list of dict of numpy arrays.

Parameters:
  • data (list[dict[str,numpy.ndarray]]) – list of seqs, each provide the data for each data-key
  • input_dim (int|None) –
  • output_dim (int|dict[str,(int,int)|list[int]]) –
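
A construction sketch (shapes and dims illustrative; the output_dim entries follow the (dim, ndim) convention, i.e. dense 2D "data" and sparse "classes" here):

import numpy
from GeneratingDataset import StaticDataset

data = [
  {"data": numpy.zeros((7, 3), dtype="float32"),
   "classes": numpy.array([1, 4, 2], dtype="int32")},
  {"data": numpy.zeros((5, 3), dtype="float32"),
   "classes": numpy.array([0, 3], dtype="int32")},
]
dataset = StaticDataset(data=data, output_dim={"data": (3, 2), "classes": (5, 1)})
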
classmethod copy_from_dataset(dataset, start_seq_idx=0, max_seqs=None)[source]
Parameters:
  • dataset (Dataset) –
  • start_seq_idx (int) –
  • max_seqs (int|None) –
Return type:

StaticDataset

generate_seq(self, seq_idx)[source]
Parameters:seq_idx (int) –
Return type:DatasetSeq
get_data_keys(self)[source]
Return type:list[str]
get_target_list(self)[source]
Return type:list[str]
get_data_dtype(self, key)[source]
Parameters:key (str) –
Return type:str
class GeneratingDataset.CopyTaskDataset(nsymbols, minlen=0, maxlen=0, minlen_epoch_factor=0, maxlen_epoch_factor=0, **kwargs)[source]

Copy task. Input/output is exactly the same random sequence of sparse labels.

Parameters:
  • nsymbols (int) –
  • minlen (int) –
  • maxlen (int) –
  • minlen_epoch_factor (float) –
  • maxlen_epoch_factor (float) –
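
A config-dict sketch, as one would pass to tools/dump-dataset.py or use in a RETURNN config (values illustrative):

train = {"class": "CopyTaskDataset", "nsymbols": 10, "minlen": 5, "maxlen": 20}
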
get_random_seq_len(self)[source]
Return type:int
generate_seq(self, seq_idx)[source]
Return type:DatasetSeq
class GeneratingDataset.ExtractAudioFeatures(window_len=0.025, step_len=0.01, num_feature_filters=None, with_delta=False, norm_mean=None, norm_std_dev=None, features='mfcc', feature_options=None, random_permute=None, random_state=None, raw_ogg_opts=None, post_process=None, sample_rate=None, peak_normalization=True, preemphasis=None, join_frames=None)[source]

Currently uses librosa to extract MFCC/log-mel features. (Alternatives: python_speech_features, talkbox.features.mfcc, librosa)

Parameters:
  • window_len (float) – in seconds
  • step_len (float) – in seconds
  • num_feature_filters (int) –
  • with_delta (bool|int) –
  • norm_mean (numpy.ndarray|str|int|float|None) – if str, will interpret as filename
  • norm_std_dev (numpy.ndarray|str|int|float|None) – if str, will interpret as filename
  • features (str) – “mfcc”, “log_mel_filterbank”, “log_log_mel_filterbank”, “raw”, “raw_ogg”
  • feature_options (dict[str]|None) – provide additional parameters for the feature function
  • random_permute (CollectionReadCheckCovered|dict[str]|bool|None) –
  • random_state (numpy.random.RandomState|None) –
  • raw_ogg_opts (dict[str]|None) –
  • post_process (function) –
  • sample_rate (int|None) –
  • peak_normalization (bool) – set to False to disable the peak normalization for audio files
  • preemphasis (float|None) – set a preemphasis filter coefficient
  • join_frames (int|None) – concatenate multiple frames together to a superframe
Returns:

(audio_len // int(step_len * sample_rate), (with_delta + 1) * num_feature_filters), float32

Return type:

numpy.ndarray

get_audio_features_from_raw_bytes(self, raw_bytes, seq_name=None)[source]
Parameters:
  • raw_bytes (io.BytesIO) –
  • seq_name (str|None) –
Returns:

shape (time,feature_dim)

Return type:

numpy.ndarray

get_audio_features(self, audio, sample_rate, seq_name=None)[source]
Parameters:
  • audio (numpy.ndarray) – raw audio samples, shape (audio_len,)
  • sample_rate (int) – e.g. 22050
  • seq_name (str|None) –
Returns:

array (time,dim), dim == self.get_feature_dimension()

Return type:

numpy.ndarray

get_feature_dimension(self)[source]
Return type:int
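
A usage sketch (assumes the soundfile library for reading audio; the file name is hypothetical):

import soundfile  # reads wav/flac/ogg and more

from GeneratingDataset import ExtractAudioFeatures

extractor = ExtractAudioFeatures(window_len=0.025, step_len=0.01,
                                 num_feature_filters=40, features="mfcc")
audio, sample_rate = soundfile.read("example.wav")   # hypothetical file
feats = extractor.get_audio_features(audio, sample_rate=sample_rate)
print(feats.shape)   # (time, dim), dim == extractor.get_feature_dimension()
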
class GeneratingDataset.TimitDataset(timit_dir, train=True, preload=False, num_feature_filters=40, feature_window_len=0.025, feature_step_len=0.01, with_delta=False, norm_mean=None, norm_std_dev=None, random_permute_audio=None, num_phones=61, demo_play_audio=False, fixed_random_seed=None, **kwargs)[source]

DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus. You must provide the data.

Demo:

tools/dump-dataset.py "{'class': 'TimitDataset', 'timit_dir': '…'}"
tools/dump-dataset.py "{'class': 'TimitDataset', 'timit_dir': '…', 'demo_play_audio': True, 'random_permute_audio': True}"

The full train data has 3696 utterances and the core test data has 192 utterances (24-speaker core test set).

For some references: https://github.com/ppwwyyxx/tensorpack/blob/master/examples/CTC-TIMIT/train-timit.py https://www.cs.toronto.edu/~graves/preprint.pdf https://arxiv.org/pdf/1303.5778.pdf https://arxiv.org/pdf/0804.3269.pdf

Parameters:
  • timit_dir (str|None) – directory of TIMIT. should contain train/filelist.phn and test/filelist.core.phn
  • train (bool) – whether to use the train or core test data
  • preload (bool) – if True, we will wait at __init__ until all the data is loaded
  • num_feature_filters (int) – e.g. number of MFCCs
  • with_delta (bool|int) – whether to add delta features (doubles the features dim). if int, up to this degree
  • norm_mean (str) – file with mean values which are used for mean-normalization of the final features
  • norm_std_dev (str) – file with std dev values for variance-normalization of the final features
  • random_permute_audio (None|bool|dict[str]) – enables permutation on the audio. see _get_random_permuted_audio
  • num_phones (int) – 39, 48 or 61. num labels of our classes
  • demo_play_audio (bool) – plays the audio. only makes sense with tools/dump-dataset.py
  • fixed_random_seed (None|int) – if given, use this fixed random seed in every epoch
PhoneMapTo39 = {'aa': 'aa', 'ae': 'ae', 'ah': 'ah', 'ao': 'aa', 'aw': 'aw', 'ax': 'ah', 'ax-h': 'ah', 'axr': 'er', 'ay': 'ay', 'b': 'b', 'bcl': 'sil', 'ch': 'ch', 'd': 'd', 'dcl': 'sil', 'dh': 'dh', 'dx': 'dx', 'eh': 'eh', 'el': 'l', 'em': 'm', 'en': 'n', 'eng': 'ng', 'epi': 'sil', 'er': 'er', 'ey': 'ey', 'f': 'f', 'g': 'g', 'gcl': 'sil', 'h#': 'sil', 'hh': 'hh', 'hv': 'hh', 'ih': 'ih', 'ix': 'ih', 'iy': 'iy', 'jh': 'jh', 'k': 'k', 'kcl': 'sil', 'l': 'l', 'm': 'm', 'n': 'n', 'ng': 'ng', 'nx': 'n', 'ow': 'ow', 'oy': 'oy', 'p': 'p', 'pau': 'sil', 'pcl': 'sil', 'q': None, 'r': 'r', 's': 's', 'sh': 'sh', 't': 't', 'tcl': 'sil', 'th': 'th', 'uh': 'uh', 'uw': 'uw', 'ux': 'uw', 'v': 'v', 'w': 'w', 'y': 'y', 'z': 'z', 'zh': 'sh'}[source]
PhoneMapTo48 = {'aa': 'aa', 'ae': 'ae', 'ah': 'ah', 'ao': 'ao', 'aw': 'aw', 'ax': 'ax', 'ax-h': 'ax', 'axr': 'er', 'ay': 'ay', 'b': 'b', 'bcl': 'vcl', 'ch': 'ch', 'd': 'd', 'dcl': 'vcl', 'dh': 'dh', 'dx': 'dx', 'eh': 'eh', 'el': 'el', 'em': 'm', 'en': 'en', 'eng': 'ng', 'epi': 'epi', 'er': 'er', 'ey': 'ey', 'f': 'f', 'g': 'g', 'gcl': 'vcl', 'h#': 'sil', 'hh': 'hh', 'hv': 'hh', 'ih': 'ih', 'ix': 'ix', 'iy': 'iy', 'jh': 'jh', 'k': 'k', 'kcl': 'cl', 'l': 'l', 'm': 'm', 'n': 'n', 'ng': 'ng', 'nx': 'n', 'ow': 'ow', 'oy': 'oy', 'p': 'p', 'pau': 'sil', 'pcl': 'cl', 'q': None, 'r': 'r', 's': 's', 'sh': 'sh', 't': 't', 'tcl': 'cl', 'th': 'th', 'uh': 'uh', 'uw': 'uw', 'ux': 'uw', 'v': 'v', 'w': 'w', 'y': 'y', 'z': 'z', 'zh': 'zh'}[source]
Phones61 = dict_keys(['aa', 'ae', 'ah', 'ao', 'aw', 'ax', 'ax-h', 'axr', 'ay', 'b', 'bcl', 'ch', 'd', 'dcl', 'dh', 'dx', 'eh', 'el', 'em', 'en', 'eng', 'epi', 'er', 'ey', 'f', 'g', 'gcl', 'h#', 'hh', 'hv', 'ih', 'ix', 'iy', 'jh', 'k', 'kcl', 'l', 'm', 'n', 'ng', 'nx', 'ow', 'oy', 'p', 'pau', 'pcl', 'q', 'r', 's', 'sh', 't', 'tcl', 'th', 'uh', 'uw', 'ux', 'v', 'w', 'y', 'z', 'zh'])[source]
PhoneMapTo61 = {'aa': 'aa', 'ae': 'ae', 'ah': 'ah', 'ao': 'ao', 'aw': 'aw', 'ax': 'ax', 'ax-h': 'ax-h', 'axr': 'axr', 'ay': 'ay', 'b': 'b', 'bcl': 'bcl', 'ch': 'ch', 'd': 'd', 'dcl': 'dcl', 'dh': 'dh', 'dx': 'dx', 'eh': 'eh', 'el': 'el', 'em': 'em', 'en': 'en', 'eng': 'eng', 'epi': 'epi', 'er': 'er', 'ey': 'ey', 'f': 'f', 'g': 'g', 'gcl': 'gcl', 'h#': 'h#', 'hh': 'hh', 'hv': 'hv', 'ih': 'ih', 'ix': 'ix', 'iy': 'iy', 'jh': 'jh', 'k': 'k', 'kcl': 'kcl', 'l': 'l', 'm': 'm', 'n': 'n', 'ng': 'ng', 'nx': 'nx', 'ow': 'ow', 'oy': 'oy', 'p': 'p', 'pau': 'pau', 'pcl': 'pcl', 'q': 'q', 'r': 'r', 's': 's', 'sh': 'sh', 't': 't', 'tcl': 'tcl', 'th': 'th', 'uh': 'uh', 'uw': 'uw', 'ux': 'ux', 'v': 'v', 'w': 'w', 'y': 'y', 'z': 'z', 'zh': 'zh'}[source]
classmethod get_label_map(source_num_phones=61, target_num_phones=39)[source]
Parameters:
  • source_num_phones (int) –
  • target_num_phones (int) –
Return type:

dict[int,int|None]
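
For example, to collapse 61-phone labels to the 39-phone set:

from GeneratingDataset import TimitDataset

label_map = TimitDataset.get_label_map(source_num_phones=61, target_num_phones=39)
# label_map[i] is the 39-set index for 61-set index i, or None for phones
# that are dropped entirely (e.g. 'q' maps to None in PhoneMapTo39).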

init_seq_order(self, epoch=None, seq_list=None)[source]
Parameters:
  • epoch (int) –
  • seq_list (list[str]|None) –
Return type:

bool

class GeneratingDataset.NltkTimitDataset(nltk_download_dir=None, **kwargs)[source]

DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus

This Dataset will get TIMIT via NLTK. Demo:

tools/dump-dataset.py "{'class': 'NltkTimitDataset'}"
tools/dump-dataset.py "{'class': 'NltkTimitDataset', 'demo_play_audio': True, 'random_permute_audio': True}"

Note: The NLTK data only contains a subset of the train data (160 utterances), and none of the test data. The full train data has 3696 utterances and the core test data has 192 utterances. Not sure how useful this is…

class GeneratingDataset.Vocabulary(vocab_file, seq_postfix=None, unknown_label='UNK', num_labels=None)[source]

Represents a vocabulary (set of words, and their ids). Used by BytePairEncoding.

Parameters:
  • vocab_file (str) –
  • unknown_label (str|None) –
  • num_labels (int) – just for verification
  • seq_postfix (list[int]|None) – labels will be added to the seq in self.get_seq
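
A usage sketch; "vocab.txt" is hypothetical and, per create_vocab_dict_from_labels below, is expected to contain a Python dict literal mapping label (str) to id (int):

from GeneratingDataset import Vocabulary

vocab = Vocabulary(vocab_file="vocab.txt", unknown_label="UNK")
ids = vocab.get_seq("hello world")   # whitespace-separated lookup, see get_seq
print(vocab.get_seq_labels(ids))     # ids back to a label string
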
classmethod create_vocab(**opts)[source]
Parameters:opts – kwargs for class
Return type:Vocabulary|BytePairEncoding|CharacterTargets
classmethod create_vocab_dict_from_labels(labels)[source]

This is exactly the format which we expect when we read it in self._parse_vocab.

Parameters:labels (list[str]) –
Return type:dict[str,int]
tf_get_init_variable_func(self, var)[source]
Parameters:var (tensorflow.Variable) –
Return type:(tensorflow.Session)->None
get_seq(self, sentence)[source]
Parameters:sentence (str) – assumed to be seq of vocab entries separated by whitespace
Return type:list[int]
get_seq_indices(self, seq)[source]
Parameters:seq (list[str]) –
Return type:list[int]
get_seq_labels(self, seq)[source]
Parameters:seq (list[int]) –
Return type:str
class GeneratingDataset.BytePairEncoding(vocab_file, bpe_file, seq_postfix=None, unknown_label='UNK')[source]

Code is partly taken from subword-nmt/apply_bpe.py. Author: Rico Sennrich, code under MIT license.

Use operations learned with learn_bpe.py to encode a new text. The text will not be smaller, but use only a fixed vocabulary, with rare words encoded as variable-length sequences of subword units.

Reference: Rico Sennrich, Barry Haddow and Alexandra Birch (2016). Neural Machine Translation of Rare Words with Subword Units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). Berlin, Germany.

Parameters:
  • vocab_file (str) –
  • bpe_file (str) –
  • seq_postfix (list[int]|None) – labels will be added to the seq in self.get_seq
  • unknown_label (str|None) –
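
A usage sketch (file paths hypothetical; codes and vocab as produced by subword-nmt's learn_bpe.py):

from GeneratingDataset import BytePairEncoding

bpe = BytePairEncoding(vocab_file="bpe.vocab", bpe_file="bpe.codes")
ids = bpe.get_seq("a rare word")   # rare words become subword-unit ids
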
check_vocab_and_split(self, orig, bpe_codes, vocab, separator)[source]

Check for each segment in word if it is in-vocabulary, and segment OOV segments into smaller units by reversing the BPE merge operations

recursive_split(self, segment, bpe_codes, vocab, separator, final=False)[source]

Recursively split segment into smaller units (by reversing BPE merges) until all units are either in-vocabulary, or cannot be split further.

get_seq(self, sentence)[source]
Parameters:sentence (str) –
Return type:list[int]
class GeneratingDataset.CharacterTargets(vocab_file, seq_postfix=None, unknown_label='@')[source]

Uses characters as target labels.

Parameters:
  • vocab_file (str) –
  • seq_postfix (list[int]|None) – labels will be added to the seq in self.get_seq
  • unknown_label (str|None) –
get_seq(self, sentence)[source]
Parameters:sentence (str) –
Return type:list[int]
class GeneratingDataset.BlissDataset(path, vocab_file, bpe_file=None, num_feature_filters=40, feature_window_len=0.025, feature_step_len=0.01, with_delta=False, norm_mean=None, norm_std_dev=None, **kwargs)[source]

Reads in a Bliss XML corpus (similar to LmDataset), and provides the features (similar to TimitDataset) and the orthography as words, subwords or chars (similar to TranslationDataset).

Example:
./tools/dump-dataset.py "{'class': 'BlissDataset',
  'path': '/u/tuske/work/ASR/switchboard/corpus/xml/train.corpus.gz',
  'bpe_file': '/u/zeyer/setups/switchboard/subwords/swb-bpe-codes',
  'vocab_file': '/u/zeyer/setups/switchboard/subwords/swb-vocab'}"
Parameters:
  • path (str) – path to XML. can also be gzipped.
  • vocab_file (str) – path to vocabulary file. Python-str which evals to dict[str,int]
  • bpe_file (str) – Byte-pair encoding file
  • num_feature_filters (int) – e.g. number of MFCCs
  • with_delta (bool|int) – whether to add delta features (doubles the features dim). if int, up to this degree
class SeqInfo[source]

Covers all relevant seq info.

audio_end[source]
audio_path[source]
audio_start[source]
idx[source]
orth_raw[source]
orth_seq[source]
tag[source]
init_seq_order(self, epoch=None, seq_list=None)[source]
Parameters:
  • epoch (int|None) –
  • seq_list (list[str]|None) – In case we want to set a predefined order.
Return type:

bool

Returns: whether the order changed (True is always safe to return)

class GeneratingDataset.LibriSpeechCorpus(path, prefix, audio, orth_post_process=None, targets=None, chars=None, bpe=None, use_zip=False, use_ogg=False, use_cache_manager=False, fixed_random_seed=None, fixed_random_subset=None, epoch_wise_filter=None, name=None, **kwargs)[source]

LibriSpeech. http://www.openslr.org/12/

“train-*” Seq-length ‘data’ stats (default MFCC, every 10ms):
  281241 seqs, Mean: 1230.94154835176, Std dev: 383.5126785278322, Min/max: 84 / 2974
“train-*” Seq-length ‘classes’ stats (BPE with 10k symbols):
  281241 seqs, Mean: 58.46585312952222, Std dev: 20.54464373013634, Min/max: 1 / 161
“train-*” mean transcription len: 177.009085 (chars), i.e. ~3 chars per BPE label

Parameters:
  • path (str) – dir, should contain “train-*/*/*/{*.flac,*.trans.txt}”, or “train-*.zip”
  • prefix (str) – “train”, “dev”, “test”, “dev-clean”, “dev-other”, …
  • orth_post_process (str|list[str]|None) – get_post_processor_function(), applied on orth
  • targets (str|None) – “bpe” or “chars” currently, if None, then “bpe”
  • audio (dict[str]|None) – options for ExtractAudioFeatures
  • bpe (dict[str]|None) – options for BytePairEncoding
  • chars (dict[str]|None) – options for CharacterTargets
  • use_zip (bool) – whether to use the ZIP files instead (better for NFS)
  • use_ogg (bool) – add .ogg postfix to all files
  • use_cache_manager (bool) – uses Util.cf()
  • fixed_random_seed (int|None) – for the shuffling, e.g. for seq_ordering=’random’. otherwise epoch will be used
  • fixed_random_subset (float|int|None) – Value in [0,1] to specify the fraction, or integer >=1 which specifies number of seqs. If given, will use this random subset. This will be applied initially at loading time, i.e. not dependent on the epoch. It will use an internally hardcoded fixed random seed, i.e. it’s deterministic.
  • epoch_wise_filter (dict|None) – see init_seq_order
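
A config-dict sketch (paths and option values illustrative):

train = {
  "class": "LibriSpeechCorpus",
  "path": "data/librispeech",    # would contain train-*.zip with use_zip=True
  "prefix": "train",
  "use_zip": True,
  "audio": {"num_feature_filters": 40},   # options for ExtractAudioFeatures
  "bpe": {"bpe_file": "bpe.codes", "vocab_file": "bpe.vocab"},
}
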
init_seq_order(self, epoch=None, seq_list=None)[source]

If random_shuffle_epoch1, for epoch 1 with “random” ordering, we leave the given order as is. Otherwise, this is mostly the default behavior.

Parameters:
  • epoch (int|None) –
  • seq_list (list[str]|None) – In case we want to set a predefined order.
Return type:

bool

Returns: whether the order changed (True is always safe to return)

get_current_seq_order(self)[source]
Return type:list[int]
have_corpus_seq_idx(self)[source]
Return type:bool
get_corpus_seq_idx(self, seq_idx)[source]
Parameters:seq_idx (int) –
Return type:int
get_tag(self, seq_idx)[source]
Parameters:seq_idx (int) –
Return type:str
get_all_tags(self)[source]
Return type:list[str]
get_total_num_seqs(self)[source]
Return type:int
class GeneratingDataset.OggZipDataset(path, audio, targets, targets_post_process=None, use_cache_manager=False, fixed_random_seed=None, fixed_random_subset=None, epoch_wise_filter=None, **kwargs)[source]

Generic dataset which reads a zip file containing Ogg files for each sequence and a text document. The feature extraction settings are determined by the audio option, which is passed to ExtractAudioFeatures. It also supports wav files, and might even support other file formats readable by the ‘soundfile’ library (not tested). By setting audio or targets to None, the dataset can be used in text-only or audio-only mode. The content of the zip file is:

  • a .txt file with the same name as the zipfile, containing a python list of dictionaries
  • a subfolder with the same name as the zipfile, containing the audio files

The dictionaries in the .txt file must have the following structure:

[{'seq_name': 'arbitrary_sequence_name', 'text': 'some utterance text', 'duration': 2.3, 'file': 'sequence0.wav'}, ...]

If seq_name is not included, the seq_tag will be the name of the file. duration is mandatory, as this information is needed for the sequence sorting.

Parameters:
  • path (str) – filename to zip
  • audio (dict[str]|None) – options for ExtractAudioFeatures. use {} for default. None means to disable.
  • targets (dict[str]|None) – options for Vocabulary.create_vocab() (e.g. BytePairEncoding)
  • targets_post_process (str|list[str]|((str)->str)|None) – get_post_processor_function(), applied on orth
  • use_cache_manager (bool) – uses Util.cf()
  • fixed_random_seed (int|None) – for the shuffling, e.g. for seq_ordering=’random’. otherwise epoch will be used
  • fixed_random_subset (float|int|None) – Value in [0,1] to specify the fraction, or integer >=1 which specifies number of seqs. If given, will use this random subset. This will be applied initially at loading time, i.e. not dependent on the epoch. It will use an internally hardcoded fixed random seed, i.e. it’s deterministic.
  • epoch_wise_filter (dict|None) – see init_seq_order
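
A config-dict sketch (paths illustrative; the targets options are passed to Vocabulary.create_vocab(), which is assumed here to select BytePairEncoding from the BPE-style options):

train = {
  "class": "OggZipDataset",
  "path": "corpus.zip",
  "audio": {},    # {} = default feature extraction; None = text-only mode
  "targets": {"bpe_file": "bpe.codes", "vocab_file": "bpe.vocab"},
}
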
init_seq_order(self, epoch=None, seq_list=None)[source]

If random_shuffle_epoch1, for epoch 1 with “random” ordering, we leave the given order as is. Otherwise, this is mostly the default behavior.

Parameters:
  • epoch (int|None) –
  • seq_list (list[str]|None) – In case we want to set a predefined order.
Return type:

bool

Returns: whether the order changed (True is always safe to return)

get_current_seq_order(self)[source]
Return type:list[int]
have_corpus_seq_idx(self)[source]
Return type:bool
get_corpus_seq_idx(self, seq_idx)[source]
Parameters:seq_idx (int) –
Return type:int
get_tag(self, seq_idx)[source]
Parameters:seq_idx (int) –
Return type:str
get_all_tags(self)[source]
Return type:list[str]
get_total_num_seqs(self)[source]
Return type:int
class GeneratingDataset.Enwik8Corpus(path, subset, seq_len, fixed_random_seed=None, batch_num_seqs=None, subsubset=None, **kwargs)[source]

enwik8

Parameters:
  • path (str) –
  • subset (str) – “training”, “validation”, “test”
  • seq_len (int) –
  • fixed_random_seed (int|None) –
  • batch_num_seqs (int|None) – if given, will not shuffle the data but have it in such order, that with a given batch num_seqs setting, you could reuse the hidden state in an RNN
  • subsubset (int|(int,int)|None) – end, (start,end), or full
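
A config-dict sketch (path illustrative): character-level LM data in contiguous order via batch_num_seqs, so an RNN's hidden state could be carried across batches:

train = {"class": "Enwik8Corpus", "path": "data/enwik8", "subset": "training",
         "seq_len": 512, "batch_num_seqs": 32}
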
get_data_dtype(self, key)[source]
Parameters:key (str) –
Return type:str
init_seq_order(self, epoch=None, seq_list=None)[source]
Parameters:
  • epoch (int) –
  • seq_list (list[str]|None) –
Return type:

bool

GeneratingDataset.demo()[source]

Some demo for some of the GeneratingDataset classes.