returnn.datasets.generating#

Some datasets for artificially generated data.

class returnn.datasets.generating.GeneratingDataset(input_dim, output_dim, num_seqs=inf, **kwargs)[source]#

Some base class for datasets with artificially generated data.

Parameters:
  • input_dim (int|None) –

  • output_dim (int|dict[str,int|(int,int)|dict]) – if dict, can specify all data-keys

  • num_seqs (int|float) –

init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]#
Parameters:
  • epoch (int|None) –

  • seq_list (list[str]|None) – predefined order via tags

  • seq_order (list[int]|None) – predefined order via indices

This is called when we start a new epoch, or at initialization.

is_cached(start, end)[source]#
Parameters:
  • start (int) –

  • end (int) –

Return type:

bool

have_get_corpus_seq() → bool[source]#
Returns:

whether we have get_corpus_seq()

get_corpus_seq(corpus_seq_idx: int) → DatasetSeq[source]#
Parameters:

corpus_seq_idx

Returns:

seq

generate_seq(seq_idx: int) → DatasetSeq[source]#

This assumes that self.random is already initialized and seeded to something deterministic for the given seq_idx and epoch.

Parameters:

seq_idx – corpus seq idx

get_num_timesteps()[source]#
Return type:

int

property num_seqs: int[source]#
Returns:

num seqs for current epoch

get_total_num_seqs() → int[source]#
Returns:

total num seqs

have_corpus_seq_idx()[source]#
Returns:

whether we have get_corpus_seq_idx()

get_corpus_seq_idx(seq_idx: int) → int[source]#
Parameters:

seq_idx

Returns:

corpus seq idx

get_seq_length(seq_idx)[source]#
Parameters:

seq_idx (int) –

Return type:

returnn.util.NumbersDict

get_data(seq_idx, key)[source]#
Parameters:
  • seq_idx (int) –

  • key (str) –

Return type:

numpy.ndarray

get_input_data(seq_idx)[source]#
Parameters:

seq_idx (int) –

Return type:

numpy.ndarray

get_targets(target, seq_idx)[source]#
Parameters:
  • seq_idx (int) –

  • target (str) –

Return type:

numpy.ndarray

get_tag(seq_idx)[source]#
Parameters:

seq_idx (int) –

Return type:

str

get_all_tags()[source]#
Return type:

list[str]

get_current_seq_order() → Sequence[int][source]#
Returns:

seq order

class returnn.datasets.generating.Task12AXDataset(**kwargs)[source]#

12AX memory task. This is a simple memory task where there is an outer loop and an inner loop. Description here: https://psych.colorado.edu/~oreilly/pubs-abstr.html#OReillyFrank06

Parameters:
  • input_dim (int|None) –

  • output_dim (int|dict[str,int|(int,int)|dict]) – if dict, can specify all data-keys

  • num_seqs (int|float) –
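
A minimal iteration sketch (direct instantiation outside a RETURNN config; the loop uses the base Dataset API documented above):

    from returnn.datasets.generating import Task12AXDataset

    dataset = Task12AXDataset(num_seqs=10)
    dataset.init_seq_order(epoch=1)
    seq_idx = 0
    while dataset.is_less_than_num_seqs(seq_idx):
        dataset.load_seqs(seq_idx, seq_idx + 1)
        inputs = dataset.get_data(seq_idx, "data")      # one-hot encoded input symbols
        targets = dataset.get_data(seq_idx, "classes")  # sparse target labels
        seq_idx += 1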

get_random_seq_len()[source]#
Return type:

int

generate_input_seq(seq_len)[source]#

Somewhat made-up probability distribution, chosen such that at least some “R” labels occur in the output seq; otherwise, “R”s are really rare.

Parameters:

seq_len (int) –

Return type:

list[int]

classmethod make_output_seq(input_seq)[source]#
Return type:

list[int]

estimate_output_class_priors(num_trials, seq_len=10)[source]#
Parameters:

seq_len (int) –

Return type:

(float, float)

generate_seq(seq_idx)[source]#
Parameters:

seq_idx (int) –

Return type:

DatasetSeq

class returnn.datasets.generating.TaskEpisodicCopyDataset(**kwargs)[source]#

Episodic Copy memory task. This is a simple memory task where we need to remember a sequence. Described in: https://arxiv.org/abs/1511.06464 Also tested for Associative LSTMs. This is a variant where the lengths are random, both for the chars and for blanks.

Parameters:
  • input_dim (int|None) –

  • output_dim (int|dict[str,int|(int,int)|dict]) – if dict, can specify all data-keys

  • num_seqs (int|float) –

generate_input_seq()[source]#
Return type:

list[int]

classmethod make_output_seq(input_seq)[source]#
Return type:

list[int]

generate_seq(seq_idx)[source]#
Parameters:

seq_idx (int) –

Return type:

DatasetSeq

class returnn.datasets.generating.TaskXmlModelingDataset(limit_stack_depth=4, **kwargs)[source]#

XML modeling memory task. This is a memory task where we need to remember a stack. Defined in Jozefowicz et al. (2015). Also tested for Associative LSTMs.

Parameters:
  • input_dim (int|None) –

  • output_dim (int|dict[str,int|(int,int)|dict]) – if dict, can specify all data-keys

  • num_seqs (int|float) –

generate_input_seq()[source]#
Return type:

list[int]

classmethod make_output_seq(input_seq)[source]#
Return type:

list[int]

generate_seq(seq_idx)[source]#
Parameters:

seq_idx (int) –

Return type:

DatasetSeq

class returnn.datasets.generating.TaskVariableAssignmentDataset(**kwargs)[source]#

Variable Assignment memory task. This is a memory task to test for key-value retrieval. Defined in Associative LSTM paper.

Parameters:
  • input_dim (int|None) –

  • output_dim (int|dict[str,int|(int,int)|dict]) – if dict, can specify all data-keys

  • num_seqs (int|float) –

generate_input_seq()[source]#
Return type:

list[int]

classmethod make_output_seq(input_seq)[source]#
Return type:

list[int]

generate_seq(seq_idx)[source]#
Parameters:

seq_idx (int) –

Return type:

DatasetSeq

class returnn.datasets.generating.TaskNumberBaseConvertDataset(input_base=8, output_base=2, min_input_seq_len=1, max_input_seq_len=8, **kwargs)[source]#

Task: convert a number from one base into another, e.g. get some number in octal and convert it to binary (e.g. “10101001”).

Parameters:
  • input_base (int) –

  • output_base (int) –

  • min_input_seq_len (int) –

  • max_input_seq_len (int) –
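
A short usage sketch, assuming direct instantiation (the default bases convert octal to binary):

    from returnn.datasets.generating import TaskNumberBaseConvertDataset

    dataset = TaskNumberBaseConvertDataset(input_base=8, output_base=2, num_seqs=5)
    dataset.init_seq_order(epoch=1)
    dataset.load_seqs(0, 1)
    print(dataset.get_data(0, "data"))     # digit labels of a random number in base 8
    print(dataset.get_data(0, "classes"))  # digit labels of the same number in base 2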

get_random_input_seq_len()[source]#
Return type:

int

generate_input_seq()[source]#
Return type:

list[int]

make_output_seq(input_seq)[source]#
Parameters:

input_seq (list[int]) –

Return type:

list[int]

generate_seq(seq_idx)[source]#
Parameters:

seq_idx (int) –

Return type:

DatasetSeq

class returnn.datasets.generating.DummyDataset(input_dim, output_dim, num_seqs, seq_len=2, input_max_value=10.0, input_shift=None, input_scale=None, **kwargs)[source]#

Some dummy data which does not have any meaning. If you want artificial data with some meaning, look at the other datasets here. The inputs are dense data; the outputs are sparse.

Parameters:
  • input_dim (int|None) –

  • output_dim (int|dict[str,int|(int,int)|dict]) –

  • num_seqs (int|float) –

  • seq_len (int|dict[str,int]) –

  • input_max_value (float) –

  • input_shift (float|None) –

  • input_scale (float|None) –
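
A minimal sketch of the resulting shapes (dims here are illustrative):

    from returnn.datasets.generating import DummyDataset

    dataset = DummyDataset(input_dim=2, output_dim=3, num_seqs=4, seq_len=5)
    dataset.init_seq_order(epoch=1)
    dataset.load_seqs(0, 4)
    print(dataset.get_data(0, "data").shape)     # (5, 2), dense float inputs
    print(dataset.get_data(0, "classes").shape)  # (5,), sparse int targets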

generate_seq(seq_idx)[source]#
Parameters:

seq_idx (int) –

Return type:

DatasetSeq

class returnn.datasets.generating.DummyDatasetMultipleSequenceLength(input_dim, output_dim, num_seqs, seq_len=None, input_max_value=10.0, input_shift=None, input_scale=None, **kwargs)[source]#

Like DummyDataset, but provides seqs with different sequence lengths.

Parameters:
  • input_dim (int) –

  • output_dim (int) –

  • num_seqs (int|float) –

  • seq_len (int|dict[str,int]) –

  • input_max_value (float) –

  • input_shift (float|None) –

  • input_scale (float|None) –

generate_seq(seq_idx)[source]#
Parameters:

seq_idx (int) –

Return type:

DatasetSeq

class returnn.datasets.generating.DummyDatasetMultipleDataKeys(output_dim, num_seqs, seq_len=None, input_max_value=10.0, input_shift=None, input_scale=None, data_keys=None, **kwargs)[source]#

Like DummyDataset, this class provides dummy data without any meaning. But it extends DummyDataset such that it can provide data for multiple data keys, not only “data” and “classes” (those are also overridable, though the current implementation expects a “data” key). Further, output_dim is expected to be a dict, which defines the data format for each data key and thus also lets the user choose whether the data is sparse or dense. It also provides the functionality of DummyDatasetMultipleSequenceLength to customize the sequence length for each data key.

Parameters:
  • output_dim (dict[str,int|(int,int)|dict]) – dict defining the output for each data key (e.g. {“data”: [200, 2], “classes”: [100, 1]}).

  • num_seqs (int|float) –

  • seq_len (int|dict[str,int]) – definition of the sequence length for each data key, if int the given length is used for all data keys.

  • input_max_value (float) –

  • input_shift (float|None) –

  • input_scale (float|None) –

  • data_keys (list[str]|None) – explicit declaration of the data keys, if None “data” and “classes” are used.
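
A sketch with one extra data key (keys and dims here are illustrative; note that a “data” key is still expected):

    from returnn.datasets.generating import DummyDatasetMultipleDataKeys

    dataset = DummyDatasetMultipleDataKeys(
        output_dim={
            "data": [4, 2],     # dense, feature dim 4
            "classes": [7, 1],  # sparse, 7 classes
            "extra": [3, 1],    # additional sparse key
        },
        num_seqs=2,
        seq_len={"data": 10, "classes": 10, "extra": 5},
        data_keys=["data", "classes", "extra"],
    )
    dataset.init_seq_order(epoch=1)
    dataset.load_seqs(0, 1)
    print(dataset.get_data(0, "extra").shape)  # (5,)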

generate_seq(seq_idx)[source]#
Parameters:

seq_idx (int) –

Return type:

DatasetSeq

class returnn.datasets.generating.DummyGenericDataset(data_template: TensorDict | Dict[str, Tensor | Dict[str, Any]], num_seqs: int, *, seq_lens: None | int | Tuple[int, int] | Dict[str | Dim | None, int | Tuple[int, int]] = None, **kwargs)[source]#

Generate some random dummy data based on a tensor dict (like extern_data).

Parameters:
  • data_template – describes each tensor

  • num_seqs

  • seq_lens – either a fixed seq len, or a (min, max) range to draw from via randint; can be given per data key, per dim, or the same for all
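
A sketch using the dict-template form, assuming extern_data-style tensor options (the keys, dims, and exact option names here are illustrative):

    from returnn.datasets.generating import DummyGenericDataset

    dataset = DummyGenericDataset(
        data_template={
            "data": {"shape": (None, 3)},                             # dense [T, 3] floats
            "classes": {"shape": (None,), "dim": 5, "sparse": True},  # sparse [T] targets
        },
        num_seqs=3,
        seq_lens=(5, 10),  # random lengths drawn from this range
    )
    dataset.init_seq_order(epoch=1)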

get_data_keys() → List[str][source]#

data keys

get_target_list() → List[str][source]#

target keys

get_data_dtype(key: str) → str[source]#

dtype

is_data_sparse(key: str) → bool[source]#

sparse

get_data_shape(key: str) → List[int][source]#

Returns:

get_data(*, key).shape[1:], i.e. num-frames excluded

generate_seq(seq_idx: int) → DatasetSeq[source]#

generate seq (assuming self.random is in a correct state)

class returnn.datasets.generating.StaticDataset(data, target_list=None, input_dim=None, output_dim=None, **kwargs)[source]#

Provide all the data as a list of dict of numpy arrays.

Parameters:
  • data (list[dict[str,numpy.ndarray]]) – list of seqs, each provide the data for each data-key

  • target_list

  • input_dim (int|None) –

  • output_dim (int|dict[str,(int,int)|list[int]]) –
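
A minimal sketch with two seqs (dims are illustrative; output_dim uses the (dim, ndim) convention, i.e. ndim 1 for sparse, 2 for dense):

    import numpy
    from returnn.datasets.generating import StaticDataset

    data = [
        {"data": numpy.zeros((7, 3), dtype="float32"),
         "classes": numpy.array([1, 4, 0], dtype="int32")},
        {"data": numpy.ones((5, 3), dtype="float32"),
         "classes": numpy.array([2, 3], dtype="int32")},
    ]
    dataset = StaticDataset(data=data, output_dim={"data": (3, 2), "classes": (5, 1)})
    dataset.init_seq_order(epoch=1)
    dataset.load_seqs(0, 2)
    # with the default ordering, seq 1 is the second entry
    assert dataset.get_data(1, "classes").tolist() == [2, 3]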

classmethod copy_from_dataset(dataset, start_seq_idx=0, max_seqs=None)[source]#
Parameters:
  • dataset (Dataset) –

  • start_seq_idx (int) –

  • max_seqs (int|None) –

Return type:

StaticDataset

init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]#
Parameters:
  • epoch (int|None) –

  • seq_list (list[str]|None) – List of sequence tags, to set a predefined order.

  • seq_order (list[int]|None) – List of corpus sequence indices, to set a predefined order. Only possible if the dataset has such indices (see self.have_corpus_seq_idx()).

Return type:

bool

Returns:

whether the order changed (True is always safe to return)

supports_seq_order_sorting() → bool[source]#

supports sorting

get_data_keys()[source]#
Return type:

list[str]

get_target_list()[source]#
Return type:

list[str]

get_data_dtype(key)[source]#
Parameters:

key (str) –

Return type:

str

get_total_num_seqs()[source]#
Return type:

int

get_all_tags()[source]#
Returns:

list of all seq tags, of the whole dataset, without partition epoch.

Return type:

list[str]

get_tag(sorted_seq_idx)[source]#
Parameters:

sorted_seq_idx (int) –

Return type:

str

have_corpus_seq_idx()[source]#
Return type:

bool

Returns:

whether you can call self.get_corpus_seq_idx()

get_corpus_seq_idx(seq_idx)[source]#
Parameters:

seq_idx (int) – sorted sequence index from the current epoch, depending on seq_ordering

Returns:

the sequence index as-is in the original corpus (as if you would have sorting=”default”).

Return type:

int

class returnn.datasets.generating.CopyTaskDataset(nsymbols, minlen=0, maxlen=0, minlen_epoch_factor=0, maxlen_epoch_factor=0, **kwargs)[source]#

Copy task. Input/output is exactly the same random sequence of sparse labels.

Parameters:
  • nsymbols (int) –

  • minlen (int) –

  • maxlen (int) –

  • minlen_epoch_factor (float) –

  • maxlen_epoch_factor (float) –
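
A short sketch demonstrating the task, assuming direct instantiation (input and target are the identical label sequence):

    from returnn.datasets.generating import CopyTaskDataset

    dataset = CopyTaskDataset(nsymbols=10, minlen=5, maxlen=20, num_seqs=100)
    dataset.init_seq_order(epoch=1)
    dataset.load_seqs(0, 1)
    assert (dataset.get_data(0, "data") == dataset.get_data(0, "classes")).all()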

get_random_seq_len()[source]#
Return type:

int

generate_seq(seq_idx)[source]#
Return type:

DatasetSeq

class returnn.datasets.generating.TimitDataset(timit_dir, train=True, preload=False, features='mfcc', num_feature_filters=40, feature_window_len=0.025, feature_step_len=0.01, with_delta=False, norm_mean=None, norm_std_dev=None, random_permute_audio=None, num_phones=61, demo_play_audio=False, **kwargs)[source]#

DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus. You must provide the data.

Demo:

tools/dump-dataset.py "{'class': 'TimitDataset', 'timit_dir': '...'}"
tools/dump-dataset.py "{'class': 'TimitDataset', 'timit_dir': '...', 'demo_play_audio': True, 'random_permute_audio': True}"

The full train data has 3696 utterances and the core test data has 192 utterances (24-speaker core test set).

The input length is not the same as the output length. The targets are not the framewise alignment.

For some references:

https://github.com/ppwwyyxx/tensorpack/blob/master/examples/CTC-TIMIT/train-timit.py
https://www.cs.toronto.edu/~graves/preprint.pdf
https://arxiv.org/pdf/1303.5778.pdf
https://arxiv.org/pdf/0804.3269.pdf

Parameters:
  • timit_dir (str|None) – directory of TIMIT. should contain train/filelist.phn and test/filelist.core.phn

  • train (bool) – whether to use the train or core test data

  • preload (bool) – if True, here at __init__, we will wait until we loaded all the data

  • features (str|function) – see ExtractAudioFeatures

  • num_feature_filters (int) – e.g. number of MFCCs

  • with_delta (bool|int) – whether to add delta features (doubles the features dim). if int, up to this degree

  • norm_mean (str) – file with mean values which are used for mean-normalization of the final features

  • norm_std_dev (str) – file with std dev values for variance-normalization of the final features

  • random_permute_audio (None|bool|dict[str]) – enables permutation on the audio. see _get_random_permuted_audio

  • num_phones (int) – 39, 48 or 61. num labels of our classes

  • demo_play_audio (bool) – plays the audio. Only makes sense with tools/dump-dataset.py

PhoneMapTo39 = {'aa': 'aa', 'ae': 'ae', 'ah': 'ah', 'ao': 'aa', 'aw': 'aw', 'ax': 'ah', 'ax-h': 'ah', 'axr': 'er', 'ay': 'ay', 'b': 'b', 'bcl': 'sil', 'ch': 'ch', 'd': 'd', 'dcl': 'sil', 'dh': 'dh', 'dx': 'dx', 'eh': 'eh', 'el': 'l', 'em': 'm', 'en': 'n', 'eng': 'ng', 'epi': 'sil', 'er': 'er', 'ey': 'ey', 'f': 'f', 'g': 'g', 'gcl': 'sil', 'h#': 'sil', 'hh': 'hh', 'hv': 'hh', 'ih': 'ih', 'ix': 'ih', 'iy': 'iy', 'jh': 'jh', 'k': 'k', 'kcl': 'sil', 'l': 'l', 'm': 'm', 'n': 'n', 'ng': 'ng', 'nx': 'n', 'ow': 'ow', 'oy': 'oy', 'p': 'p', 'pau': 'sil', 'pcl': 'sil', 'q': None, 'r': 'r', 's': 's', 'sh': 'sh', 't': 't', 'tcl': 'sil', 'th': 'th', 'uh': 'uh', 'uw': 'uw', 'ux': 'uw', 'v': 'v', 'w': 'w', 'y': 'y', 'z': 'z', 'zh': 'sh'}[source]#
PhoneMapTo48 = {'aa': 'aa', 'ae': 'ae', 'ah': 'ah', 'ao': 'ao', 'aw': 'aw', 'ax': 'ax', 'ax-h': 'ax', 'axr': 'er', 'ay': 'ay', 'b': 'b', 'bcl': 'vcl', 'ch': 'ch', 'd': 'd', 'dcl': 'vcl', 'dh': 'dh', 'dx': 'dx', 'eh': 'eh', 'el': 'el', 'em': 'm', 'en': 'en', 'eng': 'ng', 'epi': 'epi', 'er': 'er', 'ey': 'ey', 'f': 'f', 'g': 'g', 'gcl': 'vcl', 'h#': 'sil', 'hh': 'hh', 'hv': 'hh', 'ih': 'ih', 'ix': 'ix', 'iy': 'iy', 'jh': 'jh', 'k': 'k', 'kcl': 'cl', 'l': 'l', 'm': 'm', 'n': 'n', 'ng': 'ng', 'nx': 'n', 'ow': 'ow', 'oy': 'oy', 'p': 'p', 'pau': 'sil', 'pcl': 'cl', 'q': None, 'r': 'r', 's': 's', 'sh': 'sh', 't': 't', 'tcl': 'cl', 'th': 'th', 'uh': 'uh', 'uw': 'uw', 'ux': 'uw', 'v': 'v', 'w': 'w', 'y': 'y', 'z': 'z', 'zh': 'zh'}[source]#
Phones61 = dict_keys(['aa', 'ae', 'ah', 'ao', 'aw', 'ax', 'ax-h', 'axr', 'ay', 'b', 'bcl', 'ch', 'd', 'dcl', 'dh', 'dx', 'eh', 'el', 'em', 'en', 'eng', 'epi', 'er', 'ey', 'f', 'g', 'gcl', 'h#', 'hh', 'hv', 'ih', 'ix', 'iy', 'jh', 'k', 'kcl', 'l', 'm', 'n', 'ng', 'nx', 'ow', 'oy', 'p', 'pau', 'pcl', 'q', 'r', 's', 'sh', 't', 'tcl', 'th', 'uh', 'uw', 'ux', 'v', 'w', 'y', 'z', 'zh'])[source]#
PhoneMapTo61 = {'aa': 'aa', 'ae': 'ae', 'ah': 'ah', 'ao': 'ao', 'aw': 'aw', 'ax': 'ax', 'ax-h': 'ax-h', 'axr': 'axr', 'ay': 'ay', 'b': 'b', 'bcl': 'bcl', 'ch': 'ch', 'd': 'd', 'dcl': 'dcl', 'dh': 'dh', 'dx': 'dx', 'eh': 'eh', 'el': 'el', 'em': 'em', 'en': 'en', 'eng': 'eng', 'epi': 'epi', 'er': 'er', 'ey': 'ey', 'f': 'f', 'g': 'g', 'gcl': 'gcl', 'h#': 'h#', 'hh': 'hh', 'hv': 'hv', 'ih': 'ih', 'ix': 'ix', 'iy': 'iy', 'jh': 'jh', 'k': 'k', 'kcl': 'kcl', 'l': 'l', 'm': 'm', 'n': 'n', 'ng': 'ng', 'nx': 'nx', 'ow': 'ow', 'oy': 'oy', 'p': 'p', 'pau': 'pau', 'pcl': 'pcl', 'q': 'q', 'r': 'r', 's': 's', 'sh': 'sh', 't': 't', 'tcl': 'tcl', 'th': 'th', 'uh': 'uh', 'uw': 'uw', 'ux': 'ux', 'v': 'v', 'w': 'w', 'y': 'y', 'z': 'z', 'zh': 'zh'}[source]#
classmethod get_labels(num_phones=61)[source]#
Parameters:

num_phones (int) –

Return type:

list[str]

classmethod get_label_map(source_num_phones=61, target_num_phones=39)[source]#
Parameters:
  • source_num_phones (int) –

  • target_num_phones (int) –

Return type:

dict[int,int|None]
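
For instance, collapsing the 61-phone label set to the 39-phone evaluation set (a sketch using the two classmethods documented above):

    from returnn.datasets.generating import TimitDataset

    map_61_to_39 = TimitDataset.get_label_map(source_num_phones=61, target_num_phones=39)
    labels_39 = TimitDataset.get_labels(num_phones=39)
    # map a 61-set label index to its 39-set label string (None means dropped, like “q”)
    idx_39 = map_61_to_39[0]
    print(labels_39[idx_39] if idx_39 is not None else None)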

init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]#
Parameters:
  • epoch (int) –

  • seq_list (list[str]|None) –

  • seq_order (list[int]|None) –

Return type:

bool

supports_seq_order_sorting() → bool[source]#

supports sorting

class returnn.datasets.generating.NltkTimitDataset(nltk_download_dir=None, **kwargs)[source]#

DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus

This Dataset will get TIMIT via NLTK. Demo:

tools/dump-dataset.py "{'class': 'NltkTimitDataset'}"
tools/dump-dataset.py "{'class': 'NltkTimitDataset', 'demo_play_audio': True, 'random_permute_audio': True}"

Note: The NLTK data only contains a subset of the train data (160 utterances), and none of the test data. The full train data has 3696 utterances and the core test data has 192 utterances. Not sure how useful this is…

See TimitDataset for more.

Parameters:
  • timit_dir (str|None) – directory of TIMIT. should contain train/filelist.phn and test/filelist.core.phn

  • train (bool) – whether to use the train or core test data

  • preload (bool) – if True, here at __init__, we will wait until we loaded all the data

  • features (str|function) – see ExtractAudioFeatures

  • num_feature_filters (int) – e.g. number of MFCCs

  • with_delta (bool|int) – whether to add delta features (doubles the features dim). if int, up to this degree

  • norm_mean (str) – file with mean values which are used for mean-normalization of the final features

  • norm_std_dev (str) – file with std dev values for variance-normalization of the final features

  • random_permute_audio (None|bool|dict[str]) – enables permutation on the audio. see _get_random_permuted_audio

  • num_phones (int) – 39, 48 or 61. num labels of our classes

  • demo_play_audio (bool) – plays the audio. Only makes sense with tools/dump-dataset.py

added_data: List[DatasetSeq][source]#
lock: RLock | None[source]#
rnd_seq_drop: Optional[Random][source]#
num_outputs: Optional[Dict[str, Tuple[int, int]]][source]#
labels: Dict[str, List[str]][source]#
class returnn.datasets.generating.BlissDataset(path, vocab_file, bpe_file=None, num_feature_filters=40, feature_window_len=0.025, feature_step_len=0.01, with_delta=False, norm_mean=None, norm_std_dev=None, **kwargs)[source]#

Reads in a Bliss XML corpus (similar to LmDataset), and provides the features (similar to TimitDataset) and the orthography as words, subwords or chars (similar to TranslationDataset).

Example:

./tools/dump-dataset.py "{'class': 'BlissDataset',
    'path': '/u/tuske/work/ASR/switchboard/corpus/xml/train.corpus.gz',
    'bpe_file': '/u/zeyer/setups/switchboard/subwords/swb-bpe-codes',
    'vocab_file': '/u/zeyer/setups/switchboard/subwords/swb-vocab'}"

Parameters:
  • path (str) – path to XML. can also be gzipped.

  • vocab_file (str) – path to vocabulary file. Python-str which evals to dict[str,int]

  • bpe_file (str) – Byte-pair encoding file

  • num_feature_filters (int) – e.g. number of MFCCs

  • with_delta (bool|int) – whether to add delta features (doubles the features dim). if int, up to this degree

class SeqInfo[source]#

Covers all relevant seq info.

idx[source]#
tag[source]#
orth_raw[source]#
orth_seq[source]#
audio_path[source]#
audio_start[source]#
audio_end[source]#
init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]#
Parameters:
  • epoch (int|None) –

  • seq_list (list[str]|None) – Predefined order via list of tags, not used here.

  • seq_order (list[int]|None) – Predefined order via list of indices, not used here.

Return type:

bool

Returns:

whether the order changed (True is always safe to return)

class returnn.datasets.generating.LibriSpeechCorpus(path, prefix, audio, orth_post_process=None, targets=None, chars=None, bpe=None, use_zip=False, use_ogg=False, use_cache_manager=False, fixed_random_subset=None, epoch_wise_filter=None, name=None, seq_tag_format: str = '%(subdir)s-%(speaker)i-%(chapter)i-%(seq)04i', **kwargs)[source]#

LibriSpeech. https://www.openslr.org/12/

“train-*” seq-length “data” stats (default MFCC, every 10ms):

281241 seqs, mean: 1230.94154835176, std dev: 383.5126785278322, min/max: 84 / 2974

“train-*” seq-length “classes” stats (BPE with 10k symbols):

281241 seqs, mean: 58.46585312952222, std dev: 20.54464373013634, min/max: 1 / 161

“train-*” mean transcription len: 177.009085 (chars), i.e. ~3 chars per BPE label

Parameters:
  • path (str) – dir, should contain "train-*/*/*/{*.flac,*.trans.txt}", or "train-*.zip"

  • prefix (str) – “train”, “dev”, “test”, “dev-clean”, “dev-other”, …

  • orth_post_process (str|list[str]|None) – get_post_processor_function(), applied on orth

  • targets (str|dict[str]|None) – “bpe” or “chars” or None or dict for Vocabulary.create_vocab()

  • audio (dict[str]|None) – options for ExtractAudioFeatures

  • bpe (dict[str]|None) – options for BytePairEncoding

  • chars (dict[str]|None) – options for CharacterTargets

  • use_zip (bool) – whether to use the ZIP files instead (better for NFS)

  • use_ogg (bool) – add .ogg postfix to all files

  • use_cache_manager (bool) – uses Util.cf()

  • fixed_random_subset (float|int|None) – Value in [0,1] to specify the fraction, or integer >=1 which specifies number of seqs. If given, will use this random subset. This will be applied initially at loading time, i.e. not dependent on the epoch. It will use an internally hardcoded fixed random seed, i.e. it’s deterministic.

  • epoch_wise_filter (dict|None) – see init_seq_order

  • name

  • seq_tag_format – The default “%(subdir)s-%(speaker)i-%(chapter)i-%(seq)04i” gives e.g. “dev-other-116-288045-0000”. Note, via the Bliss corpus and also OggZipDataset, we have “dev-other/116-288045-0000/116-288045-0000”, so you might want to use “%(subdir)s/%(speaker)i-%(chapter)i-%(seq)04i/%(speaker)i-%(chapter)i-%(seq)04i”.
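
A hedged config sketch (all paths here are placeholders; see the parameter docs above for the available options):

    train = {
        "class": "LibriSpeechCorpus",
        "path": "/data/LibriSpeech",  # hypothetical corpus dir (or zips, with use_zip)
        "prefix": "train",
        "use_zip": True,
        "audio": {"features": "mfcc", "num_feature_filters": 40},
        "targets": "bpe",
        "bpe": {"bpe_file": "/data/bpe.codes",     # hypothetical
                "vocab_file": "/data/bpe.vocab"},  # hypothetical
    }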

init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]#

If random_shuffle_epoch1, for epoch 1 with “random” ordering, we leave the given order as is. Otherwise, this is mostly the default behavior.

Parameters:
  • epoch (int|None) –

  • seq_list (list[str]|None) – List of sequence tags, to set a predefined order.

  • seq_order (list[int]|None) – List of corpus sequence indices, to set a predefined order.

Return type:

bool

Returns:

whether the order changed (True is always safe to return)

supports_seq_order_sorting() → bool[source]#

supports sorting

get_current_seq_order()[source]#
Return type:

Sequence[int]

have_corpus_seq_idx()[source]#
Return type:

bool

get_corpus_seq_idx(seq_idx)[source]#
Parameters:

seq_idx (int) –

Return type:

int

get_tag(seq_idx)[source]#
Parameters:

seq_idx (int) –

Return type:

str

get_all_tags()[source]#
Return type:

list[str]

get_total_num_seqs()[source]#
Return type:

int

class returnn.datasets.generating.Enwik8Corpus(path, subset, seq_len, batch_num_seqs=None, subsubset=None, **kwargs)[source]#

enwik8

Parameters:
  • path (str) –

  • subset (str) – “training”, “validation”, “test”

  • seq_len (int) –

  • batch_num_seqs (int|None) – if given, will not shuffle the data but have it in such order, that with a given batch num_seqs setting, you could reuse the hidden state in an RNN

  • subsubset (int|(int,int)|None) – subset of the subset: end index, a (start, end) range, or None for the full subset
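
A minimal sketch (the path is a placeholder; the shape comment assumes seq_len frames per seq):

    from returnn.datasets.generating import Enwik8Corpus

    dataset = Enwik8Corpus(path="/data/enwik8", subset="validation", seq_len=512)
    dataset.init_seq_order(epoch=1)
    dataset.load_seqs(0, 1)
    print(dataset.get_data(0, "data").shape)  # (512,), byte-level labels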

get_data_dtype(key)[source]#
Parameters:

key (str) –

Return type:

str

init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]#
Parameters:
  • epoch (int) –

  • seq_list (list[str]|None) –

  • seq_order (list[int]|None) –

Return type:

bool

returnn.datasets.generating.demo()[source]#

Demo for some of the GeneratingDataset subclasses.