returnn.datasets.lm#

Provides LmDataset, TranslationDataset, and some related helpers.

class returnn.datasets.lm.LmDataset(corpus_file, skip_empty_lines=True, orth_symbols_file=None, orth_symbols_map_file=None, orth_replace_map_file=None, word_based=False, word_end_symbol=None, seq_end_symbol='[END]', unknown_symbol='[UNKNOWN]', parse_orth_opts=None, phone_info=None, add_random_phone_seqs=0, auto_replace_unknown_symbol=False, log_auto_replace_unknown_symbols=10, log_skipped_seqs=10, error_on_invalid_seq=True, add_delayed_seq_data=False, delayed_seq_data_start_symbol='[START]', **kwargs)[source]#

Dataset useful for language modeling. It creates index sequences for either words, characters or other orthographic symbols based on a vocabulary. Can also perform internal word-to-phoneme conversion with a lexicon file. Reads simple txt files or Bliss XML files (also gzipped).

To use the LmDataset with words or characters, either orth_symbols_file or orth_symbols_map_file has to be specified (but not both). If words should be used, set word_based to True.

The LmDataset also supports the conversion of words to phonemes with the help of the PhoneSeqGenerator class. To enable this mode, the input parameters to PhoneSeqGenerator have to be provided as a dict in phone_info. As a lexicon file has to be specified in this dict, orth_symbols_file and orth_symbols_map_file are not used in this case.

The LmDataset does not work without a vocabulary provided in one of the ways mentioned above.

After initialization, the corpus is represented by self.orths (as a list of sequences). The vocabulary is given by self.orth_symbols, and self.orth_symbols_map gives the corresponding mapping from symbol to integer index (in case phone_info is not set). A minimal configuration sketch follows the parameter list below.

Parameters:
  • corpus_file (str|()->str|list[str]|()->list[str]) – Bliss XML or line-based txt; optionally gzipped.

  • skip_empty_lines (bool) – for line-based txt

  • orth_symbols_file (str|()->str|None) – a text file containing a list of orthography symbols

  • orth_symbols_map_file (str|()->str|None) – either a text file with one “<symbol> <index>” per line, a Python dict with {“<symbol>”: <index>, …}, or a pickled dictionary

  • orth_replace_map_file (str|()->str|None) – JSON file with replacement dict for orth symbols.

  • word_based (bool) – whether to parse single words; otherwise, parsing is character-based.

  • word_end_symbol (str|None) – If provided and if word_based is False (character based modeling), token to be used to represent word ends.

  • seq_end_symbol (str|None) – what to add at the end, if given. will be set as postfix=[seq_end_symbol] or postfix=[] for parse_orth_opts.

  • unknown_symbol (str|None) – token to represent unknown words.

  • parse_orth_opts (dict[str]|None) – kwargs for parse_orthography().

  • phone_info (dict|None) – A dict containing parameters including a lexicon file for LmDataset.PhoneSeqGenerator.

  • add_random_phone_seqs (int) – will add random seqs with the same len as the real seq as additional data.

  • log_auto_replace_unknown_symbols (bool|int) – write about auto-replacements with unknown symbol. if this is an int, it will only log the first N replacements, and then keep quiet.

  • log_skipped_seqs (bool|int) – write about skipped seqs to logging, due to missing lexicon entry or so. if this is an int, it will only log the first N entries, and then keep quiet.

  • error_on_invalid_seq (bool) – if there is a seq we would have to skip, error.

  • add_delayed_seq_data (bool) – will add another data key “delayed” which will contain the sequence delayed_seq_data_start_symbol + original_sequence[:-1].

  • delayed_seq_data_start_symbol (str) – used for add_delayed_seq_data.
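
For illustration, a minimal LmDataset entry for a RETURNN config; file paths and symbol strings are placeholders, not prescribed values:

    # Sketch only: word-based LM dataset with a symbol-to-index map file.
    train = {
        "class": "LmDataset",
        "corpus_file": "/path/to/corpus.txt.gz",
        "orth_symbols_map_file": "/path/to/vocab.txt",  # lines of "<symbol> <index>"
        "word_based": True,
        "seq_end_symbol": "[END]",
        "unknown_symbol": "[UNKNOWN]",
        "auto_replace_unknown_symbol": True,
        "add_delayed_seq_data": True,  # adds the "delayed" data key (see get_target_list())
    }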

get_data_keys()[source]#
Return type:

list[str]

get_target_list()[source]#

Unfortunately, the logic is swapped around for this dataset. “data” is the original data, which is usually the target, and you would use “delayed” as inputs.

Return type:

list[str]
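
As a hedged illustration of this convention, a network sketch where “delayed” is the input and “data” is the target; layer names and sizes are placeholders:

    # Sketch only: LM-style network reading "delayed" and predicting "data".
    network = {
        "embed": {"class": "linear", "activation": None, "n_out": 512, "from": "data:delayed"},
        "lstm": {"class": "rec", "unit": "lstm", "n_out": 1024, "from": "embed"},
        "output": {"class": "softmax", "loss": "ce", "target": "data", "from": "lstm"},
    }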

get_data_dtype(key)[source]#
Parameters:

key (str) –

Return type:

str

init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]#

If random_shuffle_epoch1, for epoch 1 with “random” ordering, we leave the given order as is. Otherwise, this is mostly the default behavior.

Parameters:
  • epoch (int|None) –

  • seq_list (list[str]|None) – List of sequence tags, to set a predefined order.

  • seq_order (list[int]|None) – List of corpus sequence indices, to set a predefined order.

Return type:

bool

Returns:

whether the order changed (True is always safe to return)

supports_seq_order_sorting() bool[source]#

supports sorting

get_total_num_seqs() int[source]#

total num seqs

returnn.datasets.lm.iter_corpus(filename, callback, skip_empty_lines=True)[source]#
Parameters:
  • filename (str) –

  • callback ((str)->None) –

  • skip_empty_lines (bool) –

returnn.datasets.lm.read_corpus(filename, skip_empty_lines=True)[source]#
Parameters:
  • filename (str) – either Bliss XML or line-based text

  • skip_empty_lines (bool) – in case of line-based text, skip empty lines

Returns:

list of orthographies

Return type:

list[str]
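
A small usage sketch (the file path is a placeholder):

    from returnn.datasets.lm import read_corpus

    # Works for Bliss XML or line-based text, optionally gzipped.
    orths = read_corpus("/path/to/corpus.txt.gz", skip_empty_lines=True)
    print(len(orths), orths[0])  # list[str] of orthographies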

class returnn.datasets.lm.AllophoneState(id=None, state=None)[source]#

Represents one allophone (phone with context) state (number, boundary). In Sprint, see AllophoneStateAlphabet::index().

Parameters:
  • id (str) – phone

  • state (int|None) –

context_history = ()[source]#
context_future = ()[source]#
boundary = 0[source]#
id = None[source]#
state = None[source]#
format()[source]#
Return type:

str

copy()[source]#
Return type:

AllophoneState

mark_initial()[source]#

Add flag to self.boundary.

mark_final()[source]#

Add flag to self.boundary.

phoneme(ctx_offset, out_of_context_id=None)[source]#
    Phoneme::Id ContextPhonology::PhonemeInContext::phoneme(s16 pos) const {
      if (pos == 0)
        return phoneme_;
      else if (pos > 0) {
        if (u16(pos - 1) < context_.future.length())
          return context_.future[pos - 1];
        else
          return Phoneme::term;
      } else {
        verify(pos < 0);
        if (u16(-1 - pos) < context_.history.length())
          return context_.history[-1 - pos];
        else
          return Phoneme::term;
      }
    }

Parameters:
  • ctx_offset (int) – 0 for center, >0 for future, <0 for history

  • out_of_context_id (str|None) – what to return out of our context

Returns:

phone-id from the offset

Return type:

str

set_phoneme(ctx_offset, phone_id)[source]#
Parameters:
  • ctx_offset (int) – 0 for center, >0 for future, <0 for history

  • phone_id (str) –

phone_idx(ctx_offset, phone_idxs)[source]#
Parameters:
  • ctx_offset (int) – see self.phoneme()

  • phone_idxs (dict[str,int]) –

Return type:

int

index(phone_idxs, num_states=3, context_length=1)[source]#

See self.from_index() for the inverse function. And see Sprint NoStateTyingDense::classify().

Parameters:
  • phone_idxs (dict[str,int]) –

  • num_states (int) – how many states per allophone

  • context_length (int) – how much left/right context

Return type:

int

classmethod from_index(index, phone_ids, num_states=3, context_length=1)[source]#

Original Sprint C++ code:

    Mm::MixtureIndex NoStateTyingDense::classify(const AllophoneState& a) const {
      require_lt(a.allophone()->boundary, numBoundaryClasses_);
      require_le(0, a.state());
      require_lt(u32(a.state()), numStates_);
      u32 result = 0;
      for (u32 i = 0; i < 2 * contextLength_ + 1; ++i) {  // context len is usually 1
        // pos sequence: 0, -1, 1, [-2, 2, ...]
        s16 pos = i / 2;
        if (i % 2 == 1)
          pos = -pos - 1;
        result *= numPhoneClasses_;
        u32 phoneIdx = a.allophone()->phoneme(pos);
        require_lt(phoneIdx, numPhoneClasses_);
        result += phoneIdx;
      }
      result *= numStates_;
      result += u32(a.state());
      result *= numBoundaryClasses_;
      result += a.allophone()->boundary;
      require_lt(result, nClasses_);
      return result;
    }

Note that there is also AllophoneStateAlphabet::allophoneState, via Am/ClassicStateModel.cc, which unfortunately uses a different encoding. See from_classic_index().

Parameters:
  • index (int) –

  • phone_ids (dict[int,str]) – reverse-map from self.index(). idx -> id

  • num_states (int) – how many states per allophone

  • context_length (int) – how much left/right context

Return type:

AllophoneState
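
For intuition, a rough Python sketch of the dense encoding that index() computes and from_index() inverts, mirroring the C++ above. This is not the RETURNN implementation; 4 boundary classes are assumed (as in Sprint), and out-of-context positions are mapped to index 0 as a simplification:

    # Sketch: dense allophone-state index, mirroring NoStateTyingDense::classify().
    # phone_idxs maps phone symbol -> int; history/future are tuples of phone symbols.
    def dense_allophone_index(center, history, future, state, boundary,
                              phone_idxs, num_states=3, context_length=1):
        def phone_at(pos):
            # 0 = center, >0 = future context, <0 = history context
            if pos == 0:
                return phone_idxs[center]
            seq = future if pos > 0 else history
            k = abs(pos) - 1
            return phone_idxs[seq[k]] if k < len(seq) else 0  # 0: out of context (simplified)

        result = 0
        for i in range(2 * context_length + 1):
            pos = i // 2  # pos sequence: 0, -1, 1, [-2, 2, ...]
            if i % 2 == 1:
                pos = -pos - 1
            result = result * len(phone_idxs) + phone_at(pos)
        result = result * num_states + state
        result = result * 4 + boundary  # 4 boundary classes assumed
        return result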

classmethod from_classic_index(index, allophones, max_states=6)[source]#

Via Sprint C++ Archiver.cc:getStateInfo():

    const u32 max_states = 6;  // TODO: should be increased for non-speech
    for (state = 0; state < max_states; ++state) {
      if (emission >= allophones_.size())
        emission -= (1 << 26);
      else
        break;
    }

Parameters:
  • index (int) –

  • max_states (int) –

  • allophones (dict[int,AllophoneState]) –

Return type:

AllophoneState

class returnn.datasets.lm.Lexicon(filename)[source]#

Lexicon. Map of words to phoneme sequences (can have multiple pronunciations).

Parameters:

filename (str) –

class returnn.datasets.lm.StateTying(state_tying_file)[source]#

Clustering of (allophone) states into classes.

Parameters:

state_tying_file (str) –

class returnn.datasets.lm.PhoneSeqGenerator(lexicon_file, allo_num_states=3, allo_context_len=1, state_tying_file=None, add_silence_beginning=0.1, add_silence_between_words=0.1, add_silence_end=0.1, repetition=0.9, silence_repetition=0.95)[source]#

Generates phone sequences.

Parameters:
  • lexicon_file (str) – lexicon XML file

  • allo_num_states (int) – how many HMM states per allophone (all but silence)

  • allo_context_len (int) – how much context to store left and right. 1 -> triphone

  • state_tying_file (str | None) – for state-tying, if you want that

  • add_silence_beginning (float) – prob of adding silence at beginning

  • add_silence_between_words (float) – prob of adding silence between words

  • add_silence_end (float) – prob of adding silence at end

  • repetition (float) – prob of repeating an allophone

  • silence_repetition (float) – prob of repeating the silence allophone
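
A small usage sketch, using the methods documented below (the lexicon path is a placeholder):

    from returnn.datasets.lm import PhoneSeqGenerator

    gen = PhoneSeqGenerator(lexicon_file="/path/to/lexicon.xml.gz")
    gen.random_seed(42)
    allos = gen.generate_seq("hello world")             # list[AllophoneState], with repetitions etc.
    idxs = gen.seq_to_class_idxs(allos, dtype="int32")  # 1D numpy array of class indices
    print(gen.orth_to_phones("hello world"))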

random_seed(seed)[source]#
Parameters:

seed (int) –

get_class_labels()[source]#
Return type:

list[str]

seq_to_class_idxs(phones, dtype=None)[source]#
Parameters:
  • phones (list[AllophoneState]) – list of allophone states

  • dtype (str) – e.g. “int32”

Return type:

numpy.ndarray

Returns:

1D numpy array with the indices

orth_to_phones(orth)[source]#
Parameters:

orth (str) –

Return type:

str

generate_seq(orth)[source]#
Parameters:

orth (str) – orthography as a str. orth.split() should give words in the lexicon

Return type:

list[AllophoneState]

Returns:

allophone state list; those will have repetitions etc.

generate_garbage_seq(target_len)[source]#
Parameters:

target_len (int) – len of the returned seq

Return type:

list[AllophoneState]

Returns:

allophone state list; those will have repetitions etc. It will randomly generate a sequence of phonemes and transform that into a list of allophones in a similar way to generate_seq().

class returnn.datasets.lm.TranslationDataset(path, file_postfix, source_postfix='', target_postfix='', source_only=False, search_without_reference=False, unknown_label=None, seq_list_file=None, use_cache_manager=False, **kwargs)[source]#

Based on the conventions by our team for translation datasets. It gets a directory and expects these files:

  • source.dev(.gz)

  • source.train(.gz)

  • source.vocab.pkl

  • target.dev(.gz)

  • target.train(.gz)

  • target.vocab.pkl

The convention is to use “dev” and “train” as file_postfix for the dev and train set respectively, but any file_postfix can be used. The target file and vocabulary do not have to exist when setting source_only. It is also automatically checked whether a gzipped version of each file exists.

To follow the RETURNN conventions on data input and output, the source text is mapped to the “data” key, and the target text to the “classes” data key. Both are index sequences.

Parameters:
  • path (str) – the directory containing the files

  • file_postfix (str) – e.g. “train” or “dev”. it will then search for “source.” + postfix and “target.” + postfix.

  • random_shuffle_epoch1 (bool) – if True, will also randomly shuffle epoch 1. see self.init_seq_order().

  • source_postfix (str) – will concat this at the end of the source.

  • target_postfix (str) – will concat this at the end of the target. You might want to add some sentence-end symbol.

  • source_only (bool) – if targets are not available

  • search_without_reference (bool) –

  • unknown_label (str|dict[str,str]|None) – Label to replace out-of-vocabulary words with, e.g. “<UNK>”. If not given, will not replace unknowns but throw an error. Can also be a dict data_key -> unknown_label to configure for each data key separately (default for each key is None).

  • seq_list_file (str) – filename. line-separated list of line numbers defining fixed sequence order. multiple occurrences supported, thus allows for repeating examples while loading only once.

  • use_cache_manager (bool) – uses Util.cf() for files
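
A minimal config sketch following the file conventions above; the directory, postfix and label strings are placeholders:

    # Sketch only: expects source.train(.gz), target.train(.gz) and *.vocab.pkl in "path".
    train = {
        "class": "TranslationDataset",
        "path": "/path/to/corpus_dir",
        "file_postfix": "train",
        "target_postfix": " </S>",   # hypothetical sentence-end symbol
        "unknown_label": "<UNK>",    # hypothetical; omit to raise an error on unknown words
    }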

source_file_prefix = 'source'[source]#
target_file_prefix = 'target'[source]#
main_source_data_key = 'data'[source]#
main_target_data_key = 'classes'[source]#
have_corpus_seq_idx()[source]#
Return type:

bool

get_all_tags()[source]#
Return type:

list[str]

get_corpus_seq_idx(seq_idx)[source]#
Parameters:

seq_idx (int) –

Return type:

int

is_data_sparse(key)[source]#
Parameters:

key (str) –

Return type:

bool

get_data_dtype(key)[source]#
Parameters:

key (str) –

Return type:

str

init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]#

If random_shuffle_epoch1, for epoch 1 with “random” ordering, we leave the given order as is. Otherwise, this is mostly the default behavior.

Parameters:
  • epoch (int|None) –

  • seq_list (list[str]|None) – List of sequence tags, to set a predefined order.

  • seq_order (list[int]|None) – List of corpus sequence indices, to set a predefined order.

Return type:

bool

Returns:

whether the order changed (True is always safe to return)

supports_seq_order_sorting() bool[source]#

supports sorting

get_estimated_seq_length(seq_idx)[source]#
Parameters:

seq_idx (int) – for current epoch, not the corpus seq idx

Return type:

int

Returns:

sequence length of the main source data key (“data”), used for sequence sorting

class returnn.datasets.lm.TranslationFactorsDataset(source_factors=None, target_factors=None, factor_separator='|', **kwargs)[source]#

Extends TranslationDataset with support for translation factors, see https://workshop2016.iwslt.org/downloads/IWSLT_2016_paper_2.pdf, https://arxiv.org/abs/1910.03912.

Each word in the source and/or target corpus is represented by a tuple of tokens (“factors”). The number of factors must be the same for each word in the corpus. The format used is simply the concatenation of all factors separated by a special character (see the ‘factor_separator’ parameter).

Example: “this|u is|l example|u 1.|l” Here, the second factor indicates the casing (u for upper-case, l for lower-case).

In addition to the files expected by TranslationDataset we require a vocabulary for all factors. The input sequence will be available in the network for each factor separately via the given data key (see the ‘source_factors’ parameter).

Parameters:
  • source_factors (list[str]|None) – Data keys for the source factors (excluding first factor, which is always called ‘data’). Words in source file have to have that many factors. Also, a vocabulary “<factor_data_key>.vocab.pkl” has to exist for each factor.

  • target_factors (list[str]|None) – analogous to source_factors. Excluding first factor, which is always called ‘classes’.

  • factor_separator (str) – string to separate factors of the words. E.g. if “|”, words are expected to be of format “<factor_0>|<factor_1>|…”.

  • source_postfix (None|str) – See TranslationDataset. Note here, that we apply it to all factors.

  • target_postfix (None|str) – Same as above.
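
A corresponding config sketch; the factor key “source_casing” is a hypothetical example, for which a vocabulary source_casing.vocab.pkl would then be expected next to the other files:

    train = {
        "class": "TranslationFactorsDataset",
        "path": "/path/to/corpus_dir",
        "file_postfix": "train",
        "source_factors": ["source_casing"],  # first source factor keeps the key "data"
        "factor_separator": "|",
    }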

added_data: List[DatasetSeq][source]#
lock: RLock | None[source]#
rnd_seq_drop: Optional[Random][source]#
num_outputs: Optional[Dict[str, Tuple[int, int]]][source]#
labels: Dict[str, List[str]][source]#
class returnn.datasets.lm.ConfusionNetworkDataset(max_density=20, **kwargs)[source]#

This dataset allows for multiple (weighted) options for each word in the source sequence. In particular, it can be used to represent confusion networks. Two matrices (of dimension source length x max_density) will be provided as input to the network, one containing the word ids (“sparse_inputs”) and one containing the weights (“sparse_weights”). The matrices are read from the following input format (example):

“__ALT__ we’re|0.999659__were|0.000341148 a|0.977656__EPS|0.0223441 social|1.0 species|1.0”

Input positions are separated by a space, different word options at one position are separated by two underscores. Each word option has a weight appended to it, separated by “|”. If “__ALT__” is missing, the line is interpreted as a regular plain-text sentence; in that case, all weights are set to 1.0 and only one word option is used at each position. Epsilon arcs of confusion networks can be represented by a special token (e.g. “EPS”), which has to be added to the source vocabulary.

Via “seq_list_file” (see TranslationDataset) it is possible to give an explicit order of training examples. This can e.g. be used to repeat the confusion net part of the training data without loading it several times.

Parameters:
  • path (str) – the directory containing the files

  • file_postfix (str) – e.g. “train” or “dev”. it will then search for “source.” + postfix and “target.” + postfix.

  • random_shuffle_epoch1 (bool) – if True, will also randomly shuffle epoch 1. see self.init_seq_order().

  • source_postfix (None|str) – will concat this at the end of the source.

  • target_postfix (None|str) – will concat this at the end of the target. You might want to add some sentence-end symbol.

  • source_only (bool) – if targets are not available

  • unknown_label (str|None) – “UNK” or so. if not given, then will not replace unknowns but throw an error

  • max_density (int) – the density of the confusion network: max number of arcs per slot
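
A config sketch; the network then receives “sparse_inputs” (word ids) and “sparse_weights” (arc weights), each of shape (source length, max_density). Path and postfix are placeholders:

    train = {
        "class": "ConfusionNetworkDataset",
        "path": "/path/to/corpus_dir",
        "file_postfix": "train",
        "max_density": 20,  # max number of arcs per slot
    }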

main_source_data_key = 'sparse_inputs'[source]#
get_data_keys()[source]#
Return type:

list[str]

is_data_sparse(key)[source]#
Parameters:

key (str) –

Return type:

bool

get_data_dtype(key)[source]#
Parameters:

key (str) –

Return type:

str

get_data_shape(key)[source]#
Parameters:

key (str) –

Return type:

list[int]

added_data: List[DatasetSeq][source]#
lock: RLock | None[source]#
rnd_seq_drop: Optional[Random][source]#
num_outputs: Optional[Dict[str, Tuple[int, int]]][source]#
labels: Dict[str, List[str]][source]#
returnn.datasets.lm.expand_abbreviations(text)[source]#
Parameters:

text (str) –

Return type:

str

returnn.datasets.lm.lowercase(text)[source]#
Parameters:

text (str) –

Return type:

str

returnn.datasets.lm.lowercase_keep_special(text)[source]#
Parameters:

text (str) –

Return type:

str

returnn.datasets.lm.collapse_whitespace(text)[source]#
Parameters:

text (str) –

Return type:

str

returnn.datasets.lm.convert_to_ascii(text)[source]#
Parameters:

text (str) –

Return type:

str

returnn.datasets.lm.basic_cleaners(text)[source]#

Basic pipeline that lowercases and collapses whitespace without transliteration.

Parameters:

text (str) –

Return type:

str

returnn.datasets.lm.transliteration_cleaners(text)[source]#

Pipeline for non-English text that transliterates to ASCII.

Parameters:

text (str) –

Return type:

str

returnn.datasets.lm.english_cleaners(text)[source]#

Pipeline for English text, including number and abbreviation expansion.

Parameters:

text (str) –

Return type:

str

returnn.datasets.lm.english_cleaners_keep_special(text)[source]#

Pipeline for English text, including number and abbreviation expansion.

Parameters:

text (str) –

Return type:

str

returnn.datasets.lm.get_remove_chars(chars)[source]#
Parameters:

chars (str|list[str]) –

Return type:

(str)->str

returnn.datasets.lm.get_replace(old, new)[source]#
Parameters:
  • old (str) –

  • new (str) –

Return type:

(str)->str

returnn.datasets.lm.normalize_numbers(text, with_spacing=False)[source]#
Parameters:
  • text (str) –

  • with_spacing (bool) –

Return type:

str

returnn.datasets.lm.get_post_processor_function(opts)[source]#

You might want to use inflect or unidecode for some normalization / cleanup. This function can be used to get such functions.

Parameters:

opts (str|list[str]) – e.g. “english_cleaners”, or “get_remove_chars(‘,/’)”

Returns:

function

Return type:

(str)->str
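
A small usage sketch (this particular combination of cleaners is just an example):

    from returnn.datasets.lm import get_post_processor_function

    post_process = get_post_processor_function(["english_cleaners", "get_remove_chars(',/')"])
    print(post_process("Dr. Smith owes $3,000."))  # abbreviations/numbers expanded, lowercased, commas removed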