returnn.datasets.lm

Provides LmDataset, TranslationDataset, and some related helpers.
- class returnn.datasets.lm.LmDataset(corpus_file, *, use_cache_manager=False, skip_empty_lines=True, seq_list_file=None, orth_vocab=None, orth_symbols_file=None, orth_symbols_map_file=None, orth_replace_map_file=None, word_based=False, word_end_symbol=None, seq_end_symbol='[END]', unknown_symbol='[UNKNOWN]', parse_orth_opts=None, phone_info=None, add_random_phone_seqs=0, auto_replace_unknown_symbol=False, log_auto_replace_unknown_symbols=10, log_skipped_seqs=10, error_on_invalid_seq=True, add_delayed_seq_data=False, delayed_seq_data_start_symbol='[START]', dtype: str | None = None, **kwargs)[source]¶
Dataset useful for language modeling. It creates index sequences for either words, characters or other orthographic symbols based on a vocabulary. Can also perform internal word-to-phoneme conversion with a lexicon file. Reads simple txt files or Bliss XML files (also gzipped).
To use the LmDataset with words or characters, either orth_symbols_file or orth_symbols_map_file has to be specified (both at the same time is not possible). If words should be used, set word_based to True.
The LmDataset also supports the conversion of words to phonemes with the help of the LmDataset.PhoneSeqGenerator class. To enable this mode, the input parameters to LmDataset.PhoneSeqGenerator have to be provided as a dict in phone_info. As a lexicon file has to be specified in this dict, orth_symbols_file and orth_symbols_map_file are not used in this case.
The LmDataset does not work without providing a vocabulary in one of the ways mentioned above. A minimal configuration sketch is given after the parameter list below.
After initialization, the corpus is represented by self.orths (as a list of sequences). The vocabulary is given by self.orth_symbols, and self.orth_symbols_map gives the corresponding mapping from symbol to integer index (in case phone_info is not set).
- Parameters:
corpus_file (str|()->str|list[str]|()->list[str]) – Bliss XML or line-based txt. optionally can be gzip.
use_cache_manager (bool) – uses returnn.util.basic.cf()
skip_empty_lines (bool) – for line-based txt
seq_list_file (str|list[str]|None) – optional custom seq tags to use instead of the “line-%i” seq tags. Pickle (.pkl) or txt (line-based seq tags). Optionally gzipped (.gz).
orth_vocab (dict[str,Any]|Vocabulary)
orth_symbols_file (str|()->str|None) – a text file containing a list of orthography symbols
orth_symbols_map_file (str|()->str|None) – either a list of orth symbols, each line: “<symbol> <index>”, a python dict with {“<symbol>”: <index>, …} or a pickled dictionary
orth_replace_map_file (str|()->str|None) – JSON file with replacement dict for orth symbols.
word_based (bool) – whether to parse single words; otherwise it will be character-based.
word_end_symbol (str|None) – If provided and if word_based is False (character based modeling), token to be used to represent word ends.
seq_end_symbol (str|None) – what to add at the end, if given. will be set as postfix=[seq_end_symbol] or postfix=[] for parse_orth_opts.
unknown_symbol (str|None) – token to represent unknown words.
parse_orth_opts (dict[str,Any]|None) – kwargs for parse_orthography().
phone_info (dict|None) – A dict containing parameters, including a lexicon file, for LmDataset.PhoneSeqGenerator.
add_random_phone_seqs (int) – will add random seqs with the same length as the real seq as additional data.
log_auto_replace_unknown_symbols (bool|int) – write about auto-replacements with unknown symbol. if this is an int, it will only log the first N replacements, and then keep quiet.
log_skipped_seqs (bool|int) – write about skipped seqs to logging, due to missing lexicon entry or so. if this is an int, it will only log the first N entries, and then keep quiet.
error_on_invalid_seq (bool) – if there is a seq we would have to skip, error.
add_delayed_seq_data (bool) – will add another data-key “delayed” which will have the sequence delayed_seq_data_start_symbol + original_sequence[:-1].
delayed_seq_data_start_symbol (str) – used for add_delayed_seq_data.
dtype – explicit dtype. if not given, automatically determined based on the number of labels.
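A minimal configuration sketch for a word-based setup, as commonly written in a RETURNN config (the file names are hypothetical, and only a few of the parameters above are shown):

    train = {
        "class": "LmDataset",
        "corpus_file": "corpus.train.txt.gz",    # line-based text, optionally gzipped
        "orth_symbols_map_file": "vocab.syms",   # lines of "<symbol> <index>"
        "word_based": True,
        "seq_end_symbol": "[END]",
        "unknown_symbol": "[UNKNOWN]",
        "auto_replace_unknown_symbol": True,
    }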
- get_target_list()[source]¶
Unfortunately, the logic is swapped around for this dataset. “data” is the original data, which is usually the target, and you would use “delayed” as inputs.
- Return type:
list[str]
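The relationship between the “data” and “delayed” keys (when add_delayed_seq_data is enabled) can be illustrated as follows; this is not the dataset implementation, only the documented relationship:

    # "data" holds the original index sequence (usually the target),
    # "delayed" holds it shifted right by one, prefixed with the start symbol (usually the input).
    data = ["the", "cat", "sat", "[END]"]    # shown as symbols for readability
    delayed = ["[START]"] + data[:-1]        # -> ["[START]", "the", "cat", "sat"]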
- init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]¶
If random_shuffle_epoch1, for epoch 1 with “random” ordering, we leave the given order as is. Otherwise, this is mostly the default behavior.
- Parameters:
epoch (int|None)
seq_list (list[str]|None) – List of sequence tags, to set a predefined order.
seq_order (list[int]|None) – List of corpus sequence indices, to set a predefined order.
- Return type:
bool
- Returns:
whether the order changed (True is always safe to return)
- returnn.datasets.lm.iter_corpus(filename: str, callback: Callable[[str | bytes], None], *, skip_empty_lines: bool = True, decode: bool = True) None [source]¶
- Parameters:
filename
callback
skip_empty_lines
decode
- returnn.datasets.lm.read_corpus(filename: str, *, skip_empty_lines: bool = True, decode: bool = True, out_list: List[str] | List[bytes] | None = None) List[str] | List[bytes] [source]¶
- Parameters:
filename – either Bliss XML or line-based text
skip_empty_lines – in case of line-based text, skip empty lines
decode – if True, return str, otherwise bytes
out_list – if given, append to this list
- Returns:
out_list, list of orthographies
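A small usage sketch of both helpers (the file name is hypothetical):

    from returnn.datasets.lm import iter_corpus, read_corpus

    # Read all orthographies into a list (line-based text or Bliss XML, optionally gzipped).
    orths = read_corpus("corpus.train.txt.gz", skip_empty_lines=True)
    print(len(orths), orths[:2])

    # Or stream them one by one without keeping everything in memory.
    iter_corpus("corpus.train.txt.gz", callback=print)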
- class returnn.datasets.lm.AllophoneState(id=None, state=None)[source]¶
Represents one allophone (phone with context) state (number, boundary). In Sprint, see AllophoneStateAlphabet::index().
- Parameters:
id (str) – phone
state (int|None)
- phoneme(ctx_offset, out_of_context_id=None)[source]¶
- Phoneme::Id ContextPhonology::PhonemeInContext::phoneme(s16 pos) const {
- if (pos == 0)
return phoneme_;
- else if (pos > 0) {
- } else { verify(pos < 0);
}
}
- Parameters:
ctx_offset (int) – 0 for center, >0 for future, <0 for history
out_of_context_id (str|None) – what to return out of our context
- Returns:
phone-id from the offset
- Return type:
str
- set_phoneme(ctx_offset, phone_id)[source]¶
- Parameters:
ctx_offset (int) – 0 for center, >0 for future, <0 for history
phone_id (str)
- phone_idx(ctx_offset, phone_idxs)[source]¶
- Parameters:
ctx_offset (int) – see self.phoneme()
phone_idxs (dict[str,int])
- Return type:
int
- index(phone_idxs, num_states=3, context_length=1)[source]¶
See self.from_index() for the inverse function. And see Sprint NoStateTyingDense::classify().
- Parameters:
phone_idxs (dict[str,int])
num_states (int) – how many states per allophone
context_length (int) – how much left/right context
- Return type:
int
- classmethod from_index(index, phone_ids, num_states=3, context_length=1)[source]¶
Original Sprint C++ code:
    Mm::MixtureIndex NoStateTyingDense::classify(const AllophoneState& a) const {
        require_lt(a.allophone()->boundary, numBoundaryClasses_);
        require_le(0, a.state());
        require_lt(u32(a.state()), numStates_);
        u32 result = 0;
        for (u32 i = 0; i < 2 * contextLength_ + 1; ++i) {  // context len is usually 1
            // pos sequence: 0, -1, 1, [-2, 2, ...]
            s16 pos = i / 2;
            if (i % 2 == 1)
                pos = -pos - 1;
            result *= numPhoneClasses_;
            u32 phoneIdx = a.allophone()->phoneme(pos);
            require_lt(phoneIdx, numPhoneClasses_);
            result += phoneIdx;
        }
        result *= numStates_;
        result += u32(a.state());
        result *= numBoundaryClasses_;
        result += a.allophone()->boundary;
        require_lt(result, nClasses_);
        return result;
    }
Note that there is also AllophoneStateAlphabet::allophoneState, via Am/ClassicStateModel.cc, which unfortunately uses a different encoding. See from_classic_index().
- Parameters:
index (int)
phone_ids (dict[int,str]) – reverse-map from self.index(). idx -> id
num_states (int) – how many states per allophone
context_length (int) – how much left/right context
- Return type:
int
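Following the classify() code above, index() effectively encodes the context phones, the HMM state and the boundary as digits of a mixed-radix number. A rough Python sketch of that arithmetic (the phoneme callback and the class counts are assumptions mirroring the C++ variables above):

    def dense_allophone_index(phoneme, state, boundary,
                              num_phone_classes, num_boundary_classes,
                              num_states=3, context_length=1):
        # phoneme: callable pos -> phone index; pos 0 = center, >0 future, <0 history
        result = 0
        for i in range(2 * context_length + 1):
            # pos sequence: 0, -1, 1, [-2, 2, ...]
            pos = i // 2
            if i % 2 == 1:
                pos = -pos - 1
            result = result * num_phone_classes + phoneme(pos)
        result = result * num_states + state
        result = result * num_boundary_classes + boundary
        return result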
- classmethod from_classic_index(index, allophones, max_states=6)[source]¶
Via Sprint C++ Archiver.cc:getStateInfo():
    const u32 max_states = 6;  // TODO: should be increased for non-speech
    for (state = 0; state < max_states; ++state) {
        if (emission >= allophones_.size())
            emission -= (1 << 26);
        else
            break;
    }
- Parameters:
index (int)
max_states (int)
allophones (dict[int,AllophoneState])
- Return type:
AllophoneState
- class returnn.datasets.lm.Lexicon(filename: str)[source]¶
Lexicon. Map of words to phoneme sequences (can have multiple pronunciations).
- Parameters:
filename
- class returnn.datasets.lm.StateTying(state_tying_file: str)[source]¶
Clustering of (allophone) states into classes.
- Parameters:
state_tying_file
- class returnn.datasets.lm.PhoneSeqGenerator(*, lexicon_file: str, phoneme_vocab_file: str | None = None, allo_num_states: int = 3, allo_context_len: int = 1, state_tying_file: str | None = None, add_silence_beginning: float = 0.1, add_silence_between_words: float = 0.1, add_silence_end: float = 0.1, repetition: float = 0.9, silence_repetition: float = 0.95, silence_lemma_orth: str = '[SILENCE]', extra_begin_lemma: Dict[str, Any] | None = None, add_extra_begin_lemma: float = 1.0, extra_end_lemma: Dict[str, Any] | None = None, add_extra_end_lemma: float = 1.0)[source]¶
Generates phone sequences.
- Parameters:
lexicon_file – lexicon XML file
phoneme_vocab_file – defines the vocab, label indices. If not given, automatically inferred via all (sorted) phonemes from the lexicon.
allo_num_states – how many HMM states per allophone (all but silence)
allo_context_len – how much context to store left and right. 1 -> triphone
state_tying_file – for state-tying, if you want that
add_silence_beginning – prob of adding silence at beginning
add_silence_between_words – prob of adding silence between words
add_silence_end – prob of adding silence at end
repetition – prob of repeating an allophone
silence_repetition – prob of repeating the silence allophone
silence_lemma_orth – silence orth in the lexicon
extra_begin_lemma – {“phons”: [{“phon”: “P1 P2 …”, …}, …], …}. If given, then with prob add_extra_begin_lemma, this will be added at the beginning.
add_extra_begin_lemma – prob of adding extra_begin_lemma at the beginning.
extra_end_lemma – just like extra_begin_lemma, but for the end.
add_extra_end_lemma – prob of adding extra_end_lemma at the end.
- seq_to_class_idxs(phones: List[AllophoneState], dtype: str | None = None) ndarray [source]¶
- Parameters:
phones – list of allophone states
dtype – e.g. “int32”. Defaults to “int32”.
- Returns:
1D numpy array with the indices
- generate_seq(orth: str) List[AllophoneState] [source]¶
- Parameters:
orth – orthography as a str. orth.split() should give words in the lexicon
- Returns:
allophone state list. those will have repetitions etc
- generate_garbage_seq(target_len: int) List[AllophoneState] [source]¶
- Parameters:
target_len – len of the returned seq
- Returns:
allophone state list; those will have repetitions etc. It randomly generates a sequence of phonemes and transforms it into a list of allophones in a similar way to generate_seq().
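A short usage sketch (the lexicon file name and the orthography are hypothetical):

    from returnn.datasets.lm import PhoneSeqGenerator

    gen = PhoneSeqGenerator(lexicon_file="lexicon.xml.gz",
                            allo_num_states=3, allo_context_len=1)
    allo_states = gen.generate_seq("hello world")    # list of AllophoneState, incl. repetitions/silence
    class_idxs = gen.seq_to_class_idxs(allo_states)  # 1D numpy array of label indices
    print(len(allo_states), class_idxs[:10])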
- class returnn.datasets.lm.TranslationDataset(path, file_postfix, source_postfix='', target_postfix='', source_only=False, search_without_reference=False, unknown_label=None, seq_list_file=None, use_cache_manager=False, **kwargs)[source]¶
Based on the conventions by our team for translation datasets. It gets a directory and expects these files:
source.dev(.gz)
source.train(.gz)
source.vocab.pkl
target.dev(.gz)
target.train(.gz)
target.vocab.pkl
The convention is to use “dev” and “train” as file_postfix for the dev and train set respectively, but any file_postfix can be used. The target file and vocabulary do not have to exist when setting source_only. It is also automatically checked whether a gzipped version of the file exists.
To follow the RETURNN conventions on data input and output, the source text is mapped to the “data” key, and the target text to the “classes” data key. Both are index sequences.
- Parameters:
path (str) – the directory containing the files
file_postfix (str) – e.g. “train” or “dev”. it will then search for “source.” + postfix and “target.” + postfix.
random_shuffle_epoch1 (bool) – if True, will also randomly shuffle epoch 1. see self.init_seq_order().
source_postfix (str) – will concat this at the end of the source.
target_postfix (str) – will concat this at the end of the target. You might want to add some sentence-end symbol.
source_only (bool) – if targets are not available
search_without_reference (bool)
unknown_label (str|dict[str,str]|None) – Label to replace out-of-vocabulary words with, e.g. “<UNK>”. If not given, will not replace unknowns but throw an error. Can also be a dict data_key -> unknown_label to configure for each data key separately (default for each key is None).
seq_list_file (str) – filename. line-separated list of line numbers defining fixed sequence order. multiple occurrences supported, thus allows for repeating examples while loading only once.
use_cache_manager (bool) – uses Util.cf() for files
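A hedged configuration sketch following the file conventions above (the directory path is hypothetical):

    train = {
        "class": "TranslationDataset",
        "path": "/data/my-translation-corpus",  # contains source.train.gz, target.train.gz, *.vocab.pkl
        "file_postfix": "train",
        "source_postfix": "",
        "target_postfix": " </S>",              # e.g. append a sentence-end symbol to the target
        "unknown_label": "<UNK>",
    }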
- init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]¶
If random_shuffle_epoch1, for epoch 1 with “random” ordering, we leave the given order as is. Otherwise, this is mostly the default behavior.
- Parameters:
epoch (int|None)
seq_list (list[str]|None) – List of sequence tags, to set a predefined order.
seq_order (list[int]|None) – List of corpus sequence indices, to set a predefined order.
- Return type:
bool
- Returns:
whether the order changed (True is always safe to return)
- class returnn.datasets.lm.TranslationFactorsDataset(source_factors=None, target_factors=None, factor_separator='|', **kwargs)[source]¶
Extends TranslationDataset with support for translation factors, see https://workshop2016.iwslt.org/downloads/IWSLT_2016_paper_2.pdf, https://arxiv.org/abs/1910.03912.
Each word in the source and/or target corpus is represented by a tuple of tokens (“factors”). The number of factors must be the same for each word in the corpus. The format used is simply the concatenation of all factors separated by a special character (see the ‘factor_separator’ parameter).
Example: “this|u is|l example|u 1.|l” Here, the factor indicates the casing (u for upper-case, l for lower-case).
In addition to the files expected by TranslationDataset we require a vocabulary for all factors. The input sequence will be available in the network for each factor separately via the given data key (see the ‘source_factors’ parameter).
- Parameters:
source_factors (list[str]|None) – Data keys for the source factors (excluding first factor, which is always called ‘data’). Words in source file have to have that many factors. Also, a vocabulary “<factor_data_key>.vocab.pkl” has to exist for each factor.
target_factors (list[str]|None) – analogous to source_factors. Excluding first factor, which is always called ‘classes’.
factor_separator (str) – string to separate factors of the words. E.g. if “|”, words are expected to be of format “<factor_0>|<factor_1>|…”.
source_postfix (None|str) – See TranslationDataset. Note here, that we apply it to all factors.
target_postfix (None|str) – Same as above.
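To make the factor format concrete, this is how the example line above splits into per-factor token sequences (purely illustrative; the “casing” data key name is a hypothetical choice for source_factors/target_factors):

    line = "this|u is|l example|u 1.|l"
    words = [w.split("|") for w in line.split()]    # factor_separator = "|"
    data_tokens = [w[0] for w in words]             # -> ['this', 'is', 'example', '1.']  (key "data")
    casing_tokens = [w[1] for w in words]           # -> ['u', 'l', 'u', 'l']  (e.g. key "casing")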
- class returnn.datasets.lm.ConfusionNetworkDataset(max_density=20, **kwargs)[source]¶
This dataset allows for multiple (weighted) options for each word in the source sequence. In particular, it can be used to represent confusion networks. Two matrices (of dimension source length x max_density) will be provided as input to the network, one containing the word ids (“sparse_inputs”) and one containing the weights (“sparse_weights”). The matrices are read from the following input format (example):
“__ALT__ we’re|0.999659__were|0.000341148 a|0.977656__EPS|0.0223441 social|1.0 species|1.0”
Input positions are separated by a space, and different word options at one position are separated by two underscores. Each word option has a weight appended to it, separated by “|”. If “__ALT__” is missing, the line is interpreted as a regular plain-text sentence; in that case, all weights are set to 1.0 and only one word option is used at each position. Epsilon arcs of confusion networks can be represented by a special token (e.g. “EPS”), which has to be added to the source vocabulary.
Via “seq_list_file” (see TranslationDataset) it is possible to give an explicit order of training examples. This can e.g. be used to repeat the confusion net part of the training data without loading it several times.
- Parameters:
path (str) – the directory containing the files
file_postfix (str) – e.g. “train” or “dev”. it will then search for “source.” + postfix and “target.” + postfix.
random_shuffle_epoch1 (bool) – if True, will also randomly shuffle epoch 1. see self.init_seq_order().
source_postfix (None|str) – will concat this at the end of the source. e.g.
target_postfix (None|str) – will concat this at the end of the target. You might want to add some sentence-end symbol.
source_only (bool) – if targets are not available
unknown_label (str|None) – “UNK” or so. if not given, then will not replace unknowns but throw an error
max_density (int) – the density of the confusion network: max number of arcs per slot
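The format above can be made concrete with a small parsing sketch (purely illustrative, not the dataset implementation):

    line = "__ALT__ we're|0.999659__were|0.000341148 a|0.977656__EPS|0.0223441 social|1.0 species|1.0"
    assert line.startswith("__ALT__")
    slots = []
    for position in line.split()[1:]:          # positions are separated by spaces; skip the "__ALT__" marker
        options = []
        for option in position.split("__"):    # word options at one position are separated by "__"
            word, weight = option.split("|")   # each option carries its weight after "|"
            options.append((word, float(weight)))
        slots.append(options)
    # slots[0] == [("we're", 0.999659), ("were", 0.000341148)]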
- returnn.datasets.lm.basic_cleaners(text)[source]¶
Basic pipeline that lowercases and collapses whitespace without transliteration.
- Parameters:
text (str)
- Return type:
str
- returnn.datasets.lm.transliteration_cleaners(text)[source]¶
Pipeline for non-English text that transliterates to ASCII.
- Parameters:
text (str)
- Return type:
str
- returnn.datasets.lm.english_cleaners(text)[source]¶
Pipeline for English text, including number and abbreviation expansion.
- Parameters:
text (str)
- Return type:
str
- returnn.datasets.lm.english_cleaners_keep_special(text)[source]¶
Pipeline for English text, including number and abbreviation expansion.
- Parameters:
text (str)
- Return type:
str
- returnn.datasets.lm.get_remove_chars(chars)[source]¶
- Parameters:
chars (str|list[str])
- Return type:
(str)->str
- returnn.datasets.lm.get_replace(old, new)[source]¶
- Parameters:
old (str)
new (str)
- Return type:
(str)->str
- returnn.datasets.lm.normalize_numbers(text, with_spacing=False)[source]¶
- Parameters:
text (str)
with_spacing (bool)
- Return type:
str
- returnn.datasets.lm.get_post_processor_function(opts)[source]¶
You might want to use inflect or unidecode for some normalization / cleanup. This function can be used to get such functions.
- Parameters:
opts (str|list[str]) – e.g. “english_cleaners”, or “get_remove_chars(‘,/’)”
- Returns:
function
- Return type:
(str)->str
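A short usage sketch (english_cleaners may require the inflect / unidecode packages mentioned above):

    from returnn.datasets.lm import get_post_processor_function

    clean = get_post_processor_function("english_cleaners")
    print(clean("Dr. Smith bought 2 apples."))

    remove_chars = get_post_processor_function("get_remove_chars(',/')")
    print(remove_chars("hello, world / test"))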