returnn.datasets.lm

Provides LmDataset, TranslationDataset, and some related helpers.
- class returnn.datasets.lm.LmDataset(corpus_file, *, use_cache_manager=False, skip_empty_lines=True, seq_list_file=None, orth_vocab=None, orth_symbols_file=None, orth_symbols_map_file=None, orth_replace_map_file=None, word_based=False, word_end_symbol=None, seq_end_symbol='[END]', unknown_symbol='[UNKNOWN]', parse_orth_opts=None, phone_info=None, add_random_phone_seqs=0, auto_replace_unknown_symbol=False, log_auto_replace_unknown_symbols=10, log_skipped_seqs=10, error_on_invalid_seq=True, add_delayed_seq_data=False, delayed_seq_data_start_symbol='[START]', dtype: str | None = None, **kwargs)[source]¶
Dataset useful for language modeling. It creates index sequences for either words, characters or other orthographic symbols based on a vocabulary. Can also perform internal word-to-phoneme conversion with a lexicon file. Reads simple txt files or Bliss XML files (also gzipped).
To use the LmDataset with words or characters, either orth_symbols_file or orth_symbols_map_file has to be specified (both at the same time is not possible). If words should be used, set word_based to True.
The LmDataset also supports the conversion of words to phonemes with the help of the LmDataset.PhoneSeqGenerator class. To enable this mode, the input parameters to LmDataset.PhoneSeqGenerator have to be provided as a dict in phone_info. As a lexicon file has to be specified in this dict, orth_symbols_file and orth_symbols_map_file are not used in this case.
The LmDataset does not work without providing a vocabulary in one of the ways mentioned above. A minimal configuration sketch is given after the parameter list below.
After initialization, the corpus is represented by self.orths (as a list of sequences). The vocabulary is given by self.orth_symbols, and self.orth_symbols_map gives the corresponding mapping from symbol to integer index (in case phone_info is not set).
- Parameters:
corpus_file (str|()->str|list[str]|()->list[str]) – Bliss XML or line-based txt. optionally can be gzip.
use_cache_manager (bool) – uses returnn.util.basic.cf()
skip_empty_lines (bool) – for line-based txt
seq_list_file (str|list[str]|None) – optional custom seq tags to use instead of the “line-%i” seq tags. Pickle (.pkl) or txt (line-based seq tags). Optionally gzipped (.gz).
orth_vocab (dict[str,Any]|Vocabulary)
orth_symbols_file (str|()->str|None) – a text file containing a list of orthography symbols
orth_symbols_map_file (str|()->str|None) – either a list of orth symbols, each line: “<symbol> <index>”, a python dict with {“<symbol>”: <index>, …} or a pickled dictionary
orth_replace_map_file (str|()->str|None) – JSON file with replacement dict for orth symbols.
word_based (bool) – whether to parse single words; otherwise it will be character-based.
word_end_symbol (str|None) – If provided and if word_based is False (character based modeling), token to be used to represent word ends.
seq_end_symbol (str|None) – what to add at the end, if given. will be set as postfix=[seq_end_symbol] or postfix=[] for parse_orth_opts.
unknown_symbol (str|None) – token to represent unknown words.
parse_orth_opts (dict[str,Any]|None) – kwargs for parse_orthography().
phone_info (dict|None) – A dict containing parameters, including a lexicon file, for LmDataset.PhoneSeqGenerator.
add_random_phone_seqs (int) – will add random seqs with the same length as the real seq as additional data.
log_auto_replace_unknown_symbols (bool|int) – write about auto-replacements with unknown symbol. if this is an int, it will only log the first N replacements, and then keep quiet.
log_skipped_seqs (bool|int) – write about skipped seqs to logging, due to missing lexicon entry or so. if this is an int, it will only log the first N entries, and then keep quiet.
error_on_invalid_seq (bool) – if there is a seq we would have to skip, error.
add_delayed_seq_data (bool) – will add another data-key “delayed” which will have the sequence delayed_seq_data_start_symbol + original_sequence[:-1].
delayed_seq_data_start_symbol (str) – used for add_delayed_seq_data.
dtype – explicit dtype. if not given, automatically determined based on the number of labels.
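A minimal configuration sketch for a word-based setup, as commonly written in a RETURNN config (the file names are hypothetical, and only a few of the parameters above are shown):

    train = {
        "class": "LmDataset",
        "corpus_file": "corpus.train.txt.gz",    # line-based text, optionally gzipped
        "orth_symbols_map_file": "vocab.syms",   # lines of "<symbol> <index>"
        "word_based": True,
        "seq_end_symbol": "[END]",
        "unknown_symbol": "[UNKNOWN]",
        "auto_replace_unknown_symbol": True,
    }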
- get_target_list()[source]¶
Unfortunately, the logic is swapped around for this dataset. “data” is the original data, which is usually the target, and you would use “delayed” as inputs.
- Return type:
list[str]
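The relationship between the “data” and “delayed” keys (when add_delayed_seq_data is enabled) can be illustrated as follows; this is not the dataset implementation, only the documented relationship:

    # "data" holds the original index sequence (usually the target),
    # "delayed" holds it shifted right by one, prefixed with the start symbol (usually the input).
    data = ["the", "cat", "sat", "[END]"]    # shown as symbols for readability
    delayed = ["[START]"] + data[:-1]        # -> ["[START]", "the", "cat", "sat"]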
- init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]¶
If random_shuffle_epoch1, for epoch 1 with “random” ordering, we leave the given order as is. Otherwise, this is mostly the default behavior.
- Parameters:
epoch (int|None)
seq_list (list[str]|None) – List of sequence tags, to set a predefined order.
seq_order (list[int]|None) – List of corpus sequence indices, to set a predefined order.
- Return type:
bool
- Returns:
whether the order changed (True is always safe to return)
- returnn.datasets.lm.iter_corpus(filename: str, callback: Callable[[str | bytes], None], *, skip_empty_lines: bool = True, decode: bool = True) None [source]¶
- Parameters:
filename
callback
skip_empty_lines
decode
- returnn.datasets.lm.read_corpus(filename: str, *, skip_empty_lines: bool = True, decode: bool = True, out_list: List[str] | List[bytes] | None = None) List[str] | List[bytes] [source]¶
- Parameters:
filename – either Bliss XML or line-based text
skip_empty_lines – in case of line-based text, skip empty lines
decode – if True, return str, otherwise bytes
out_list – if given, append to this list
- Returns:
out_list, list of orthographies
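A small usage sketch of both helpers (the file name is hypothetical):

    from returnn.datasets.lm import iter_corpus, read_corpus

    # Read all orthographies into a list (line-based text or Bliss XML, optionally gzipped).
    orths = read_corpus("corpus.train.txt.gz", skip_empty_lines=True)
    print(len(orths), orths[:2])

    # Or stream them one by one without keeping everything in memory.
    iter_corpus("corpus.train.txt.gz", callback=print)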
- class returnn.datasets.lm.AllophoneState(id=None, state=None)[source]¶
Represents one allophone (phone with context) state (number, boundary). In Sprint, see AllophoneStateAlphabet::index().
- Parameters:
id (str) – phone
state (int|None)
- phoneme(ctx_offset, out_of_context_id=None)[source]¶
- Phoneme::Id ContextPhonology::PhonemeInContext::phoneme(s16 pos) const {
- if (pos == 0)
return phoneme_;
- else if (pos > 0) {
- } else { verify(pos < 0);
}
}
- Parameters:
ctx_offset (int) – 0 for center, >0 for future, <0 for history
out_of_context_id (str|None) – what to return out of our context
- Returns:
phone-id from the offset
- Return type:
str
- set_phoneme(ctx_offset, phone_id)[source]¶
- Parameters:
ctx_offset (int) – 0 for center, >0 for future, <0 for history
phone_id (str)
- phone_idx(ctx_offset, phone_idxs)[source]¶
- Parameters:
ctx_offset (int) – see self.phoneme()
phone_idxs (dict[str,int])
- Return type:
int
- index(phone_idxs, num_states=3, context_length=1)[source]¶
See self.from_index() for the inverse function. And see Sprint NoStateTyingDense::classify().
- Parameters:
phone_idxs (dict[str,int])
num_states (int) – how many states per allophone
context_length (int) – how much left/right context
- Return type:
int
- classmethod from_index(index, phone_ids, num_states=3, context_length=1)[source]¶
Original Sprint C++ code:
    Mm::MixtureIndex NoStateTyingDense::classify(const AllophoneState& a) const {
        require_lt(a.allophone()->boundary, numBoundaryClasses_);
        require_le(0, a.state());
        require_lt(u32(a.state()), numStates_);
        u32 result = 0;
        for (u32 i = 0; i < 2 * contextLength_ + 1; ++i) {  // context len is usually 1
            // pos sequence: 0, -1, 1, [-2, 2, ...]
            s16 pos = i / 2;
            if (i % 2 == 1)
                pos = -pos - 1;
            result *= numPhoneClasses_;
            u32 phoneIdx = a.allophone()->phoneme(pos);
            require_lt(phoneIdx, numPhoneClasses_);
            result += phoneIdx;
        }
        result *= numStates_;
        result += u32(a.state());
        result *= numBoundaryClasses_;
        result += a.allophone()->boundary;
        require_lt(result, nClasses_);
        return result;
    }
Note that there is also AllophoneStateAlphabet::allophoneState, via Am/ClassicStateModel.cc, which unfortunately uses a different encoding. See from_classic_index().
- Parameters:
index (int)
phone_ids (dict[int,str]) – reverse-map from self.index(). idx -> id
num_states (int) – how many states per allophone
context_length (int) – how much left/right context
- Return type:
int
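Following the classify() code above, index() effectively encodes the context phones, the HMM state and the boundary as digits of a mixed-radix number. A rough Python sketch of that arithmetic (the phoneme callback and the class counts are assumptions mirroring the C++ variables above):

    def dense_allophone_index(phoneme, state, boundary,
                              num_phone_classes, num_boundary_classes,
                              num_states=3, context_length=1):
        # phoneme: callable pos -> phone index; pos 0 = center, >0 future, <0 history
        result = 0
        for i in range(2 * context_length + 1):
            # pos sequence: 0, -1, 1, [-2, 2, ...]
            pos = i // 2
            if i % 2 == 1:
                pos = -pos - 1
            result = result * num_phone_classes + phoneme(pos)
        result = result * num_states + state
        result = result * num_boundary_classes + boundary
        return result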
- classmethod from_classic_index(index, allophones, max_states=6)[source]¶
Via Sprint C++ Archiver.cc:getStateInfo():
    const u32 max_states = 6;  // TODO: should be increased for non-speech
    for (state = 0; state < max_states; ++state) {
        if (emission >= allophones_.size())
            emission -= (1 << 26);
        else
            break;
    }
- Parameters:
index (int)
max_states (int)
allophones (dict[int,AllophoneState])
- Return type:
AllophoneState
- class returnn.datasets.lm.Lexicon(filename: str)[source]¶
Lexicon. Map of words to phoneme sequences (can have multiple pronunciations).
- Parameters:
filename
- class returnn.datasets.lm.StateTying(state_tying_file: str)[source]¶
Clustering of (allophone) states into classes.
- Parameters:
state_tying_file
- class returnn.datasets.lm.PhoneSeqGenerator(*, lexicon_file: str, phoneme_vocab_file: str | None = None, allo_num_states: int = 3, allo_context_len: int = 1, state_tying_file: str | None = None, add_silence_beginning: float = 0.1, add_silence_between_words: float = 0.1, add_silence_end: float = 0.1, repetition: float = 0.9, silence_repetition: float = 0.95, silence_lemma_orth: str = '[SILENCE]', extra_begin_lemma: Dict[str, Any] | None = None, add_extra_begin_lemma: float = 1.0, extra_end_lemma: Dict[str, Any] | None = None, add_extra_end_lemma: float = 1.0)[source]¶
Generates phone sequences.
- Parameters:
lexicon_file – lexicon XML file
phoneme_vocab_file – defines the vocab, label indices. If not given, automatically inferred via all (sorted) phonemes from the lexicon.
allo_num_states – how many HMM states per allophone (all but silence)
allo_context_len – how much context to store left and right. 1 -> triphone
state_tying_file – for state-tying, if you want that
add_silence_beginning – prob of adding silence at beginning
add_silence_between_words – prob of adding silence between words
add_silence_end – prob of adding silence at end
repetition – prob of repeating an allophone
silence_repetition – prob of repeating the silence allophone
silence_lemma_orth – silence orth in the lexicon
extra_begin_lemma – {“phons”: [{“phon”: “P1 P2 …”, …}, …], …}. If given, then with prob add_extra_begin_lemma, this will be added at the beginning.
add_extra_begin_lemma – prob of adding extra_begin_lemma at the beginning.
extra_end_lemma – just like extra_begin_lemma, but for the end.
add_extra_end_lemma – prob of adding extra_end_lemma at the end.
- seq_to_class_idxs(phones: List[AllophoneState], dtype: str | None = None) ndarray [source]¶
- Parameters:
phones – list of allophone states
dtype – e.g. “int32”. Defaults to “int32”.
- Returns:
1D numpy array with the indices
- generate_seq(orth: str) List[AllophoneState] [source]¶
- Parameters:
orth – orthography as a str. orth.split() should give words in the lexicon
- Returns:
allophone state list. those will have repetitions etc
- generate_garbage_seq(target_len: int) List[AllophoneState] [source]¶
- Parameters:
target_len – len of the returned seq
- Returns:
allophone state list; those will have repetitions etc. It randomly generates a sequence of phonemes and transforms it into a list of allophones in a similar way to generate_seq().
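A short usage sketch (the lexicon file name and the orthography are hypothetical):

    from returnn.datasets.lm import PhoneSeqGenerator

    gen = PhoneSeqGenerator(lexicon_file="lexicon.xml.gz",
                            allo_num_states=3, allo_context_len=1)
    allo_states = gen.generate_seq("hello world")    # list of AllophoneState, incl. repetitions/silence
    class_idxs = gen.seq_to_class_idxs(allo_states)  # 1D numpy array of label indices
    print(len(allo_states), class_idxs[:10])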
- class returnn.datasets.lm.TranslationDataset(path, file_postfix, source_postfix='', target_postfix='', source_only=False, search_without_reference=False, unknown_label=None, seq_list_file=None, use_cache_manager=False, **kwargs)[source]¶
Based on the conventions by our team for translation datasets. It gets a directory and expects these files:
source.dev(.gz)
source.train(.gz)
source.vocab.pkl
target.dev(.gz)
target.train(.gz)
target.vocab.pkl
The convention is to use “dev” and “train” as file_postfix for the dev and train set respectively, but any file_postfix can be used. The target file and vocabulary do not have to exist when setting source_only. It is also automatically checked whether a gzipped version of the file exists.
To follow the RETURNN conventions on data input and output, the source text is mapped to the “data” key, and the target text to the “classes” data key. Both are index sequences.
- Parameters:
path (str) – the directory containing the files
file_postfix (str) – e.g. “train” or “dev”. it will then search for “source.” + postfix and “target.” + postfix.
random_shuffle_epoch1 (bool) – if True, will also randomly shuffle epoch 1. see self.init_seq_order().
source_postfix (str) – will concat this at the end of the source.
target_postfix (str) – will concat this at the end of the target. You might want to add some sentence-end symbol.
source_only (bool) – if targets are not available
search_without_reference (bool)
unknown_label (str|dict[str,str]|None) – Label to replace out-of-vocabulary words with, e.g. “<UNK>”. If not given, will not replace unknowns but throw an error. Can also be a dict data_key -> unknown_label to configure for each data key separately (default for each key is None).
seq_list_file (str) – filename. line-separated list of line numbers defining fixed sequence order. multiple occurrences supported, thus allows for repeating examples while loading only once.
use_cache_manager (bool) – uses Util.cf() for files
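A hedged configuration sketch following the file conventions above (the directory path is hypothetical):

    train = {
        "class": "TranslationDataset",
        "path": "/data/my-translation-corpus",  # contains source.train.gz, target.train.gz, *.vocab.pkl
        "file_postfix": "train",
        "source_postfix": "",
        "target_postfix": " </S>",              # e.g. append a sentence-end symbol to the target
        "unknown_label": "<UNK>",
    }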
- init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]¶
If random_shuffle_epoch1, for epoch 1 with “random” ordering, we leave the given order as is. Otherwise, this is mostly the default behavior.
- Parameters:
epoch (int|None)
seq_list (list[str]|None) – List of sequence tags, to set a predefined order.
seq_order (list[int]|None) – List of corpus sequence indices, to set a predefined order.
- Return type:
bool
- Returns:
whether the order changed (True is always safe to return)
- class returnn.datasets.lm.TranslationFactorsDataset(source_factors=None, target_factors=None, factor_separator='|', **kwargs)[source]¶
Extends TranslationDataset with support for translation factors, see https://workshop2016.iwslt.org/downloads/IWSLT_2016_paper_2.pdf, https://arxiv.org/abs/1910.03912.
Each word in the source and/or target corpus is represented by a tuple of tokens (“factors”). The number of factors must be the same for each word in the corpus. The format used is simply the concatenation of all factors separated by a special character (see the ‘factor_separator’ parameter).
Example: “this|u is|l example|u 1.|l” Here, the factor indicates the casing (u for upper-case, l for lower-case).
In addition to the files expected by TranslationDataset we require a vocabulary for all factors. The input sequence will be available in the network for each factor separately via the given data key (see the ‘source_factors’ parameter).
- Parameters:
source_factors (list[str]|None) – Data keys for the source factors (excluding first factor, which is always called ‘data’). Words in source file have to have that many factors. Also, a vocabulary “<factor_data_key>.vocab.pkl” has to exist for each factor.
target_factors (list[str]|None) – analogous to source_factors. Excluding first factor, which is always called ‘classes’.
factor_separator (str) – string to separate factors of the words. E.g. if “|”, words are expected to be of format “<factor_0>|<factor_1>|…”.
source_postfix (None|str) – See TranslationDataset. Note here, that we apply it to all factors.
target_postfix (None|str) – Same as above.
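To make the factor format concrete, this is how the example line above splits into per-factor token sequences (purely illustrative; the “casing” data key name is a hypothetical choice for source_factors/target_factors):

    line = "this|u is|l example|u 1.|l"
    words = [w.split("|") for w in line.split()]    # factor_separator = "|"
    data_tokens = [w[0] for w in words]             # -> ['this', 'is', 'example', '1.']  (key "data")
    casing_tokens = [w[1] for w in words]           # -> ['u', 'l', 'u', 'l']  (e.g. key "casing")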
- class returnn.datasets.lm.ConfusionNetworkDataset(max_density=20, **kwargs)[source]¶
This dataset allows for multiple (weighted) options for each word in the source sequence. In particular, it can be used to represent confusion networks. Two matrices (of dimension source length x max_density) will be provided as input to the network, one containing the word ids (“sparse_inputs”) and one containing the weights (“sparse_weights”). The matrices are read from the following input format (example):
“__ALT__ we’re|0.999659__were|0.000341148 a|0.977656__EPS|0.0223441 social|1.0 species|1.0”
Input positions are separated by a space, and different word options at one position are separated by two underscores. Each word option has a weight appended to it, separated by “|”. If “__ALT__” is missing, the line is interpreted as a regular plain-text sentence; in that case, all weights are set to 1.0 and only one word option is used at each position. Epsilon arcs of confusion networks can be represented by a special token (e.g. “EPS”), which has to be added to the source vocabulary.
Via “seq_list_file” (see TranslationDataset) it is possible to give an explicit order of training examples. This can e.g. be used to repeat the confusion net part of the training data without loading it several times.
- Parameters:
path (str) – the directory containing the files
file_postfix (str) – e.g. “train” or “dev”. it will then search for “source.” + postfix and “target.” + postfix.
random_shuffle_epoch1 (bool) – if True, will also randomly shuffle epoch 1. see self.init_seq_order().
source_postfix (None|str) – will concat this at the end of the source. e.g.
target_postfix (None|str) – will concat this at the end of the target. You might want to add some sentence-end symbol.
source_only (bool) – if targets are not available
unknown_label (str|None) – “UNK” or so. if not given, then will not replace unknowns but throw an error
max_density (int) – the density of the confusion network: max number of arcs per slot
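The format above can be made concrete with a small parsing sketch (purely illustrative, not the dataset implementation):

    line = "__ALT__ we're|0.999659__were|0.000341148 a|0.977656__EPS|0.0223441 social|1.0 species|1.0"
    assert line.startswith("__ALT__")
    slots = []
    for position in line.split()[1:]:          # positions are separated by spaces; skip the "__ALT__" marker
        options = []
        for option in position.split("__"):    # word options at one position are separated by "__"
            word, weight = option.split("|")   # each option carries its weight after "|"
            options.append((word, float(weight)))
        slots.append(options)
    # slots[0] == [("we're", 0.999659), ("were", 0.000341148)]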
- returnn.datasets.lm.basic_cleaners(text)[source]¶
Basic pipeline that lowercases and collapses whitespace without transliteration.
- Parameters:
text (str)
- Return type:
str
- returnn.datasets.lm.transliteration_cleaners(text)[source]¶
Pipeline for non-English text that transliterates to ASCII.
- Parameters:
text (str)
- Return type:
str
- returnn.datasets.lm.english_cleaners(text)[source]¶
Pipeline for English text, including number and abbreviation expansion.
- Parameters:
text (str)
- Return type:
str
- returnn.datasets.lm.english_cleaners_keep_special(text)[source]¶
Pipeline for English text, including number and abbreviation expansion.
- Parameters:
text (str)
- Return type:
str
- returnn.datasets.lm.get_remove_chars(chars)[source]¶
- Parameters:
chars (str|list[str])
- Return type:
(str)->str
- returnn.datasets.lm.get_replace(old, new)[source]¶
- Parameters:
old (str)
new (str)
- Return type:
(str)->str
- returnn.datasets.lm.normalize_numbers(text, with_spacing=False)[source]¶
- Parameters:
text (str)
with_spacing (bool)
- Return type:
str
- returnn.datasets.lm.get_post_processor_function(opts)[source]¶
You might want to use inflect or unidecode for some normalization / cleanup. This function can be used to get such functions.
- Parameters:
opts (str|list[str]) – e.g. “english_cleaners”, or “get_remove_chars(‘,/’)”
- Returns:
function
- Return type:
(str)->str
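A short usage sketch (english_cleaners may require the inflect / unidecode packages mentioned above):

    from returnn.datasets.lm import get_post_processor_function

    clean = get_post_processor_function("english_cleaners")
    print(clean("Dr. Smith bought 2 apples."))

    remove_chars = get_post_processor_function("get_remove_chars(',/')")
    print(remove_chars("hello, world / test"))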