LmDataset

Provides LmDataset, TranslationDataset, and some related helpers.

class LmDataset.LmDataset(corpus_file, orth_symbols_file=None, orth_symbols_map_file=None, orth_replace_map_file=None, word_based=False, word_end_symbol=None, seq_end_symbol='[END]', unknown_symbol='[UNKNOWN]', parse_orth_opts=None, phone_info=None, add_random_phone_seqs=0, auto_replace_unknown_symbol=False, log_auto_replace_unknown_symbols=10, log_skipped_seqs=10, error_on_invalid_seq=True, add_delayed_seq_data=False, delayed_seq_data_start_symbol='[START]', **kwargs)[source]

Dataset useful for language modeling. Reads simple txt files.

After initialization, the corpus is represented by self.orths (as a list of sequences). The vocabulary is given by self.orth_symbols and self.orth_symbols_map gives the corresponding mapping from symbol to integer index.

Parameters:
  • corpus_file (str|()->str|list[str]|()->list[str]) – Bliss XML or line-based txt file; optionally gzipped.
  • phone_info (dict|None) – if you want to get phone seqs, dict with lexicon_file etc. see PhoneSeqGenerator.
  • orth_symbols_file (str|()->str|None) – list of orthography symbols, if you want to get orth symbol seqs.
  • orth_symbols_map_file (str|()->str|None) – mapping of orth symbols to indices; each line: “symbol index”.
  • orth_replace_map_file (str|()->str|None) – JSON file with replacement dict for orth symbols.
  • word_based (bool) – whether to parse single words; otherwise, parsing is character-based.
  • word_end_symbol (str|None) – If provided and if word_based is False (character based modeling), token to be used to represent word ends.
  • seq_end_symbol (str|None) – what to add at the end, if given. will be set as postfix=[seq_end_symbol] or postfix=[] for parse_orth_opts.
  • unknown_symbol (str|None) – token to represent unknown words.
  • parse_orth_opts (dict[str]|None) – kwargs for parse_orthography().
  • add_random_phone_seqs (int) – if >0, adds this many random seqs (with the same length as the real seq) as additional data.
  • log_auto_replace_unknown_symbols (bool|int) – write about auto-replacements with the unknown symbol to the log. If this is an int, only the first N replacements are logged; after that it keeps quiet.
  • log_skipped_seqs (bool|int) – write about skipped seqs (due to a missing lexicon entry or similar) to the log. If this is an int, only the first N entries are logged; after that it keeps quiet.
  • error_on_invalid_seq (bool) – if there is a seq we would otherwise have to skip, raise an error instead.
  • add_delayed_seq_data (bool) – will add another data-key “delayed” which contains the sequence delayed_seq_data_start_symbol + original_sequence[:-1].
  • delayed_seq_data_start_symbol (str) – used for add_delayed_seq_data.
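
A minimal usage sketch (the file names are hypothetical; assumes a word-based plain-text corpus and direct Python instantiation):

  from LmDataset import LmDataset

  # "corpus.txt.gz" and "vocab.txt" are hypothetical file names.
  dataset = LmDataset(
      corpus_file="corpus.txt.gz",         # one sentence per line, gzipped txt
      orth_symbols_map_file="vocab.txt",   # each line: "symbol index"
      word_based=True,
      seq_end_symbol="[END]",
      unknown_symbol="[UNKNOWN]",
      auto_replace_unknown_symbol=True)
  dataset.init_seq_order(epoch=1)
  # dataset.orths now holds the corpus; dataset.orth_symbols is the vocabulary.
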
get_data_keys(self)[source]
Return type:list[str]
get_target_list(self)[source]

Unfortunately, the logic is swapped around for this dataset. “data” is the original data, which is usually the target, and you would use “delayed” as inputs.

Return type:list[str]
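
For illustration, the relationship between the two streams when add_delayed_seq_data is enabled (a sketch of the relationship, not the actual implementation):

  # Sketch: "delayed" is "data" shifted right by one position.
  data = ["hello", "world", "[END]"]
  delayed = ["[START]"] + data[:-1]  # delayed_seq_data_start_symbol + original_sequence[:-1]
  # Use "delayed" as network input and "data" as target.
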
get_data_dtype(self, key)[source]
Parameters:key (str) –
Return type:str
init_seq_order(self, epoch=None, seq_list=None)[source]

If random_shuffle_epoch1, for epoch 1 with “random” ordering, we leave the given order as is. Otherwise, this is mostly the default behavior.

Parameters:
  • epoch (int|None) –
  • seq_list (list[str]|None) – In case we want to set a predefined order.
Returns:whether the order changed (True is always safe to return)
Return type:bool

LmDataset.iter_corpus(filename, callback)[source]
Parameters:
  • filename (str) –
  • callback ((str)->None) –
LmDataset.read_corpus(filename)[source]
Parameters:filename (str) –
Returns:list of orthographies
Return type:list[str]
class LmDataset.AllophoneState(id=None, state=None)[source]

Represents one allophone (phone with context) state (number, boundary). In Sprint, see AllophoneStateAlphabet::index().

Parameters:
  • id (str) – phone
  • state (int|None) –
context_history = ()[source]
context_future = ()[source]
boundary = 0[source]
id = None[source]
state = None[source]
format(self)[source]
Return type:str
copy(self)[source]
Return type:AllophoneState
mark_initial(self)[source]

Add the “initial” flag to self.boundary.

mark_final(self)[source]

Add the “final” flag to self.boundary.

phoneme(self, ctx_offset, out_of_context_id=None)[source]
Original Sprint C++ code:

  Phoneme::Id ContextPhonology::PhonemeInContext::phoneme(s16 pos) const {
    if (pos == 0)
      return phoneme_;
    else if (pos > 0) {
      if (u16(pos - 1) < context_.future.length())
        return context_.future[pos - 1];
      else
        return Phoneme::term;
    } else {
      verify(pos < 0);
      if (u16(-1 - pos) < context_.history.length())
        return context_.history[-1 - pos];
      else
        return Phoneme::term;
    }
  }

Parameters:
  • ctx_offset (int) – 0 for center, >0 for future, <0 for history
  • out_of_context_id (str|None) – what to return if the offset is out of our context
Returns:phone-id from the offset
Return type:str
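
A Python sketch mirroring the C++ lookup above (illustrative only; it assumes context_history is ordered nearest-first, as in the C++ snippet):

  def phoneme_sketch(allo, ctx_offset, out_of_context_id=None):
      # 0 -> center phone, >0 -> future context, <0 -> history context.
      if ctx_offset == 0:
          return allo.id
      if ctx_offset > 0:
          if ctx_offset - 1 < len(allo.context_future):
              return allo.context_future[ctx_offset - 1]
          return out_of_context_id
      if -1 - ctx_offset < len(allo.context_history):
          return allo.context_history[-1 - ctx_offset]
      return out_of_context_id
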

set_phoneme(self, ctx_offset, phone_id)[source]
Parameters:
  • ctx_offset (int) – 0 for center, >0 for future, <0 for history
  • phone_id (str) –
phone_idx(self, ctx_offset, phone_idxs)[source]
Parameters:
  • ctx_offset (int) – see self.phoneme()
  • phone_idxs (dict[str,int]) –
Return type:int

index(self, phone_idxs, num_states=3, context_length=1)[source]

See self.from_index() for the inverse function. And see Sprint NoStateTyingDense::classify().

Parameters:
  • phone_idxs (dict[str,int]) –
  • num_states (int) – how many states per allophone
  • context_length (int) – how much left/right context
Return type:int
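
A sketch of the dense index computation, following the NoStateTyingDense::classify() logic quoted under from_index() below (num_boundary_classes is an assumption here; the real constants come from Sprint):

  def dense_index_sketch(allo, phone_idxs, num_states=3, context_length=1,
                         num_boundary_classes=4):  # 4 is an assumed value
      num_phone_classes = len(phone_idxs)
      result = 0
      for i in range(2 * context_length + 1):
          # pos sequence: 0, -1, 1, [-2, 2, ...]
          pos = i // 2
          if i % 2 == 1:
              pos = -pos - 1
          result = result * num_phone_classes + allo.phone_idx(pos, phone_idxs)
      result = result * num_states + allo.state
      result = result * num_boundary_classes + allo.boundary
      return result
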

classmethod from_index(index, phone_ids, num_states=3, context_length=1)[source]

Original Sprint C++ code:

  Mm::MixtureIndex NoStateTyingDense::classify(const AllophoneState& a) const {
    require_lt(a.allophone()->boundary, numBoundaryClasses_);
    require_le(0, a.state());
    require_lt(u32(a.state()), numStates_);
    u32 result = 0;
    for (u32 i = 0; i < 2 * contextLength_ + 1; ++i) {  // context len is usually 1
      // pos sequence: 0, -1, 1, [-2, 2, ...]
      s16 pos = i / 2;
      if (i % 2 == 1)
        pos = -pos - 1;
      result *= numPhoneClasses_;
      u32 phoneIdx = a.allophone()->phoneme(pos);
      require_lt(phoneIdx, numPhoneClasses_);
      result += phoneIdx;
    }
    result *= numStates_;
    result += u32(a.state());
    result *= numBoundaryClasses_;
    result += a.allophone()->boundary;
    require_lt(result, nClasses_);
    return result;
  }

Note that there is also AllophoneStateAlphabet::allophoneState, via Am/ClassicStateModel.cc, which unfortunately uses a different encoding. See from_classic_index().

Parameters:
  • index (int) –
  • phone_ids (dict[int,str]) – reverse-map from self.index(). idx -> id
  • num_states (int) – how many states per allophone
  • context_length (int) – how much left/right context
Return type:AllophoneState

classmethod from_classic_index(index, allophones, max_states=6)[source]

Via Sprint C++ Archiver.cc:getStateInfo():

  const u32 max_states = 6;  // TODO: should be increased for non-speech
  for (state = 0; state < max_states; ++state) {
    if (emission >= allophones_.size())
      emission -= (1 << 26);
    else
      break;
  }

Parameters:
  • index (int) –
  • max_states (int) –
  • allophones (dict[int,AllophoneState]) –
Return type:AllophoneState
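
A Python sketch of the same boundary-stripping loop as the Archiver.cc snippet above (illustrative only):

  def split_classic_index_sketch(emission, num_allophones, max_states=6):
      # Each step of (1 << 26) in the index encodes one further HMM state.
      state = 0
      while state < max_states and emission >= num_allophones:
          emission -= 1 << 26
          state += 1
      return emission, state  # (allophone index, HMM state)
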

class LmDataset.Lexicon(filename)[source]

Lexicon. Map of words to phoneme sequences (can have multiple pronunciations).

Parameters:filename (str) –
class LmDataset.StateTying(state_tying_file)[source]

Clustering of (allophone) states into classes.

Parameters:state_tying_file (str) –
class LmDataset.PhoneSeqGenerator(lexicon_file, allo_num_states=3, allo_context_len=1, state_tying_file=None, add_silence_beginning=0.1, add_silence_between_words=0.1, add_silence_end=0.1, repetition=0.9, silence_repetition=0.95)[source]

Generates phone sequences.

Parameters:
  • lexicon_file (str) – lexicon XML file
  • allo_num_states (int) – how many HMM states per allophone (all except silence)
  • allo_context_len (int) – how much context to store left and right. 1 -> triphone
  • state_tying_file (str|None) – for state-tying, if you want that
  • add_silence_beginning (float) – prob of adding silence at beginning
  • add_silence_between_words (float) – prob of adding silence between words
  • add_silence_end (float) – prob of adding silence at end
  • repetition (float) – prob of repeating an allophone
  • silence_repetition (float) – prob of repeating the silence allophone
random_seed(self, seed)[source]
Parameters:seed (int) –
get_class_labels(self)[source]
Return type:list[str]
seq_to_class_idxs(self, phones, dtype=None)[source]
Parameters:
  • phones (list[AllophoneState]) – list of allophone states
  • dtype (str) – e.g. “int32”
Returns:1D numpy array with the indices
Return type:numpy.ndarray

orth_to_phones(self, orth)[source]
Parameters:orth (str) –
Return type:str
generate_seq(self, orth)[source]
Parameters:orth (str) – orthography as a str. orth.split() should give words in the lexicon
Returns:allophone state list. Those will have repetitions etc.
Return type:list[AllophoneState]

generate_garbage_seq(self, target_len)[source]
Parameters:target_len (int) – len of the returned seq
Returns:allophone state list. Those will have repetitions etc. It will randomly generate a sequence of phonemes and transform that into a list of allophones in a similar way to generate_seq().
Return type:list[AllophoneState]
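
A usage sketch (the lexicon path is hypothetical):

  from LmDataset import PhoneSeqGenerator

  gen = PhoneSeqGenerator(
      lexicon_file="lexicon.xml",  # hypothetical path
      allo_num_states=3, allo_context_len=1)
  gen.random_seed(42)
  allos = gen.generate_seq("hello world")             # words must be in the lexicon
  idxs = gen.seq_to_class_idxs(allos, dtype="int32")  # 1D numpy array of class indices
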

class LmDataset.TranslationDataset(path, file_postfix, source_postfix='', target_postfix='', source_only=False, unknown_label=None, seq_list_file=None, use_cache_manager=False, **kwargs)[source]

Based on the conventions by our team for translation datasets. It gets a directory and expects these files:

  source.dev(.gz)?
  source.train(.gz)?
  source.vocab.pkl
  target.dev(.gz)?
  target.train(.gz)?
  target.vocab.pkl
Parameters:
  • path (str) – the directory containing the files
  • file_postfix (str) – e.g. “train” or “dev”. it will then search for “source.” + postfix and “target.” + postfix.
  • random_shuffle_epoch1 (bool) – if True, will also randomly shuffle epoch 1. see self.init_seq_order().
  • source_postfix (None|str) – will concat this at the end of the source.
  • target_postfix (None|str) – will concat this at the end of the target. You might want to add some sentence-end symbol.
  • source_only (bool) – if targets are not available
  • unknown_label (str|None) – “UNK” or similar. If not given, unknowns will not be replaced; an error is raised instead.
  • seq_list_file (str) – filename; line-separated list of line numbers defining a fixed sequence order. Multiple occurrences are supported, which allows repeating examples while loading them only once.
  • use_cache_manager (bool) – uses Util.cf() for files
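
A usage sketch (the directory and postfix values are hypothetical):

  from LmDataset import TranslationDataset

  # Expects source.train(.gz), target.train(.gz), source.vocab.pkl and
  # target.vocab.pkl inside the given directory.
  dataset = TranslationDataset(
      path="/data/mt",         # hypothetical directory
      file_postfix="train",
      target_postfix=" </S>",  # hypothetical sentence-end symbol
      unknown_label="UNK")
  dataset.init_seq_order(epoch=1)
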
MapToDataKeys = {'source': 'data', 'target': 'classes'}[source]
have_corpus_seq_idx(self)[source]
Return type:bool
get_corpus_seq_idx(self, seq_idx)[source]
Parameters:seq_idx (int) –
Return type:int
is_data_sparse(self, key)[source]
Parameters:key (str) –
Return type:bool
get_data_dtype(self, key)[source]
Parameters:key (str) –
Return type:str
init_seq_order(self, epoch=None, seq_list=None)[source]

If random_shuffle_epoch1, for epoch 1 with “random” ordering, we leave the given order as is. Otherwise, this is mostly the default behavior.

Parameters:
  • epoch (int|None) –
  • seq_list (list[str]|None) – In case we want to set a predefined order.
Returns:whether the order changed (True is always safe to return)
Return type:bool

class LmDataset.ConfusionNetworkDataset(max_density=20, **kwargs)[source]

This dataset allows for multiple (weighted) options for each word in the source sequence. In particular, it can be used to represent confusion networks. Two matrices (of dimension source length x max_density) will be provided as input to the network, one containing the word ids (“sparse_inputs”) and one containing the weights (“sparse_weights”). The matrices are read from the following input format (example):

“__ALT__ we’re|0.999659__were|0.000341148 a|0.977656__EPS|0.0223441 social|1.0 species|1.0”

Input positions are separated by a space; different word options at one position are separated by two underscores. Each word option has a weight appended to it, separated by “|”. If “__ALT__” is missing, the line is interpreted as a regular plain-text sentence; in that case, all weights are set to 1.0 and only one word option is used at each position. Epsilon arcs of confusion networks can be represented by a special token (e.g. “EPS”), which has to be added to the source vocabulary.

Via “seq_list_file” (see TranslationDataset) it is possible to give an explicit order of training examples. This can e.g. be used to repeat the confusion net part of the training data without loading it several times.
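
An illustrative parser for this line format (a sketch, not the dataset's own code):

  def parse_cn_line_sketch(line):
      # "__ALT__ we're|0.999659__were|0.000341148 ..." -> (words, weights)
      assert line.startswith("__ALT__ ")
      words, weights = [], []
      for slot in line[len("__ALT__ "):].split():  # positions separated by space
          options = slot.split("__")               # options separated by "__"
          pairs = [opt.rsplit("|", 1) for opt in options]
          words.append([w for w, _ in pairs])
          weights.append([float(p) for _, p in pairs])
      return words, weights
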

Parameters:
  • path (str) – the directory containing the files
  • file_postfix (str) – e.g. “train” or “dev”. it will then search for “source.” + postfix and “target.” + postfix.
  • random_shuffle_epoch1 (bool) – if True, will also randomly shuffle epoch 1. see self.init_seq_order().
  • source_postfix (None|str) – will concat this at the end of the source.
  • target_postfix (None|str) – will concat this at the end of the target. You might want to add some sentence-end symbol.
  • source_only (bool) – if targets are not available
  • unknown_label (str|None) – “UNK” or similar. If not given, unknowns will not be replaced; an error is raised instead.
  • max_density (int) – the density of the confusion network: max number of arcs per slot
MapToDataKeys = {'source': 'sparse_inputs', 'target': 'classes'}[source]
get_data_keys(self)[source]
Return type:list[str]
is_data_sparse(self, key)[source]
Parameters:key (str) –
Return type:bool
get_data_dtype(self, key)[source]
Parameters:key (str) –
Return type:str
get_data_shape(self, key)[source]
Parameters:key (str) –
Return type:list[int]
LmDataset.expand_abbreviations(text)[source]
Parameters:text (str) –
Return type:str
LmDataset.lowercase(text)[source]
Parameters:text (str) –
Return type:str
LmDataset.collapse_whitespace(text)[source]
Parameters:text (str) –
Return type:str
LmDataset.convert_to_ascii(text)[source]
Parameters:text (str) –
Return type:str
LmDataset.basic_cleaners(text)[source]

Basic pipeline that lowercases and collapses whitespace without transliteration.

Parameters:text (str) –
Return type:str
LmDataset.transliteration_cleaners(text)[source]

Pipeline for non-English text that transliterates to ASCII.

Parameters:text (str) –
Return type:str
LmDataset.english_cleaners(text)[source]

Pipeline for English text, including number and abbreviation expansion.

Parameters:text (str) –
Return type:str

LmDataset.get_remove_chars(chars)[source]
Parameters:chars (str|list[str]) –
Return type:(str)->str
LmDataset.normalize_numbers(text)[source]
Parameters:text (str) –
Return type:str
LmDataset.get_post_processor_function(opts)[source]

You might want to use inflect or unidecode for some normalization / cleanup. This function can be used to get such functions.

Parameters:opts (str|list[str]) – e.g. “english_cleaners”, or “get_remove_chars(‘,/’)”
Returns:function
Return type:(str)->str
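
A usage sketch (assuming, as the opts examples suggest, that names of the cleaner functions in this module are accepted):

  from LmDataset import get_post_processor_function

  f = get_post_processor_function("english_cleaners")
  f("Dr. Smith paid $5.")  # abbreviations and numbers get normalized

  g = get_post_processor_function(["lowercase", "collapse_whitespace"])
  g("Hello   World")       # -> "hello world"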