LmDataset

class LmDataset.AllophoneState(id=None, state=None)[source]
Parameters:
  • id (str) – phone
  • state (int|None) –
boundary = 0[source]
context_future = ()[source]
context_history = ()[source]
copy()[source]
format()[source]
classmethod from_classic_index(index, allophones, max_states=6)[source]

Via Sprint C++ Archiver.cc:getStateInfo():

const u32 max_states = 6;  // TODO: should be increased for non-speech
for (state = 0; state < max_states; ++state) {
  if (emission >= allophones_.size())
    emission -= (1 << 26);
  else
    break;
}

Parameters:
  • index (int) –
  • allophones –
  • max_states (int) – max. number of HMM states per allophone
Return type:

AllophoneState
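The quoted loop can be sketched in Python. `num_allophones` stands in for `allophones_.size()`; the `(1 << 26)` stride is taken directly from the snippet above, and the names here are illustrative, not the actual implementation:

```python
def classic_state_from_emission(emission, num_allophones, max_states=6):
    """Sketch of the Archiver.cc getStateInfo() loop quoted above:
    the HMM state index is encoded as multiples of (1 << 26) added
    on top of the allophone index."""
    state = 0
    for state in range(max_states):
        if emission >= num_allophones:
            emission -= 1 << 26
        else:
            break
    return state, emission  # (HMM state, allophone index)
```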

classmethod from_index(index, phone_ids, num_states=3, context_length=1)[source]

Original Sprint C++ code:

Mm::MixtureIndex NoStateTyingDense::classify(const AllophoneState& a) const {
  require_lt(a.allophone()->boundary, numBoundaryClasses_);
  require_le(0, a.state());
  require_lt(u32(a.state()), numStates_);
  u32 result = 0;
  for (u32 i = 0; i < 2 * contextLength_ + 1; ++i) {  // context len is usually 1
    // pos sequence: 0, -1, 1, [-2, 2, ...]
    s16 pos = i / 2;
    if (i % 2 == 1)
      pos = -pos - 1;
    result *= numPhoneClasses_;
    u32 phoneIdx = a.allophone()->phoneme(pos);
    require_lt(phoneIdx, numPhoneClasses_);
    result += phoneIdx;
  }
  result *= numStates_;
  result += u32(a.state());
  result *= numBoundaryClasses_;
  result += a.allophone()->boundary;
  require_lt(result, nClasses_);
  return result;
}

Note that there is also AllophoneStateAlphabet::allophoneState, via Am/ClassicStateModel.cc, which unfortunately uses a different encoding. See from_classic_index().

Parameters:
  • index (int) –
  • phone_ids (dict[int,str]) – reverse-map from self.index(). idx -> id
  • num_states (int) – how many states per allophone
  • context_length (int) – how much left/right context
Return type:

AllophoneState
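The index()/from_index() pair mirrors the classify() code quoted above. Below is a hedged Python sketch of the dense encoding and its inverse; `num_boundary_classes` is an assumed parameter (the C++ `numBoundaryClasses_`, default 4 chosen here only for illustration), and `phones` maps context offset -> phone class index:

```python
def dense_index(phones, state, boundary, num_phone_classes,
                num_states=3, num_boundary_classes=4, context_length=1):
    """Sketch of NoStateTyingDense::classify() in Python.
    Offsets are visited in the pos sequence 0, -1, 1, [-2, 2, ...]."""
    result = 0
    for i in range(2 * context_length + 1):
        pos = i // 2
        if i % 2 == 1:
            pos = -pos - 1
        result = result * num_phone_classes + phones[pos]
    result = result * num_states + state
    result = result * num_boundary_classes + boundary
    return result


def dense_index_inverse(index, num_phone_classes,
                        num_states=3, num_boundary_classes=4, context_length=1):
    """Inverse of dense_index(): peel off the factors in reverse order."""
    boundary = index % num_boundary_classes
    index //= num_boundary_classes
    state = index % num_states
    index //= num_states
    phones = {}
    for i in reversed(range(2 * context_length + 1)):
        pos = i // 2
        if i % 2 == 1:
            pos = -pos - 1
        phones[pos] = index % num_phone_classes
        index //= num_phone_classes
    return phones, state, boundary
```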

id = None[source]
index(phone_idxs, num_states=3, context_length=1)[source]

See self.from_index() for the inverse function. And see Sprint NoStateTyingDense::classify().

Parameters:
  • phone_idxs (dict[str,int]) –
  • num_states (int) – how many states per allophone
  • context_length (int) – how much left/right context
Return type:

int

mark_final()[source]
mark_initial()[source]
phone_idx(ctx_offset, phone_idxs)[source]
Parameters:
  • ctx_offset (int) – see self.phoneme()
  • phone_idxs (dict[str,int]) –
Return type:

int

phoneme(ctx_offset, out_of_context_id=None)[source]
Phoneme::Id ContextPhonology::PhonemeInContext::phoneme(s16 pos) const {
  if (pos == 0)
    return phoneme_;
  else if (pos > 0) {
    if (u16(pos - 1) < context_.future.length())
      return context_.future[pos - 1];
    else
      return Phoneme::term;
  } else {
    verify(pos < 0);
    if (u16(-1 - pos) < context_.history.length())
      return context_.history[-1 - pos];
    else
      return Phoneme::term;
  }
}

Parameters:
  • ctx_offset (int) – 0 for center, >0 for future, <0 for history
  • out_of_context_id (str|None) – what to return if the offset is outside the stored context
Returns:

phone-id from the offset

Return type:

str
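The lookup quoted above translates almost directly to Python. This is a standalone sketch (the real method works on the instance's context_history/context_future attributes); names here are illustrative:

```python
def phoneme(ctx_offset, center, history=(), future=(), out_of_context_id=None):
    """Sketch of the phoneme() lookup: 0 is the center phone,
    positive offsets index the future context, negative offsets the
    history; out-of-range offsets yield out_of_context_id."""
    if ctx_offset == 0:
        return center
    if ctx_offset > 0:
        i = ctx_offset - 1
        return future[i] if i < len(future) else out_of_context_id
    i = -1 - ctx_offset
    return history[i] if i < len(history) else out_of_context_id
```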

set_phoneme(ctx_offset, phone_id)[source]
Parameters:
  • ctx_offset (int) – 0 for center, >0 for future, <0 for history
  • phone_id (str) –
state = None[source]
class LmDataset.Lexicon(filename)[source]
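The Lexicon class reads a Bliss lexicon XML file. A minimal sketch of that kind of parsing, assuming the common Bliss convention of `<lemma>` elements with `<orth>` and `<phon>` children (the element names are an assumption here, not taken from the actual implementation):

```python
import xml.etree.ElementTree as ElementTree


def parse_bliss_lexicon(xml_string):
    """Hedged sketch: map each lemma's orthography to its list of
    pronunciations (phoneme strings)."""
    root = ElementTree.fromstring(xml_string)
    lemmas = {}
    for lemma in root.findall("lemma"):
        orth_elem = lemma.find("orth")
        if orth_elem is None or orth_elem.text is None:
            continue
        orth = orth_elem.text.strip()
        phons = [p.text.strip() for p in lemma.findall("phon") if p.text]
        lemmas[orth] = phons
    return lemmas
```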
class LmDataset.LmDataset(corpus_file, orth_symbols_file=None, orth_symbols_map_file=None, orth_replace_map_file=None, word_based=False, seq_end_symbol='[END]', unknown_symbol='[UNKNOWN]', parse_orth_opts=None, phone_info=None, add_random_phone_seqs=0, partition_epoch=1, auto_replace_unknown_symbol=False, log_auto_replace_unknown_symbols=10, log_skipped_seqs=10, error_on_invalid_seq=True, add_delayed_seq_data=False, delayed_seq_data_start_symbol='[START]', **kwargs)[source]
Parameters:
  • corpus_file (str|()->str) – Bliss XML or line-based txt. optionally can be gzip.
  • phone_info (dict|None) – if you want to get phone seqs, dict with lexicon_file etc. see PhoneSeqGenerator
  • orth_symbols_file (str|()->str|None) – list of orthography symbols, if you want to get orth symbol seqs
  • orth_symbols_map_file (str|()->str|None) – list of orth symbols, each line: “symbol index”
  • orth_replace_map_file (str|()->str|None) – JSON file with replacement dict for orth symbols
  • word_based (bool) – whether to parse word-based; otherwise char-based
  • seq_end_symbol (str|None) – what to add at the end, if given. will be set as postfix=[seq_end_symbol] or postfix=[] for parse_orth_opts.
  • parse_orth_opts (dict[str]|None) – kwargs for parse_orthography()
  • add_random_phone_seqs (int) – will add random seqs with the same len as the real seq as additional data
  • log_auto_replace_unknown_symbols (bool|int) – write about auto-replacements with unknown symbol. if this is an int, it will only log the first N replacements, and then keep quiet.
  • log_skipped_seqs (bool|int) – write about skipped seqs to logging, due to missing lexicon entry or so. if this is an int, it will only log the first N entries, and then keep quiet.
  • error_on_invalid_seq (bool) – if there is a seq we would have to skip, error
  • add_delayed_seq_data (bool) – will add another data-key “delayed” which will have the sequence delayed_seq_data_start_symbol + original_sequence[:-1]
  • delayed_seq_data_start_symbol (str) – used for add_delayed_seq_data
  • partition_epoch (int) – partition each epoch into this many parts (similar to epoch_split)
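The add_delayed_seq_data option described above boils down to a one-symbol right shift of the sequence:

```python
def make_delayed_seq(seq, start_symbol="[START]"):
    """Sketch of the "delayed" data key: the start symbol followed by
    the original sequence without its last element."""
    return [start_symbol] + list(seq[:-1])
```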
get_data_dtype(key)[source]
get_target_list()[source]
init_seq_order(epoch=None, seq_list=None)[source]
class LmDataset.PhoneSeqGenerator(lexicon_file, allo_num_states=3, allo_context_len=1, state_tying_file=None, add_silence_beginning=0.1, add_silence_between_words=0.1, add_silence_end=0.1, repetition=0.9, silence_repetition=0.95)[source]
Parameters:
  • lexicon_file (str) – lexicon XML file
  • allo_num_states (int) – how many HMM states per allophone (all but silence)
  • allo_context_len (int) – how much context to store left and right. 1 -> triphone
  • state_tying_file (str|None) – for state-tying, if you want that
  • add_silence_beginning (float) – prob of adding silence at beginning
  • add_silence_between_words (float) – prob of adding silence between words
  • add_silence_end (float) – prob of adding silence at end
  • repetition (float) – prob of repeating an allophone
  • silence_repetition (float) – prob of repeating the silence allophone
generate_garbage_seq(target_len)[source]
Parameters:target_len (int) – len of the returned seq
Return type:list[AllophoneState]

:returns allophone state list. those will have repetitions etc. It will randomly generate a sequence of phonemes and transform it into a list of allophones in a similar way as generate_seq().

generate_seq(orth)[source]
Parameters:orth (str) – orthography as a str. orth.split() should give words in the lexicon
Return type:list[AllophoneState]

:returns allophone state list. those will have repetitions etc

get_class_labels()[source]
orth_to_phones(orth)[source]
random_seed(seed)[source]
seq_to_class_idxs(phones, dtype=None)[source]
Parameters:
  • phones (list[AllophoneState]) – list of allophone states
  • dtype (str) – eg “int32”
Return type:

numpy.ndarray

:returns 1D numpy array with the indices
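The mapping from allophone states to class indices can be sketched as a simple table lookup; `state_tying` here is a hypothetical dict (e.g. as loaded from a state-tying file, keyed by the allophone state's format() string), not the actual internal representation:

```python
import numpy as np


def seq_to_class_idxs(phones, state_tying, dtype="int32"):
    """Hedged sketch: map each allophone state (represented here by
    its string form) to a class index via a state-tying table."""
    return np.array([state_tying[p] for p in phones], dtype=dtype)
```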

class LmDataset.StateTying(state_tying_file)[source]
class LmDataset.TranslationDataset(path, file_postfix, partition_epoch=None, target_postfix='', **kwargs)[source]

Based on the conventions by our team for translation datasets. It gets a directory and expects these files:

source.dev(.gz)?
source.train(.gz)?
source.vocab.pkl
target.dev(.gz)?
target.train(.gz)?
target.vocab.pkl
Parameters:
  • path (str) – the directory containing the files
  • file_postfix (str) – e.g. “train” or “dev”. it will then search for “source.” + postfix and “target.” + postfix.
  • random_shuffle_epoch1 (bool) – if True, will also randomly shuffle epoch 1. see self.init_seq_order().
  • partition_epoch (int) – if provided, will partition the dataset into multiple epochs
  • target_postfix (None|str) – will concat this at the end of the target. You might want to add some sentence-end symbol.
MapToDataKeys = {'source': 'data', 'target': 'classes'}[source]
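The filename convention described above can be sketched as follows; the helper name is illustrative, and the plain/".gz" lookup order is an assumption:

```python
import os


def resolve_translation_files(path, file_postfix):
    """Sketch: for each of source/target, pick "<prefix>.<postfix>" or
    its ".gz" variant, mapped to the data keys from MapToDataKeys."""
    map_to_data_keys = {"source": "data", "target": "classes"}
    files = {}
    for prefix, data_key in map_to_data_keys.items():
        for fn in ("%s.%s" % (prefix, file_postfix),
                   "%s.%s.gz" % (prefix, file_postfix)):
            full = os.path.join(path, fn)
            if os.path.exists(full):
                files[data_key] = full
                break
    return files
```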
get_data_dtype(key)[source]
init_seq_order(epoch=None, seq_list=None)[source]

If random_shuffle_epoch1, for epoch 1 with “random” ordering, we leave the given order as is. Otherwise, this is mostly the default behavior.

Parameters:
  • epoch (int|None) –
  • seq_list (list[str]|None) – In case we want to set a predefined order.
Return type:

bool

:returns whether the order changed (True is always safe to return)

is_data_sparse(key)[source]