class LmDataset.AllophoneState(id=None, state=None)
boundary = 0
context_future = ()
context_history = ()
id = None
state = None
class LmDataset.Lexicon(filename)
class LmDataset.LmDataset(corpus_file, orth_symbols_file=None, orth_symbols_map_file=None, orth_replace_map_file=None, word_based=False, seq_end_symbol='[END]', unknown_symbol='[UNKNOWN]', parse_orth_opts=None, phone_info=None, add_random_phone_seqs=0, partition_epoch=1, auto_replace_unknown_symbol=False, log_auto_replace_unknown_symbols=10, log_skipped_seqs=10, error_on_invalid_seq=True, add_delayed_seq_data=False, delayed_seq_data_start_symbol='[START]', **kwargs)
  • corpus_file (str|()->str) – Bliss XML or line-based text file. can optionally be gzip-compressed.
  • phone_info (dict|None) – if you want to get phone seqs, dict with lexicon_file etc. see PhoneSeqGenerator
  • orth_symbols_file (str|()->str|None) – list of orthography symbols, if you want to get orth symbol seqs
  • orth_symbols_map_file (str|()->str|None) – list of orth symbols, each line: “symbol index”
  • orth_replace_map_file (str|()->str|None) – JSON file with replacement dict for orth symbols
  • word_based (bool) – if True, parse whole words. otherwise parsing is char-based
  • seq_end_symbol (str|None) – what to add at the end, if given. will be set as postfix=[seq_end_symbol] or postfix=[] for parse_orth_opts.
  • parse_orth_opts (dict[str]|None) – kwargs for parse_orthography()
  • add_random_phone_seqs (int) – will add random seqs with the same len as the real seq as additional data
  • log_auto_replace_unknown_symbols (bool|int) – write about auto-replacements with unknown symbol. if this is an int, it will only log the first N replacements, and then keep quiet.
  • log_skipped_seqs (bool|int) – write about skipped seqs to logging, due to missing lexicon entry or so. if this is an int, it will only log the first N entries, and then keep quiet.
  • error_on_invalid_seq (bool) – raise an error if there is a seq we would otherwise have to skip
  • add_delayed_seq_data (bool) – will add another data-key “delayed” which will have the sequence delayed_seq_data_start_symbol + original_sequence[:-1]
  • delayed_seq_data_start_symbol (str) – used for add_delayed_seq_data
  • partition_epoch (int) – number of parts to partition each epoch into. like epoch_split
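The interaction between word_based, seq_end_symbol and add_delayed_seq_data can be sketched in a few lines of plain Python. This is only an illustration of the behaviour described in the parameter list, not the dataset's own code: parse_line and make_delayed are hypothetical helper names, and the real parse_orthography() handles more cases than a simple split.

```python
def parse_line(line, word_based=False, seq_end_symbol="[END]"):
    """Split a text line into symbols and append the end symbol, if given.

    Hypothetical helper: a simplified stand-in for orthography parsing.
    """
    # Word-based: split on whitespace. Otherwise: one symbol per character.
    symbols = line.split() if word_based else list(line)
    # Mirrors postfix=[seq_end_symbol] or postfix=[] from the description.
    postfix = [seq_end_symbol] if seq_end_symbol is not None else []
    return symbols + postfix


def make_delayed(seq, start_symbol="[START]"):
    """Build the "delayed" data: start symbol + original_sequence[:-1]."""
    return [start_symbol] + seq[:-1]


data = parse_line("hello world", word_based=True)
delayed = make_delayed(data)
print(data)     # ['hello', 'world', '[END]']
print(delayed)  # ['[START]', 'hello', 'world']
```

The delayed sequence has the same length as the original, so position i of "delayed" holds the symbol that precedes position i of the data.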
init_seq_order(epoch=None, seq_list=None)
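One way to picture what partition_epoch means for the sequence order of a given epoch is the sketch below. It rests on an assumption, namely that the full sequence order is cut into partition_epoch contiguous parts and that consecutive epochs cycle through them; the actual implementation may differ. partition_for_epoch is a hypothetical name.

```python
def partition_for_epoch(seq_order, epoch, partition_epoch=1):
    """Return the part of seq_order assumed to be used in a 1-based epoch.

    Assumption: contiguous parts, selected cyclically by epoch number.
    """
    if partition_epoch < 1:
        raise ValueError("partition_epoch must be >= 1")
    num_seqs = len(seq_order)
    part = (epoch - 1) % partition_epoch
    # Integer arithmetic keeps the parts contiguous and covers every sequence.
    start = part * num_seqs // partition_epoch
    end = (part + 1) * num_seqs // partition_epoch
    return seq_order[start:end]


order = list(range(10))
for epoch in (1, 2, 3):
    print(epoch, partition_for_epoch(order, epoch, partition_epoch=3))
```

Under this assumption, partition_epoch consecutive epochs together cover the whole corpus exactly once.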
class LmDataset.PhoneSeqGenerator(lexicon_file, allo_num_states=3, allo_context_len=1, state_tying_file=None, add_silence_beginning=0.1, add_silence_between_words=0.1, add_silence_end=0.1, repetition=0.9, silence_repetition=0.95)
  • lexicon_file (str) – lexicon XML file
  • allo_num_states (int) – how many HMM states per allophone (all but silence)
  • allo_context_len (int) – how much context to store left and right. 1 -> triphone
  • state_tying_file (str|None) – for state-tying, if you want that
  • add_silence_beginning (float) – prob of adding silence at beginning
  • add_silence_between_words (float) – prob of adding silence between words
  • add_silence_end (float) – prob of adding silence at end
  • repetition (float) – prob of repeating an allophone
  • silence_repetition (float) – prob of repeating the silence allophone
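The silence and repetition probabilities above can be read as independent random draws. The sketch below shows one plausible interpretation: silence is inserted at each candidate position with the given probability, and each unit keeps being repeated as long as a draw falls below its repetition probability. The function names and the "[SILENCE]" marker are hypothetical, and the real generator works on allophone states rather than plain strings.

```python
import random


def add_silence(words, p_begin=0.1, p_between=0.1, p_end=0.1,
                silence="[SILENCE]", rng=None):
    """Insert a silence marker at the beginning, between words, and at the end."""
    if rng is None:
        rng = random.Random()
    out = []
    if rng.random() < p_begin:
        out.append(silence)
    for i, word in enumerate(words):
        if i > 0 and rng.random() < p_between:
            out.append(silence)
        out.append(word)
    if rng.random() < p_end:
        out.append(silence)
    return out


def expand_with_repetitions(units, repetition=0.9, silence_repetition=0.95,
                            silence="[SILENCE]", rng=None):
    """Emit each unit once, then repeat it while a random draw is below p."""
    if rng is None:
        rng = random.Random()
    out = []
    for unit in units:
        p = silence_repetition if unit == silence else repetition
        if not 0.0 <= p < 1.0:
            raise ValueError("repetition probability must be in [0, 1)")
        out.append(unit)
        while rng.random() < p:
            out.append(unit)
    return out


rng = random.Random(0)
seq = add_silence(["a", "b"], rng=rng)
print(expand_with_repetitions(seq, rng=rng))
```

With repetition=0.9, each unit is emitted about ten times on average under this reading, which roughly models how long a state lasts in frames.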
Parameters: target_len (int) – len of the returned seq
Return type: list[AllophoneState]

Returns: allophone state list. Those will have repetitions etc. It will randomly generate a sequence of phonemes and transform that into a list of allophones in a way similar to generate_seq().

Parameters: orth (str) – orthography as a str. orth.split() should give words in the lexicon
Return type: list[AllophoneState]

Returns: allophone state list. Those will have repetitions etc.
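To make the path from an orthography string to allophone states more concrete, here is a simplified sketch. It assumes a lexicon that maps each word to a single phone list and builds context tuples of length allo_context_len on each side (1 gives a triphone). SketchAllophoneState, LEXICON and orth_to_allophone_states are hypothetical names for illustration only; silence, repetitions and pronunciation variants are left out, and how the real generator treats context across word boundaries is not covered here.

```python
from dataclasses import dataclass


@dataclass
class SketchAllophoneState:
    """Hypothetical stand-in with the attributes listed for AllophoneState."""
    id: str
    context_history: tuple = ()
    context_future: tuple = ()
    state: int = 0


# Hypothetical toy lexicon: word -> phone sequence.
LEXICON = {"hi": ["h", "ay"], "you": ["y", "uw"]}


def orth_to_allophone_states(orth, lexicon, allo_num_states=3,
                             allo_context_len=1):
    """Look up each word of orth.split() and emit one entry per HMM state."""
    phones = [p for word in orth.split() for p in lexicon[word]]
    out = []
    for i, phone in enumerate(phones):
        # Up to allo_context_len phones to the left and right of position i.
        history = tuple(phones[max(0, i - allo_context_len):i])
        future = tuple(phones[i + 1:i + 1 + allo_context_len])
        for state in range(allo_num_states):
            out.append(SketchAllophoneState(phone, history, future, state))
    return out


states = orth_to_allophone_states("hi you", LEXICON)
print(len(states))  # 12
print(states[3])
```

A word missing from the lexicon raises KeyError in this sketch, which corresponds to the kind of sequence that the dataset would either skip or reject, depending on error_on_invalid_seq.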

seq_to_class_idxs(phones, dtype=None)
  • phones (list[AllophoneState]) – list of allophone states
  • dtype (str) – eg “int32”
Return type: numpy.ndarray

Returns: 1D numpy array with the indices

class LmDataset.StateTying(state_tying_file)