Text Datasets

Language Model Dataset

class LmDataset.LmDataset(corpus_file, skip_empty_lines=True, orth_symbols_file=None, orth_symbols_map_file=None, orth_replace_map_file=None, word_based=False, word_end_symbol=None, seq_end_symbol='[END]', unknown_symbol='[UNKNOWN]', parse_orth_opts=None, phone_info=None, add_random_phone_seqs=0, auto_replace_unknown_symbol=False, log_auto_replace_unknown_symbols=10, log_skipped_seqs=10, error_on_invalid_seq=True, add_delayed_seq_data=False, delayed_seq_data_start_symbol='[START]', **kwargs)

Bases: CachedDataset2

Dataset useful for language modeling. It creates index sequences for words, characters or other orthographic symbols based on a vocabulary. Can also perform internal word-to-phoneme conversion with a lexicon file. Reads simple txt files or Bliss XML files (also gzipped).

To use the LmDataset with words or characters, either orth_symbols_file or orth_symbols_map_file has to be specified (specifying both is not possible). If words should be used, set word_based to True.
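
For example, a word-based setup could be configured like the following sketch. The file names are hypothetical; in a RETURNN config, a dataset is typically given as a dict with a "class" key:

    # Minimal sketch of a word-based LmDataset in a RETURNN config.
    # "corpus.txt.gz" and "vocab.syms" are hypothetical file names.
    train = {
        "class": "LmDataset",
        "corpus_file": "corpus.txt.gz",         # line-based txt, gzipped
        "orth_symbols_map_file": "vocab.syms",  # lines of "<symbol> <index>"
        "word_based": True,                     # index whole words, not characters
        "seq_end_symbol": "[END]",
        "unknown_symbol": "[UNKNOWN]",
        "auto_replace_unknown_symbol": True,    # map out-of-vocabulary words to [UNKNOWN]
    }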

The LmDataset also supports the conversion of words to phonemes with the help of the LmDataset.PhoneSeqGenerator class. To enable this mode, the input parameters for LmDataset.PhoneSeqGenerator have to be provided as a dict in phone_info. As a lexicon file has to be specified in this dict, orth_symbols_file and orth_symbols_map_file are not used in this case.
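
A phone-based setup could look like the sketch below. The phone_info keys shown here are assumptions based on the parameters of LmDataset.PhoneSeqGenerator (check its documentation for the full set), and the file paths are hypothetical:

    # Sketch of word-to-phoneme conversion via phone_info.
    # "lexicon.xml.gz" is a hypothetical Bliss lexicon file; the keys mirror
    # LmDataset.PhoneSeqGenerator parameters (assumed here, not exhaustive).
    train = {
        "class": "LmDataset",
        "corpus_file": "corpus.txt.gz",
        "phone_info": {
            "lexicon_file": "lexicon.xml.gz",  # lexicon used for the conversion
            "allo_num_states": 3,              # assumed option of PhoneSeqGenerator
        },
        # orth_symbols_file / orth_symbols_map_file must not be set in this mode
    }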

The LmDataset does not work without a vocabulary provided in one of the ways mentioned above.

After initialization, the corpus is represented by self.orths (as a list of sequences). The vocabulary is given by self.orth_symbols, and self.orth_symbols_map gives the corresponding mapping from symbol to integer index (in case phone_info is not set).
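
These attributes can be inspected directly after construction, e.g. (a minimal sketch, assuming the hypothetical word-based setup from above):

    from LmDataset import LmDataset  # module path as in the class reference above

    dataset = LmDataset(
        corpus_file="corpus.txt.gz",         # hypothetical corpus
        orth_symbols_map_file="vocab.syms",  # hypothetical vocabulary
        word_based=True,
    )
    print(len(dataset.orths))                   # number of sequences in the corpus
    print(len(dataset.orth_symbols))            # vocabulary size
    print(dataset.orth_symbols_map.get("the"))  # integer index of a symbol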

Parameters:
  • corpus_file (str|()->str|list[str]|()->list[str]) – Bliss XML or line-based txt, optionally gzipped.

  • skip_empty_lines (bool) – for line-based txt

  • orth_symbols_file (str|()->str|None) – a text file containing a list of orthography symbols

  • orth_symbols_map_file (str|()->str|None) – a file containing either a list of orth symbols with one “<symbol> <index>” pair per line, a Python dict with {“<symbol>”: <index>, …}, or a pickled dictionary

  • orth_replace_map_file (str|()->str|None) – JSON file with replacement dict for orth symbols.

  • word_based (bool) – whether to parse single words; otherwise, parsing will be character-based.

  • word_end_symbol (str|None) – If provided and if word_based is False (character based modeling), token to be used to represent word ends.

  • seq_end_symbol (str|None) – what to add at the end, if given. will be set as postfix=[seq_end_symbol] or postfix=[] for parse_orth_opts.

  • unknown_symbol (str|None) – token to represent unknown words.

  • parse_orth_opts (dict[str]|None) – kwargs for parse_orthography().

  • phone_info (dict|None) – A dict containing parameters including a lexicon file for LmDataset.PhoneSeqGenerator.

  • add_random_phone_seqs (int) – will add random seqs with the same length as the real seq as additional data.

  • auto_replace_unknown_symbol (bool) – if True, symbols not found in the vocabulary (e.g. out-of-vocabulary words) are automatically replaced by unknown_symbol; otherwise such sequences are skipped or raise an error (see error_on_invalid_seq).

  • log_auto_replace_unknown_symbols (bool|int) – write about auto-replacements with unknown symbol. if this is an int, it will only log the first N replacements, and then keep quiet.

  • log_skipped_seqs (bool|int) – write about skipped seqs to logging, due to missing lexicon entry or so. if this is an int, it will only log the first N entries, and then keep quiet.

  • error_on_invalid_seq (bool) – if there is a seq we would have to skip, error.

  • add_delayed_seq_data (bool) – will add another data key “delayed” which will contain the sequence delayed_seq_data_start_symbol + original_sequence[:-1] (see the sketch after this parameter list).

  • delayed_seq_data_start_symbol (str) – used for add_delayed_seq_data.
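
The “delayed” stream mentioned for add_delayed_seq_data is simply the original sequence shifted right by one position. A pure-Python illustration (not the actual implementation; the index of the start symbol is hypothetical):

    # "data" is the original index sequence; "delayed" prepends the start
    # symbol and drops the last label, keeping both streams the same length.
    start_idx = 0                          # hypothetical index of "[START]"
    original = [4, 17, 3, 9]               # "data": index sequence
    delayed = [start_idx] + original[:-1]  # "delayed": [0, 4, 17, 3]
    assert len(delayed) == len(original)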

Translation Dataset

class LmDataset.TranslationDataset(path, file_postfix, source_postfix='', target_postfix='', source_only=False, search_without_reference=False, unknown_label=None, seq_list_file=None, use_cache_manager=False, **kwargs)

Bases: CachedDataset2

Based on the conventions by our team for translation datasets. It gets a directory and expects these files:

  • source.dev(.gz)

  • source.train(.gz)

  • source.vocab.pkl

  • target.dev(.gz)

  • target.train(.gz)

  • target.vocab.pkl

The convention is to use “dev” and “train” as file_postfix for the dev and train set respectively, but any file_postfix can be used. The target files and vocabulary do not have to exist when setting source_only. It is also automatically checked whether a gzipped version of each file exists.

To follow the RETURNN conventions on data input and output, the source text is mapped to the “data” key, and the target text to the “classes” data key. Both are index sequences.
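
A typical training-set configuration could look like this sketch (the directory path and the sentence-end symbol are hypothetical):

    # Sketch of a TranslationDataset config; "/data/my-mt-corpus" is a
    # hypothetical directory containing the files listed above.
    train = {
        "class": "TranslationDataset",
        "path": "/data/my-mt-corpus",
        "file_postfix": "train",    # reads source.train(.gz) and target.train(.gz)
        "target_postfix": " </S>",  # hypothetical sentence-end symbol appended to each target
        "unknown_label": "<UNK>",   # replace out-of-vocabulary words instead of raising an error
    }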

Parameters:
  • path (str) – the directory containing the files

  • file_postfix (str) – e.g. “train” or “dev”. it will then search for “source.” + postfix and “target.” + postfix.

  • random_shuffle_epoch1 (bool) – if True, will also randomly shuffle epoch 1. see self.init_seq_order().

  • source_postfix (str) – will concat this at the end of the source.

  • target_postfix (str) – will concat this at the end of the target. You might want to add some sentence-end symbol.

  • source_only (bool) – if targets are not available

  • search_without_reference (bool) –

  • unknown_label (str|dict[str,str]|None) – Label to replace out-of-vocabulary words with, e.g. “<UNK>”. If not given, will not replace unknowns but throw an error. Can also be a dict data_key -> unknown_label to configure for each data key separately (default for each key is None).

  • seq_list_file (str) – filename. line-separated list of line numbers defining fixed sequence order. multiple occurrences supported, thus allows for repeating examples while loading only once.

  • use_cache_manager (bool) – uses Util.cf() for files