returnn.datasets.util.vocabulary
¶
Vocabulary-related classes for targets such as BPE, SentencePieces, etc.
- class returnn.datasets.util.vocabulary.Vocabulary(vocab_file, seq_postfix=None, unknown_label='UNK', bos_label=None, eos_label=None, pad_label=None, control_symbols=None, user_defined_symbols=None, num_labels=None, labels=None)[source]¶
Represents a vocabulary (a set of labels and their ids). Used by BytePairEncoding.
- Parameters:
vocab_file (str|None)
unknown_label (str|int|None) – e.g. “UNK” or “<unk>”
bos_label (str|int|None) – e.g. “<s>”
eos_label (str|int|None) – e.g. “</s>”
pad_label (str|int|None) – e.g. “<pad>”
control_symbols (dict[str,str|int]|None) – https://github.com/google/sentencepiece/blob/master/doc/special_symbols.md
user_defined_symbols (dict[str,str|int]|None) – https://github.com/google/sentencepiece/blob/master/doc/special_symbols.md
num_labels (int) – just for verification
seq_postfix (list[int]|None) – labels will be added to the seq in self.get_seq
labels (list[str]|(()->list[str])|None)
- set_random_seed(seed: int)[source]¶
This can be called for a new epoch or similar. Usually it has no effect, as there is no randomness; however, some vocab classes (e.g. SamplingBytePairEncoding) introduce a sampling process.
- Parameters:
seed
- classmethod create_vocab_dict_from_labels(labels)[source]¶
This is exactly the format which we expect when we read it in self._parse_vocab.
- Parameters:
labels (list[str])
- Return type:
dict[str,int]
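As a sketch of the returned format (assuming labels are simply numbered in order of appearance, which is the natural reading of the docstring but not stated explicitly):

```python
def create_vocab_dict_from_labels(labels):
    """Map each label string to its position index; sketch of the dict[str,int] format."""
    return {label: i for i, label in enumerate(labels)}

vocab_dict = create_vocab_dict_from_labels(["<s>", "</s>", "hello", "world"])
# e.g. {"<s>": 0, "</s>": 1, "hello": 2, "world": 3}
```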
- classmethod create_vocab_from_labels(labels, **kwargs)[source]¶
Creates a Vocabulary from the given labels. Depending on whether the labels are identified as bytes, characters, or words, a Utf8ByteTargets, CharacterTargets, or Vocabulary instance is created.
- Parameters:
labels (list[str])
- Return type:
Vocabulary
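The dispatch described above could be sketched with a heuristic like the following; the actual criteria RETURNN uses to identify bytes vs. characters may differ, so treat this only as an illustration:

```python
def classify_labels(labels):
    """Heuristic dispatch sketch: decide which vocab class fits the labels."""
    if len(labels) == 256 and all(len(l) == 1 for l in labels):
        return "Utf8ByteTargets"   # looks like the 256 possible byte values
    if all(len(l) == 1 for l in labels):
        return "CharacterTargets"  # single characters
    return "Vocabulary"            # generic word/subword labels

print(classify_labels(["hello", "world"]))  # Vocabulary
```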
- tf_get_init_variable_func(var)[source]¶
- Parameters:
var (tensorflow.Variable)
- Return type:
(tensorflow.Session)->None
- to_id(label: str | int | None, default: str | ~typing.Type[KeyError] | None = <class 'KeyError'>, allow_none: bool = False) → int | None[source]¶
- Parameters:
label
default
allow_none – whether label can be None; in that case, None is returned
- label_to_id(label: str, default: int | ~typing.Type[KeyError] | None = <class 'KeyError'>) → int | None[source]¶
- Parameters:
label
default
- id_to_label(idx: int, default: str | ~typing.Type[KeyError] | None = <class 'KeyError'>) → str | None[source]¶
- Parameters:
idx
default
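The default handling shared by these lookup methods can be sketched as follows. This is an illustration of the documented semantics, not the actual implementation: passing the KeyError class (the default) raises on a missing label, while any other default value is returned instead.

```python
def label_to_id_sketch(vocab: dict, label: str, default=KeyError):
    """Look up a label id; default=KeyError raises, any other default is returned."""
    if label in vocab:
        return vocab[label]
    if default is KeyError:
        raise KeyError(f"label {label!r} not in vocab")
    return default

vocab = {"hello": 0, "world": 1, "UNK": 2}
print(label_to_id_sketch(vocab, "hello"))              # 0
print(label_to_id_sketch(vocab, "xyz", default=None))  # None
```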
- get_seq(sentence: str) → List[int][source]¶
- Parameters:
sentence – assumed to be seq of vocab entries separated by whitespace
- Returns:
seq of label indices
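Based on the parameter descriptions above, the behavior can be sketched like this (a simplification; the unknown-label handling here is an assumption, and subword vocabs like BPE do more than a plain lookup):

```python
def get_seq_sketch(vocab: dict, sentence: str, seq_postfix=None, unknown_id=None):
    """Whitespace-split a sentence into vocab ids, then append seq_postfix."""
    seq = [vocab.get(w, unknown_id) for w in sentence.split()]
    return seq + (seq_postfix or [])

vocab = {"hello": 0, "world": 1}
print(get_seq_sketch(vocab, "hello world", seq_postfix=[2]))  # [0, 1, 2]
```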
- class returnn.datasets.util.vocabulary.BytePairEncoding(vocab_file, bpe_file, seq_postfix=None, **kwargs)[source]¶
Vocab based on Byte-Pair-Encoding (BPE). This will encode the text on-the-fly with BPE.
Reference: Rico Sennrich, Barry Haddow and Alexandra Birch (2016). Neural Machine Translation of Rare Words with Subword Units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). Berlin, Germany.
- Parameters:
vocab_file (str)
bpe_file (str)
seq_postfix (list[int]|None) – labels will be added to the seq in self.get_seq
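To illustrate what on-the-fly BPE encoding means, here is a minimal greedy merge sketch in the spirit of Sennrich et al. (2016). The merge table and word are made up for illustration; the real class reads its merge operations from bpe_file and this is not its actual implementation:

```python
def apply_bpe(word, merges):
    """Greedily apply BPE merge rules, highest-priority (lowest-rank) pair first."""
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        candidates = [p for p in pairs if p in merges]
        if not candidates:
            break  # no learned merge applies any more
        best = min(candidates, key=merges.__getitem__)
        i = pairs.index(best)
        symbols[i:i + 2] = [best[0] + best[1]]  # merge the pair into one symbol
    return symbols

merges = {("l", "o"): 0, ("lo", "w"): 1}  # rank: lower = learned earlier = merged first
print(apply_bpe("low", merges))  # ['low']
```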
- class returnn.datasets.util.vocabulary.SamplingBytePairEncoding(vocab_file, breadth_prob, seq_postfix=None, **kwargs)[source]¶
Vocab based on Byte-Pair-Encoding (BPE). Like BytePairEncoding, but here we randomly sample from the different possible BPE splits. This will encode the text on-the-fly with BPE.
- Parameters:
vocab_file (str)
breadth_prob (float)
seq_postfix (list[int]|None) – labels will be added to the seq in self.get_seq
- class returnn.datasets.util.vocabulary.SentencePieces(**opts)[source]¶
Uses the SentencePiece software, which supports different kinds of subword units (including BPE, unigram, …).
https://github.com/google/sentencepiece/ https://github.com/google/sentencepiece/tree/master/python
Dependency:
pip3 install --user sentencepiece
- Parameters:
model_file (str) – The sentencepiece model file path.
model_proto (str) – The sentencepiece model serialized proto.
out_type (type) – output type. int or str. (Default = int)
add_bos (bool) – Add <s> to the result (Default = false)
add_eos (bool) – Add </s> to the result (Default = false). <s>/</s> is added after reversing (if enabled).
reverse (bool) – Reverses the tokenized sequence (Default = false)
enable_sampling (bool) – (Default = false)
nbest_size (int) –
sampling parameters for unigram; invalid for BPE-dropout. nbest_size = {0,1}: no sampling is performed. nbest_size > 1: samples from the nbest_size results. nbest_size < 0: (default) assumes that nbest_size is infinite and samples from all hypotheses (lattice) using the forward-filtering-and-backward-sampling algorithm.
alpha (float) – Smoothing parameter for unigram sampling, and dropout probability of merge operations for BPE-dropout. (Default = 0.1)
control_symbols (dict[str,str|int]|None) – https://github.com/google/sentencepiece/blob/master/doc/special_symbols.md
user_defined_symbols (dict[str,str|int]|None) – https://github.com/google/sentencepiece/blob/master/doc/special_symbols.md
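A usage sketch as a config fragment, following the parameter list above. The model path is a placeholder, and the exact place such a dict is passed (e.g. as a dataset's vocab option) depends on your RETURNN setup:

```python
# hypothetical config fragment; "spm.model" is a placeholder path
vocab_opts = {
    "class": "SentencePieces",
    "model_file": "spm.model",
    "enable_sampling": True,  # subword regularization during training
    "nbest_size": -1,         # sample from the full lattice (default)
    "alpha": 0.1,             # smoothing parameter for unigram sampling
}
```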
- id_to_label(idx: int, default: str | ~typing.Type[KeyError] | None = <class 'KeyError'>) → str | None[source]¶
- Parameters:
idx
default
- class returnn.datasets.util.vocabulary.CharacterTargets(vocab_file, seq_postfix=None, unknown_label='@', labels=None, **kwargs)[source]¶
Uses characters as target labels. Also see Utf8ByteTargets.
- Parameters:
vocab_file (str|None)
seq_postfix (list[int]|None) – labels will be added to the seq in self.get_seq
unknown_label (str|None)
labels (list[str]|None)
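The character-level encoding with the '@' fallback can be sketched with plain Python (an illustration of the idea, not the class's actual code):

```python
def encode_chars(sentence, labels, unknown_label="@"):
    """Map each character to its index in labels; unknown chars fall back to unknown_label."""
    char_to_id = {c: i for i, c in enumerate(labels)}
    unk = char_to_id.get(unknown_label)
    return [char_to_id.get(c, unk) for c in sentence]

labels = ["@", "a", "b", "c", " "]
print(encode_chars("abc x", labels))  # [1, 2, 3, 4, 0] -- 'x' maps to '@' (id 0)
```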
- class returnn.datasets.util.vocabulary.Utf8ByteTargets(seq_postfix=None)[source]¶
Uses bytes as target labels from UTF-8 encoded text. All bytes (0-255) are allowed. Also see CharacterTargets.
- Parameters:
seq_postfix (list[int]|None) – labels will be added to the seq in self.get_seq
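The idea can be shown with plain Python: every string maps to its sequence of UTF-8 byte values in 0-255, so no vocab file is needed (this illustrates the concept, not the class's actual code):

```python
def utf8_byte_ids(text, seq_postfix=None):
    """Encode text as UTF-8 and use the raw byte values (0-255) as label ids."""
    return list(text.encode("utf-8")) + (seq_postfix or [])

print(utf8_byte_ids("hi"))  # [104, 105]
print(utf8_byte_ids("ä"))   # [195, 164] -- one character, two UTF-8 bytes
```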