`returnn.datasets.util.vocabulary`¶

Vocabulary related classes for targets such as BPE, SentencePieces etc…

Represents a vocabulary (set of words, and their ids). Used by BytePairEncoding.

Parameters:

vocab_file
special_symbols_via_file – if given, the file is supposed to contain a dict with potential keys “unknown_label”, “bos_label”, “eos_label”, “pad_label”, “control_symbols”, “user_defined_symbols”. When labels are specified directly as kwargs, those take precedence over any option in the file.
unknown_label – e.g. “UNK” or “<unk>”
bos_label – e.g. “<s>”
eos_label – e.g. “</s>”
pad_label – e.g. “<pad>”
control_symbols – https://github.com/google/sentencepiece/blob/master/doc/special_symbols.md
user_defined_symbols – https://github.com/google/sentencepiece/blob/master/doc/special_symbols.md
num_labels – just for verification
seq_postfix – labels will be added to the seq in self.get_seq
labels
single_whitespace_split – Assume that the given text is encoded using " ".join(labels[i] for i in seq), and this will undo that. This makes a difference when there is whitespace itself in the vocab (in labels). If not enabled (the default), this will simply use str.split().
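
The difference between the two splitting modes can be illustrated with plain Python string methods. This is a minimal sketch and not RETURNN's own implementation; the labels are hypothetical, chosen so that a whitespace character (a tab) is itself a vocab entry.

```python
# Hypothetical labels; note that the tab character is itself a label.
labels = ["a", "\t", "b"]
text = " ".join(labels)  # "a \t b"

# Default behaviour described above: str.split() treats any run of
# whitespace as a separator, so the tab label is lost.
default_split = text.split()

# Splitting on exactly one space keeps the whitespace label intact.
single_split = text.split(" ")

print(default_split)  # ['a', 'b']
print(single_split)   # ['a', '\t', 'b']
```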

classmethod create_vocab(**opts)[source]¶

Parameters:: opts – kwargs for class
Return type:: Vocabulary|BytePairEncoding|CharacterTargets

set_random_seed(seed: int)[source]¶

This can be called for a new epoch or so. Usually it has no effect, as there is no randomness. However, some vocab class could introduce some sampling process.

Parameters:: seed

classmethod create_vocab_dict_from_labels(labels)[source]¶

This is exactly the format which we expect when we read it in self._parse_vocab.

Parameters:: labels (list[str])
Return type:: dict[str,int]
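
A minimal sketch of such a label-to-index dict, assuming each label is simply mapped to its position in the list. The helper name `vocab_dict_from_labels` is hypothetical and stands in for the classmethod so that the example runs without RETURNN.

```python
from typing import Dict, List


def vocab_dict_from_labels(labels: List[str]) -> Dict[str, int]:
    """Map each label to its index in the list (assumed layout)."""
    return {label: idx for idx, label in enumerate(labels)}


vocab = vocab_dict_from_labels(["<s>", "</s>", "hello", "world"])
print(vocab["hello"])  # 2
```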

classmethod create_vocab_from_labels(labels, **kwargs)[source]¶

Creates a Vocabulary from the given labels. Depending on whether the labels are identified as bytes, characters or words, a Utf8ByteTargets, CharacterTargets or Vocabulary vocab is created, respectively.

Parameters:: labels (list[str])
Return type:: Vocabulary

tf_get_init_variable_func(var)[source]¶

Parameters:: var (tensorflow.Variable)
Return type:: (tensorflow.Session)->None

Parameters:

label
default
allow_none – whether label can be None. In this case, None is returned.

label_to_id(label: str, default: int | ~typing.Type[KeyError] | None = <class 'KeyError'>) → int | None[source]¶

Parameters:

label
default
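
The signature suggests that the exception class KeyError itself serves as the sentinel default. A minimal sketch of that pattern follows, assuming an unknown label raises KeyError unless some other default is passed. The dict `_vocab` and the standalone function are hypothetical stand-ins, not the RETURNN method.

```python
from typing import Dict, Optional, Type, Union

_vocab: Dict[str, int] = {"<unk>": 0, "hello": 1}  # hypothetical vocab


def label_to_id(
    label: str, default: Union[int, Type[KeyError], None] = KeyError
) -> Optional[int]:
    """Look up a label; raise KeyError only if no other default was given."""
    idx = _vocab.get(label)
    if idx is not None:
        return idx
    if default is KeyError:
        raise KeyError(f"label {label!r} not found")
    return default


print(label_to_id("hello"))        # 1
print(label_to_id("xyz", None))    # None
print(label_to_id("xyz", 0))       # 0
```

Using the exception class as the default keeps None available as a legitimate return value for callers who prefer a soft failure.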

id_to_label(idx: int, default: str | ~typing.Type[KeyError] | None = <class 'KeyError'>) → str | None[source]¶

Parameters:

idx
default

is_id_valid(idx: int) → bool[source]¶

Parameters:: idx

property labels: List[str][source]¶: list of labels

get_seq(sentence: str) → List[int][source]¶

Parameters:: sentence – assumed to be seq of vocab entries separated by whitespace
Returns:: seq of label indices

get_seq_indices(seq: List[str]) → List[int][source]¶

Parameters:: seq – seq of labels (entries in vocab)
Returns:: seq of label indices, returns unknown_label_id if unknown_label is set

get_seq_labels(seq: List[int] | ndarray) → str[source]¶

Inverse of get_seq().

Parameters:: seq – 1D sequence of label indices
Returns:: serialized sequence string, such that get_seq(get_seq_labels(seq)) == seq
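
The round-trip property can be checked on a toy vocabulary. This is a minimal sketch with hypothetical labels and standalone functions. It assumes whitespace-separated labels and a fallback to the unknown label id, as described for get_seq() and get_seq_indices() above.

```python
from typing import Dict, List

labels: List[str] = ["<unk>", "hello", "world"]  # hypothetical vocab
vocab: Dict[str, int] = {label: i for i, label in enumerate(labels)}
unknown_label_id = vocab["<unk>"]


def get_seq(sentence: str) -> List[int]:
    """Whitespace-separated labels -> indices; unknown labels map to <unk>."""
    return [vocab.get(label, unknown_label_id) for label in sentence.split()]


def get_seq_labels(seq: List[int]) -> str:
    """Indices -> whitespace-separated labels."""
    return " ".join(labels[i] for i in seq)


seq = [1, 2, 1]
assert get_seq(get_seq_labels(seq)) == seq
print(get_seq("hello there world"))  # [1, 0, 2]
```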

serialize_labels(data: ndarray) → str[source]¶

Like get_seq_labels() but a bit more generic, to not just work on sequences, but any shape.

Also like Dataset.serialize_data() but even slightly more generic.

class returnn.datasets.util.vocabulary.BytePairEncoding(vocab_file, bpe_file, seq_postfix=None, **kwargs)[source]¶

Vocab based on Byte-Pair-Encoding (BPE). This will encode the text on-the-fly with BPE.

Reference: Rico Sennrich, Barry Haddow and Alexandra Birch (2016). Neural Machine Translation of Rare Words with Subword Units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). Berlin, Germany.

Parameters:

vocab_file (str)
bpe_file (str)
seq_postfix (list[int]|None) – labels will be added to the seq in self.get_seq

get_seq(sentence)[source]¶

Parameters:: sentence (str)
Return type:: list[int]
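
To illustrate what on-the-fly BPE encoding involves, here is a minimal sketch of applying ranked merge operations to a single word, in the style of subword-nmt. It is not RETURNN's implementation. The merge table, the `</w>` end-of-word marker and the function name are assumptions made for the example, and the result is a list of subword strings rather than label indices.

```python
from typing import Dict, List, Tuple

END = "</w>"  # end-of-word marker (assumed, as in subword-nmt)


def bpe_encode_word(word: str, merge_ranks: Dict[Tuple[str, str], int]) -> List[str]:
    """Repeatedly apply the best-ranked applicable merge to the word."""
    if not word:
        return []
    symbols = list(word[:-1]) + [word[-1] + END]
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: merge_ranks.get(p, float("inf")))
        if best not in merge_ranks:
            break  # no known merge applies any more
        merged: List[str] = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    # Mark non-final pieces with "@@" and drop the end-of-word marker.
    return [s + "@@" for s in symbols[:-1]] + [symbols[-1][: -len(END)]]


# Hypothetical merge table: lower rank means higher priority.
merge_ranks = {("l", "o"): 0, ("lo", "w" + END): 1}
pieces = [p for w in "low lot".split() for p in bpe_encode_word(w, merge_ranks)]
print(pieces)  # ['low', 'lo@@', 't']
```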

class returnn.datasets.util.vocabulary.SamplingBytePairEncoding(vocab_file: str, breadth_prob: float, seq_postfix: ~typing.List[int] | None = None, label_postfix_merge_symbol: str | None = <class 'returnn.util.basic.NotSpecified'>, word_prefix_symbol: str | None = <class 'returnn.util.basic.NotSpecified'>, **kwargs)[source]¶

Vocab based on Byte-Pair-Encoding (BPE). Like BytePairEncoding, but here we randomly sample from different possible BPE splits. This will encode the text on-the-fly with BPE.

Parameters:

vocab_file
breadth_prob
seq_postfix – labels will be added to the seq in self.get_seq
label_postfix_merge_symbol – If given, will use this as label postfix merge symbol, i.e. when this occurs at the end of a label, it is supposed to be merged with the next label, i.e. the space between them is removed and is not a word boundary. If None, will not use any postfix merge symbol. If not specified, and also word_prefix_symbol is not specified, will use “@@” by default here, the standard from subword-nmt, and our original behavior.
word_prefix_symbol – If given, every new word starts with this symbol. This also implies that there are no spaces between words and this symbol is a placeholder for the space. If None, will not use this logic. For SentencePiece, you usually would use “▁” here.
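
The two conventions can be compared by merging subword labels back into plain text. The function below is a hypothetical helper written for illustration, not part of RETURNN. It follows the descriptions above: a postfix merge symbol glues a label to the next one, while a word prefix symbol is a placeholder for the space.

```python
from typing import List, Optional


def merge_labels(
    labels: List[str],
    label_postfix_merge_symbol: Optional[str] = None,
    word_prefix_symbol: Optional[str] = None,
) -> str:
    """Join subword labels into text under one of the two conventions."""
    if word_prefix_symbol is not None:
        # No spaces between labels; the prefix symbol stands in for the space.
        return "".join(labels).replace(word_prefix_symbol, " ").strip()
    text = " ".join(labels)
    if label_postfix_merge_symbol is not None:
        # A label ending in the merge symbol is glued to the next label.
        text = text.replace(label_postfix_merge_symbol + " ", "")
    return text


print(merge_labels(["he@@", "llo", "wor@@", "ld"], label_postfix_merge_symbol="@@"))
# hello world
print(merge_labels(["▁he", "llo", "▁wor", "ld"], word_prefix_symbol="▁"))
# hello world
```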

set_random_seed(seed)[source]¶

Parameters:: seed (int)

get_seq(sentence)[source]¶

Parameters:: sentence (str)
Return type:: list[int]

class returnn.datasets.util.vocabulary.SentencePieces(**opts)[source]¶

Uses the SentencePiece software, which supports different kinds of subword units (including BPE, unigram, …).

https://github.com/google/sentencepiece/ https://github.com/google/sentencepiece/tree/master/python

Dependency:

pip3 install --user sentencepiece

Parameters:

model_file (str) – The sentencepiece model file path.
model_proto (str) – The sentencepiece model serialized proto.
out_type (type) – output type. int or str. (Default = int)
add_bos (bool) – Add <s> to the result (Default = false)
add_eos (bool) – Add </s> to the result (Default = false). <s>/</s> is added after reversing (if enabled).
reverse (bool) – Reverses the tokenized sequence (Default = false)
enable_sampling (bool) – (Default = false)
nbest_size (int) – sampling parameters for unigram. Invalid for BPE-Dropout. nbest_size = {0,1}: No sampling is performed. nbest_size > 1: samples from the nbest_size results. nbest_size < 0: (Default) assumes that nbest_size is infinite and samples from all hypotheses (lattice) using the forward-filtering-and-backward-sampling algorithm.
alpha (float) – Smoothing parameter for unigram sampling, and dropout probability of merge operations for BPE-dropout. (Default = 0.1)
control_symbols (dict[str,str|int]|None) – https://github.com/google/sentencepiece/blob/master/doc/special_symbols.md
user_defined_symbols (dict[str,str|int]|None) – https://github.com/google/sentencepiece/blob/master/doc/special_symbols.md
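
As an illustration, the options could be collected in a dict like the following. This is a sketch using only the parameter names listed above. The model path is a placeholder, and the chosen values are assumptions for a unigram model with sampling enabled.

```python
# Hypothetical options; the model path is a placeholder.
sentencepiece_opts = {
    "model_file": "/path/to/spm.model",  # placeholder path
    "add_eos": True,          # append </s> to each sequence
    "enable_sampling": True,  # sample segmentations instead of using the best one
    "nbest_size": -1,         # sample from the full lattice (unigram only)
    "alpha": 0.1,             # smoothing parameter (unigram) or dropout prob (BPE-dropout)
}

# With returnn and sentencepiece installed, these would be passed as kwargs:
#   vocab = SentencePieces(**sentencepiece_opts)
```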

property labels: List[str][source]¶: list of labels

is_id_valid(idx: int) → bool[source]¶

Parameters:: idx

id_to_label(idx: int, default: str | ~typing.Type[KeyError] | None = <class 'KeyError'>) → str | None[source]¶

Parameters:

idx
default

label_to_id(label: str, default: int | ~typing.Type[KeyError] | None = <class 'KeyError'>) → int | None[source]¶

Parameters:

label
default

set_random_seed(seed: int)[source]¶

Parameters:: seed

get_seq(sentence: str) → List[int][source]¶

Parameters:: sentence – assumed to be seq of vocab entries separated by whitespace

class returnn.datasets.util.vocabulary.CharacterTargets(vocab_file, seq_postfix=None, unknown_label='@', labels=None, **kwargs)[source]¶

Uses characters as target labels. Also see Utf8ByteTargets.

Parameters:

vocab_file (str|None)
seq_postfix (list[int]|None) – labels will be added to the seq in self.get_seq
unknown_label (str|None)
labels (list[str]|None)

get_seq(sentence)[source]¶

Parameters:: sentence (str)
Return type:: list[int]

get_seq_labels(seq)[source]¶

Parameters:: seq (list[int]|numpy.ndarray) – 1D sequence
Return type:: str
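
A minimal sketch of character-level targets on a toy vocabulary. It assumes that every character of the sentence is one label, that unknown characters map to the unknown label, that seq_postfix is appended by get_seq, and that get_seq_labels joins labels without separators. The labels and standalone functions are hypothetical.

```python
from typing import Dict, List

labels: List[str] = ["@", "a", "b", "c", "</s>"]  # hypothetical; "@" is the unknown label
vocab: Dict[str, int] = {label: i for i, label in enumerate(labels)}
unknown_label_id = vocab["@"]
seq_postfix: List[int] = [vocab["</s>"]]


def get_seq(sentence: str) -> List[int]:
    """Each character is one label; the postfix is appended at the end."""
    return [vocab.get(ch, unknown_label_id) for ch in sentence] + seq_postfix


def get_seq_labels(seq: List[int]) -> str:
    """Concatenate the labels without separators."""
    return "".join(labels[i] for i in seq)


print(get_seq("abz"))             # [1, 2, 0, 4]
print(get_seq_labels([1, 2, 3]))  # abc
```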

class returnn.datasets.util.vocabulary.Utf8ByteTargets(seq_postfix=None, **opts)[source]¶

Uses bytes as target labels from UTF8 encoded text. All bytes (0-255) are allowed. Also see CharacterTargets.

Parameters:: seq_postfix (list[int]|None) – labels will be added to the seq in self.get_seq

get_seq(sentence)[source]¶

Parameters:: sentence (str)
Return type:: list[int]

get_seq_labels(seq)[source]¶

Parameters:: seq (list[int]|numpy.ndarray) – 1D sequence
Return type:: str
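
The byte-level mapping can be sketched with Python's built-in UTF-8 codec. The function names are hypothetical, and seq_postfix handling is left out.

```python
from typing import List


def utf8_get_seq(sentence: str) -> List[int]:
    """Text -> list of byte values (0-255) of its UTF-8 encoding."""
    return list(sentence.encode("utf8"))


def utf8_get_seq_labels(seq: List[int]) -> str:
    """Byte values -> text, decoding the bytes as UTF-8."""
    return bytes(seq).decode("utf8")


ids = utf8_get_seq("héllo")
print(ids)                       # [104, 195, 169, 108, 108, 111]
print(utf8_get_seq_labels(ids))  # héllo
```

Non-ASCII characters take more than one byte, so the sequence can be longer than the number of characters in the text.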

class returnn.datasets.util.vocabulary.HuggingFaceTokenizer(*, huggingface_repo_dir: str | None = None, tokenizer: transformers.PreTrainedTokenizerBase | None = None, map_bos_to_eos: bool = False, text_preprocessing: Callable[[str], str] | None = None, bpe_dropout: float = 0.0)[source]¶

Uses the AutoTokenizer class from the transformers package.

Parameters:

huggingface_repo_dir – the directory containing the tokenizer_config.json file.
tokenizer – if given, will use this tokenizer directly. Otherwise, will load it from huggingface_repo_dir.
map_bos_to_eos
text_preprocessing – applied in get_seq() (sentence -> ids)

property labels: List[str][source]¶: list of labels

is_id_valid(idx: int) → bool[source]¶

Parameters:: idx

id_to_label(idx: int, default: str | ~typing.Type[KeyError] | None = <class 'KeyError'>) → str | None[source]¶

Parameters:

idx
default

label_to_id(label: str, default: int | ~typing.Type[KeyError] | None = <class 'KeyError'>) → int | None[source]¶

Parameters:

label
default

get_seq(sentence: str) → List[int][source]¶

Parameters:: sentence – assumed to be seq of vocab entries separated by whitespace

get_seq_labels(seq)[source]¶

Parameters:: seq (list[int]|numpy.ndarray) – 1D sequence
Return type:: str
