returnn.datasets.util.vocabulary
¶
Vocabulary-related classes for targets such as BPE, SentencePieces, etc.
- class returnn.datasets.util.vocabulary.Vocabulary(vocab_file, seq_postfix=None, unknown_label='UNK', bos_label=None, eos_label=None, pad_label=None, control_symbols=None, user_defined_symbols=None, num_labels=None, labels=None)[source]¶
Represents a vocabulary (a set of labels and their ids). Used by BytePairEncoding.
- Parameters:
vocab_file (str|None)
unknown_label (str|int|None) – e.g. “UNK” or “<unk>”
bos_label (str|int|None) – e.g. “<s>”
eos_label (str|int|None) – e.g. “</s>”
pad_label (str|int|None) – e.g. “<pad>”
control_symbols (dict[str,str|int]|None) – https://github.com/google/sentencepiece/blob/master/doc/special_symbols.md
user_defined_symbols (dict[str,str|int]|None) – https://github.com/google/sentencepiece/blob/master/doc/special_symbols.md
num_labels (int) – just for verification
seq_postfix (list[int]|None) – labels will be added to the seq in self.get_seq
labels (list[str]|(()->list[str])|None)
- set_random_seed(seed: int)[source]¶
This can be called for a new epoch or similar. Usually it has no effect, as there is no randomness; however, some vocab classes (e.g. SamplingBytePairEncoding) introduce a sampling process.
- Parameters:
seed
- classmethod create_vocab_dict_from_labels(labels)[source]¶
This is exactly the format which we expect when we read it in self._parse_vocab.
- Parameters:
labels (list[str])
- Return type:
dict[str,int]
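As a sketch of the returned format (assuming labels are simply numbered in order of appearance, which is the natural reading of the docstring but not stated explicitly):

```python
def create_vocab_dict_from_labels(labels):
    """Map each label string to its position index; sketch of the dict[str,int] format."""
    return {label: i for i, label in enumerate(labels)}

vocab_dict = create_vocab_dict_from_labels(["<s>", "</s>", "hello", "world"])
# e.g. {"<s>": 0, "</s>": 1, "hello": 2, "world": 3}
```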
- classmethod create_vocab_from_labels(labels, **kwargs)[source]¶
Creates a Vocabulary from the given labels. Depending on whether the labels are identified as bytes, characters, or words, a Utf8ByteTargets, CharacterTargets, or Vocabulary instance is created.
- Parameters:
labels (list[str])
- Return type:
Vocabulary
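The dispatch described above could be sketched with a heuristic like the following; the actual criteria RETURNN uses to identify bytes vs. characters may differ, so treat this only as an illustration:

```python
def classify_labels(labels):
    """Heuristic dispatch sketch: decide which vocab class fits the labels."""
    if len(labels) == 256 and all(len(l) == 1 for l in labels):
        return "Utf8ByteTargets"   # looks like the 256 possible byte values
    if all(len(l) == 1 for l in labels):
        return "CharacterTargets"  # single characters
    return "Vocabulary"            # generic word/subword labels

print(classify_labels(["hello", "world"]))  # Vocabulary
```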
- tf_get_init_variable_func(var)[source]¶
- Parameters:
var (tensorflow.Variable)
- Return type:
(tensorflow.Session)->None
- to_id(label: str | int | None, default: str | ~typing.Type[KeyError] | None = <class 'KeyError'>, allow_none: bool = False) → int | None[source]¶
- Parameters:
label
default
allow_none – whether label can be None; in that case, None is returned
- label_to_id(label: str, default: int | ~typing.Type[KeyError] | None = <class 'KeyError'>) → int | None[source]¶
- Parameters:
label
default
- id_to_label(idx: int, default: str | ~typing.Type[KeyError] | None = <class 'KeyError'>) → str | None[source]¶
- Parameters:
idx
default
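The default handling shared by these lookup methods can be sketched as follows. This is an illustration of the documented semantics, not the actual implementation: passing the KeyError class (the default) raises on a missing label, while any other default value is returned instead.

```python
def label_to_id_sketch(vocab: dict, label: str, default=KeyError):
    """Look up a label id; default=KeyError raises, any other default is returned."""
    if label in vocab:
        return vocab[label]
    if default is KeyError:
        raise KeyError(f"label {label!r} not in vocab")
    return default

vocab = {"hello": 0, "world": 1, "UNK": 2}
print(label_to_id_sketch(vocab, "hello"))              # 0
print(label_to_id_sketch(vocab, "xyz", default=None))  # None
```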
- get_seq(sentence: str) → List[int][source]¶
- Parameters:
sentence – assumed to be seq of vocab entries separated by whitespace
- Returns:
seq of label indices
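Based on the parameter descriptions above, the behavior can be sketched like this (a simplification; the unknown-label handling here is an assumption, and subword vocabs like BPE do more than a plain lookup):

```python
def get_seq_sketch(vocab: dict, sentence: str, seq_postfix=None, unknown_id=None):
    """Whitespace-split a sentence into vocab ids, then append seq_postfix."""
    seq = [vocab.get(w, unknown_id) for w in sentence.split()]
    return seq + (seq_postfix or [])

vocab = {"hello": 0, "world": 1}
print(get_seq_sketch(vocab, "hello world", seq_postfix=[2]))  # [0, 1, 2]
```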
- class returnn.datasets.util.vocabulary.BytePairEncoding(vocab_file, bpe_file, seq_postfix=None, **kwargs)[source]¶
Vocab based on Byte-Pair-Encoding (BPE). This will encode the text on-the-fly with BPE.
Reference: Rico Sennrich, Barry Haddow and Alexandra Birch (2016). Neural Machine Translation of Rare Words with Subword Units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). Berlin, Germany.
- Parameters:
vocab_file (str)
bpe_file (str)
seq_postfix (list[int]|None) – labels will be added to the seq in self.get_seq
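To illustrate what on-the-fly BPE encoding means, here is a minimal greedy merge sketch in the spirit of Sennrich et al. (2016). The merge table and word are made up for illustration; the real class reads its merge operations from bpe_file and this is not its actual implementation:

```python
def apply_bpe(word, merges):
    """Greedily apply BPE merge rules, highest-priority (lowest-rank) pair first."""
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        candidates = [p for p in pairs if p in merges]
        if not candidates:
            break  # no learned merge applies any more
        best = min(candidates, key=merges.__getitem__)
        i = pairs.index(best)
        symbols[i:i + 2] = [best[0] + best[1]]  # merge the pair into one symbol
    return symbols

merges = {("l", "o"): 0, ("lo", "w"): 1}  # rank: lower = learned earlier = merged first
print(apply_bpe("low", merges))  # ['low']
```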
- class returnn.datasets.util.vocabulary.SamplingBytePairEncoding(vocab_file, breadth_prob, seq_postfix=None, **kwargs)[source]¶
Vocab based on Byte-Pair-Encoding (BPE). Like BytePairEncoding, but here we randomly sample from the different possible BPE splits. This will encode the text on-the-fly with BPE.
- Parameters:
vocab_file (str)
breadth_prob (float)
seq_postfix (list[int]|None) – labels will be added to the seq in self.get_seq
- class returnn.datasets.util.vocabulary.SentencePieces(**opts)[source]¶
Uses the SentencePiece software, which supports different kinds of subword units (including BPE, unigram, …).
https://github.com/google/sentencepiece/ https://github.com/google/sentencepiece/tree/master/python
Dependency:
pip3 install --user sentencepiece
- Parameters:
model_file (str) – The sentencepiece model file path.
model_proto (str) – The sentencepiece model serialized proto.
out_type (type) – output type. int or str. (Default = int)
add_bos (bool) – Add <s> to the result (Default = false)
add_eos (bool) – Add </s> to the result (Default = false). <s>/</s> is added after reversing (if enabled).
reverse (bool) – Reverses the tokenized sequence (Default = false)
enable_sampling (bool) – (Default = false)
nbest_size (int) –
sampling parameters for unigram; invalid for BPE-dropout. nbest_size = {0,1}: no sampling is performed. nbest_size > 1: samples from the nbest_size results. nbest_size < 0: (default) assumes that nbest_size is infinite and samples from all hypotheses (lattice) using the forward-filtering-and-backward-sampling algorithm.
alpha (float) – Smoothing parameter for unigram sampling, and dropout probability of merge operations for BPE-dropout. (Default = 0.1)
control_symbols (dict[str,str|int]|None) – https://github.com/google/sentencepiece/blob/master/doc/special_symbols.md
user_defined_symbols (dict[str,str|int]|None) – https://github.com/google/sentencepiece/blob/master/doc/special_symbols.md
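A usage sketch as a config fragment, following the parameter list above. The model path is a placeholder, and the exact place such a dict is passed (e.g. as a dataset's vocab option) depends on your RETURNN setup:

```python
# hypothetical config fragment; "spm.model" is a placeholder path
vocab_opts = {
    "class": "SentencePieces",
    "model_file": "spm.model",
    "enable_sampling": True,  # subword regularization during training
    "nbest_size": -1,         # sample from the full lattice (default)
    "alpha": 0.1,             # smoothing parameter for unigram sampling
}
```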
- id_to_label(idx: int, default: str | ~typing.Type[KeyError] | None = <class 'KeyError'>) → str | None[source]¶
- Parameters:
idx
default
- class returnn.datasets.util.vocabulary.CharacterTargets(vocab_file, seq_postfix=None, unknown_label='@', labels=None, **kwargs)[source]¶
Uses characters as target labels. Also see Utf8ByteTargets.
- Parameters:
vocab_file (str|None)
seq_postfix (list[int]|None) – labels will be added to the seq in self.get_seq
unknown_label (str|None)
labels (list[str]|None)
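The character-level encoding with the '@' fallback can be sketched with plain Python (an illustration of the idea, not the class's actual code):

```python
def encode_chars(sentence, labels, unknown_label="@"):
    """Map each character to its index in labels; unknown chars fall back to unknown_label."""
    char_to_id = {c: i for i, c in enumerate(labels)}
    unk = char_to_id.get(unknown_label)
    return [char_to_id.get(c, unk) for c in sentence]

labels = ["@", "a", "b", "c", " "]
print(encode_chars("abc x", labels))  # [1, 2, 3, 4, 0] -- 'x' maps to '@' (id 0)
```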
- class returnn.datasets.util.vocabulary.Utf8ByteTargets(seq_postfix=None)[source]¶
Uses bytes as target labels from UTF-8 encoded text. All bytes (0-255) are allowed. Also see CharacterTargets.
- Parameters:
seq_postfix (list[int]|None) – labels will be added to the seq in self.get_seq
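The idea can be shown with plain Python: every string maps to its sequence of UTF-8 byte values in 0-255, so no vocab file is needed (this illustrates the concept, not the class's actual code):

```python
def utf8_byte_ids(text, seq_postfix=None):
    """Encode text as UTF-8 and use the raw byte values (0-255) as label ids."""
    return list(text.encode("utf-8")) + (seq_postfix or [])

print(utf8_byte_ids("hi"))  # [104, 105]
print(utf8_byte_ids("ä"))   # [195, 164] -- one character, two UTF-8 bytes
```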