returnn.datasets.util.vocabulary

Vocabulary-related classes for targets such as BPE, SentencePiece, etc.

class returnn.datasets.util.vocabulary.Vocabulary(vocab_file, seq_postfix=None, unknown_label='UNK', bos_label=None, eos_label=None, pad_label=None, control_symbols=None, user_defined_symbols=None, num_labels=None, labels=None)

Represents a vocabulary (a set of words and their ids). Used by BytePairEncoding.

Parameters:
classmethod create_vocab(**opts)
Parameters:

opts – kwargs for class

Return type:

Vocabulary|BytePairEncoding|CharacterTargets
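
A minimal sketch of how a factory such as create_vocab might choose among the three return types. The dispatch rule (an explicit "class" key, then a bpe_file fallback) is an assumption made for illustration and is not stated on this page; the Toy* classes are hypothetical stand-ins and do not import RETURNN.

```python
# Hypothetical stand-ins; the real classes live in returnn.datasets.util.vocabulary.
class ToyVocabulary:
    def __init__(self, **opts):
        self.opts = opts

    @classmethod
    def create_vocab(cls, **opts):
        """Pick a vocabulary class from the options (assumed dispatch rule)."""
        opts = dict(opts)  # do not mutate the caller's dict
        registry = {
            c.__name__: c
            for c in (ToyVocabulary, ToyBytePairEncoding, ToyCharacterTargets)
        }
        if "class" in opts:  # assumption: an explicit class name wins
            target = registry[opts.pop("class")]
        elif "bpe_file" in opts:  # assumption: BPE options imply the BPE class
            target = ToyBytePairEncoding
        else:
            target = cls
        return target(**opts)


class ToyBytePairEncoding(ToyVocabulary):
    pass


class ToyCharacterTargets(ToyVocabulary):
    pass


vocab = ToyVocabulary.create_vocab(vocab_file="vocab.txt", bpe_file="codes.bpe")
print(type(vocab).__name__)  # ToyBytePairEncoding
```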

set_random_seed(seed)

This can be called at the start of a new epoch, for example. It usually has no effect because there is no randomness, but a vocabulary class may introduce a sampling process.

Parameters:

seed (int) –

classmethod create_vocab_dict_from_labels(labels)

Returns a dict in exactly the format that self._parse_vocab expects when it reads a vocabulary.

Parameters:

labels (list[str]) –

Return type:

dict[str,int]
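
As an illustration of the list[str] to dict[str,int] conversion, the sketch below assigns each label its list position as id. That id assignment is an assumption; vocab_dict_from_labels is a hypothetical stand-alone helper, not the RETURNN method itself.

```python
from typing import Dict, List


def vocab_dict_from_labels(labels: List[str]) -> Dict[str, int]:
    """Map each label to its position in the list (assumed id assignment)."""
    return {label: idx for idx, label in enumerate(labels)}


vocab = vocab_dict_from_labels(["<s>", "</s>", "UNK", "hello", "world"])
print(vocab["hello"])  # 3
```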

classmethod create_vocab_from_labels(labels, **kwargs)

Creates a Vocabulary from the given labels. Depending on whether the labels are identified as bytes, characters, or words, a Utf8ByteTargets, CharacterTargets, or Vocabulary instance is created.

Parameters:

labels (list[str]) –

Return type:

Vocabulary

tf_get_init_variable_func(var)
Parameters:

var (tensorflow.Variable) –

Return type:

(tensorflow.Session)->None

to_id(label, default=<class 'KeyError'>, allow_none=False)
Parameters:
  • label (str|int|None) –

  • default (str|type[KeyError]|None) –

  • allow_none (bool) – whether label can be None; in that case, None is returned

Return type:

int|None

label_to_id(label, default=<class 'KeyError'>)
Parameters:
  • label (str) –

  • default (int|type[KeyError]|None) –

Return type:

int|None

id_to_label(idx, default=<class 'KeyError'>)
Parameters:
  • idx (int) –

  • default (str|type[KeyError]|None) –

Return type:

str|None
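
The default=<class 'KeyError'> signatures suggest a sentinel convention: passing the KeyError class itself means "raise on a miss", while any other value is returned as the fallback. The sketch below shows that pattern with a hypothetical ToyVocab class; the exact behaviour of the RETURNN methods is assumed, not confirmed by this page.

```python
from typing import Dict, List, Optional, Type, Union


class ToyVocab:
    """Hypothetical stand-in illustrating the default=KeyError convention."""

    def __init__(self, labels: List[str]):
        self._labels = list(labels)
        self._vocab: Dict[str, int] = {label: i for i, label in enumerate(labels)}

    def is_id_valid(self, idx: int) -> bool:
        return 0 <= idx < len(self._labels)

    def label_to_id(
        self, label: str, default: Union[int, Type[KeyError], None] = KeyError
    ) -> Optional[int]:
        if default is KeyError:
            return self._vocab[label]  # raises KeyError for unknown labels
        return self._vocab.get(label, default)

    def id_to_label(
        self, idx: int, default: Union[str, Type[KeyError], None] = KeyError
    ) -> Optional[str]:
        if self.is_id_valid(idx):
            return self._labels[idx]
        if default is KeyError:
            raise KeyError("idx %i out of range" % idx)
        return default


toy = ToyVocab(["a", "b", "UNK"])
print(toy.label_to_id("b"))  # 1
print(toy.label_to_id("zzz", default=None))  # None
print(toy.id_to_label(7, default="UNK"))  # UNK
```

Using the exception class as the sentinel keeps None available as a legitimate fallback value.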

is_id_valid(idx)
Parameters:

idx (int) –

Return type:

bool

property labels
Return type:

list[str]

get_seq(sentence)
Parameters:

sentence (str) – assumed to be a sequence of vocabulary entries separated by whitespace

Return type:

list[int]

get_seq_indices(seq)
Parameters:

seq (list[str]) –

Return type:

list[int]
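
A minimal sketch of the two lookups described above: get_seq splits a whitespace-separated sentence and delegates to get_seq_indices. Falling back to the unknown label for missing entries, and appending seq_postfix at the end, are assumptions based on the constructor parameters; the free functions here are hypothetical and take the vocab dict explicitly.

```python
from typing import Dict, List, Optional


def get_seq_indices(
    vocab: Dict[str, int], seq: List[str], unknown_label: Optional[str] = "UNK"
) -> List[int]:
    """Map labels to ids; unknown labels fall back to unknown_label if one is set."""
    if unknown_label is None:
        return [vocab[label] for label in seq]  # KeyError on unknown labels
    unk_id = vocab[unknown_label]
    return [vocab.get(label, unk_id) for label in seq]


def get_seq(
    vocab: Dict[str, int], sentence: str, seq_postfix: Optional[List[int]] = None
) -> List[int]:
    """Split on whitespace, map to ids, then append the optional postfix ids."""
    return get_seq_indices(vocab, sentence.split()) + list(seq_postfix or [])


vocab = {"UNK": 0, "hello": 1, "world": 2, "</s>": 3}
print(get_seq(vocab, "hello there world", seq_postfix=[3]))  # [1, 0, 2, 3]
```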

get_seq_labels(seq)
Parameters:

seq (list[int]|numpy.ndarray) – 1D sequence

Return type:

str
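
The reverse direction can be sketched the same way. Joining the labels with single spaces is an assumption that mirrors the whitespace-separated input of get_seq; the helper below is hypothetical and takes the label list explicitly.

```python
from typing import List, Sequence


def get_seq_labels(labels: List[str], seq: Sequence[int]) -> str:
    """Turn a 1D sequence of ids back into a space-separated string."""
    # int(...) also accepts NumPy integer scalars when seq is a 1D array.
    return " ".join(labels[int(idx)] for idx in seq)


labels = ["UNK", "hello", "world"]
print(get_seq_labels(labels, [1, 2]))  # hello world
```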

class returnn.datasets.util.vocabulary.BytePairEncoding(vocab_file, bpe_file, seq_postfix=None, **kwargs)

Vocab based on Byte-Pair-Encoding (BPE). This will encode the text on-the-fly with BPE.

Reference: Rico Sennrich, Barry Haddow and Alexandra Birch (2016). Neural Machine Translation of Rare Words with Subword Units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). Berlin, Germany.

Parameters:
  • vocab_file (str) –

  • bpe_file (str) –

  • seq_postfix (list[int]|None) – labels will be added to the seq in self.get_seq

get_seq(sentence)
Parameters:

sentence (str) –

Return type:

list[int]
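
To make "encode the text on-the-fly with BPE" concrete, here is a simplified sketch of applying learned merges to each word and then looking up ids. It is not RETURNN's implementation: the "</w>" end-of-word marker and the "@@" continuation suffix follow a common BPE tooling convention and are assumptions here, as are the toy merge table and vocabulary.

```python
from typing import Dict, List, Tuple


def bpe_encode_word(word: str, merge_ranks: Dict[Tuple[str, str], int]) -> List[str]:
    """Apply the highest-priority (lowest-rank) merge until none applies.

    Assumes a non-empty word, as produced by str.split().
    """
    symbols = list(word[:-1]) + [word[-1] + "</w>"]  # mark the end of the word
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda pair: merge_ranks.get(pair, float("inf")))
        if best not in merge_ranks:
            break  # no learned merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    # Strip the end marker and tag every non-final piece with "@@".
    symbols[-1] = symbols[-1][: -len("</w>")]
    return [s + "@@" for s in symbols[:-1]] + [symbols[-1]]


def bpe_get_seq(sentence: str, merge_ranks, vocab: Dict[str, int]) -> List[int]:
    """Split on whitespace, BPE-encode each word, and map the pieces to ids."""
    pieces = [p for word in sentence.split() for p in bpe_encode_word(word, merge_ranks)]
    return [vocab[p] for p in pieces]


# Toy data for illustration only.
merge_ranks = {("l", "o"): 0, ("lo", "w</w>"): 1, ("e", "r</w>"): 2}
bpe_vocab = {"low": 0, "lo@@": 1, "w@@": 2, "er": 3}
print(bpe_encode_word("lower", merge_ranks))  # ['lo@@', 'w@@', 'er']
print(bpe_get_seq("low lower", merge_ranks, bpe_vocab))  # [0, 1, 2, 3]
```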

class returnn.datasets.util.vocabulary.SamplingBytePairEncoding(vocab_file, breadth_prob, seq_postfix=None, **kwargs)

Vocab based on Byte-Pair-Encoding (BPE). Like BytePairEncoding, but here we randomly sample from different possible BPE splits. This will encode the text on-the-fly with BPE.

Parameters:
  • vocab_file (str) –

  • breadth_prob (float) –

  • seq_postfix (list[int]|None) – labels will be added to the seq in self.get_seq

set_random_seed(seed)
Parameters:

seed (int) –

get_seq(sentence)
Parameters:

sentence (str) –

Return type:

list[int]
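
The interplay of set_random_seed and sampled splits can be illustrated with a toy segmenter. This is a hypothetical sketch, not RETURNN's search procedure: here breadth_prob is interpreted as the probability of stepping from the longest matching piece to the next shorter one, which is an assumption about the parameter's meaning.

```python
import random
from typing import List, Set


class ToySamplingSegmenter:
    """Hypothetical sampler: usually the longest matching piece, sometimes a shorter one."""

    def __init__(self, pieces: Set[str], breadth_prob: float):
        self.pieces = pieces
        self.breadth_prob = breadth_prob
        self.rng = random.Random(0)

    def set_random_seed(self, seed: int) -> None:
        """Reseed the private RNG, e.g. once per epoch, for reproducible splits."""
        self.rng.seed(seed)

    def segment(self, word: str) -> List[str]:
        out, pos = [], 0
        while pos < len(word):
            # Candidate pieces matching at this position, longest first.
            # Single characters are always allowed, so the loop cannot get stuck.
            cands = [
                word[pos:end]
                for end in range(len(word), pos, -1)
                if end == pos + 1 or word[pos:end] in self.pieces
            ]
            choice = 0
            # With probability breadth_prob, move on to the next shorter candidate.
            while choice + 1 < len(cands) and self.rng.random() < self.breadth_prob:
                choice += 1
            out.append(cands[choice])
            pos += len(cands[choice])
        return out


pieces = {"low", "lo", "er"}
print(ToySamplingSegmenter(pieces, breadth_prob=0.0).segment("lower"))  # ['low', 'er']

sampler = ToySamplingSegmenter(pieces, breadth_prob=0.5)
sampler.set_random_seed(42)
first = sampler.segment("lower")
sampler.set_random_seed(42)
second = sampler.segment("lower")
print(first == second)  # True: the same seed gives the same split
```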

class returnn.datasets.util.vocabulary.SentencePieces(**opts)

Uses the SentencePiece software, which supports different kinds of subword units (including BPE, unigram, …).

https://github.com/google/sentencepiece/
https://github.com/google/sentencepiece/tree/master/python

Dependency:

pip3 install --user sentencepiece
Parameters:
  • model_file (str) – The sentencepiece model file path.

  • model_proto (str) – The sentencepiece model serialized proto.

  • out_type (type) – output type. int or str. (Default = int)

  • add_bos (bool) – Add <s> to the result (Default = false)

  • add_eos (bool) – Add </s> to the result (Default = false). <s>/</s> is added after reversing (if enabled).

  • reverse (bool) – Reverses the tokenized sequence (Default = false)

  • enable_sampling (bool) – (Default = false)

  • nbest_size (int) –

    Sampling parameter for unigram. Invalid for BPE-Dropout. nbest_size = {0,1}: no sampling is performed. nbest_size > 1: samples from the nbest_size results. nbest_size < 0 (default): assumes that nbest_size is infinite and samples from all hypotheses (lattice) using the forward-filtering-and-backward-sampling algorithm.

  • alpha (float) – Smoothing parameter for unigram sampling, and dropout probability of merge operations for BPE-dropout. (Default = 0.1)

  • control_symbols (dict[str,str|int]|None) – https://github.com/google/sentencepiece/blob/master/doc/special_symbols.md

  • user_defined_symbols (dict[str,str|int]|None) – https://github.com/google/sentencepiece/blob/master/doc/special_symbols.md
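
As a hedged illustration of how the options above fit together, the dict below enables sampling for a unigram model. The model path is a placeholder, and passing the dict as SentencePieces(**opts) is inferred from the constructor signature rather than shown on this page.

```python
# Hypothetical option dict; the model path is a placeholder, not a real file.
sentencepiece_opts = {
    "model_file": "/path/to/spm.model",  # assumption: replace with your own model
    "add_bos": False,
    "add_eos": True,  # </s> is appended after any reversing
    "reverse": False,
    "enable_sampling": True,  # sample segmentations instead of the single best one
    "nbest_size": -1,  # < 0: sample from the full lattice (unigram models only)
    "alpha": 0.1,  # smoothing for unigram sampling, or BPE-dropout probability
}

# With RETURNN and sentencepiece installed, this would presumably be used as:
#   vocab = SentencePieces(**sentencepiece_opts)
print(sorted(sentencepiece_opts))
```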

property labels
Return type:

list[str]

is_id_valid(idx)
Parameters:

idx (int) –

Return type:

bool

id_to_label(idx, default=<class 'KeyError'>)
Parameters:
  • idx (int) –

  • default (str|type[KeyError]|None) –

Return type:

str|None

label_to_id(label, default=<class 'KeyError'>)
Parameters:
  • label (str) –

  • default (int|type[KeyError]|None) –

Return type:

int|None

set_random_seed(seed)
Parameters:

seed (int) –

get_seq(sentence)
Parameters:

sentence (str) – assumed to be a sequence of vocabulary entries separated by whitespace

Return type:

list[int]

class returnn.datasets.util.vocabulary.CharacterTargets(vocab_file, seq_postfix=None, unknown_label='@', labels=None, **kwargs)

Uses characters as target labels. Also see Utf8ByteTargets.

Parameters:
  • vocab_file (str|None) –

  • seq_postfix (list[int]|None) – labels will be added to the seq in self.get_seq

  • unknown_label (str|None) –

  • labels (list[str]|None) –

get_seq(sentence)
Parameters:

sentence (str) –

Return type:

list[int]

get_seq_labels(seq)
Parameters:

seq (list[int]|numpy.ndarray) – 1D sequence

Return type:

str
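
A minimal sketch of character-level targets: each character of the sentence, including spaces, becomes one id, and decoding concatenates the labels without a separator. The fallback to unknown_label and the separator-free join are assumptions; both helpers are hypothetical and independent of RETURNN.

```python
from typing import Dict, List, Optional, Sequence


def char_get_seq(
    vocab: Dict[str, int],
    sentence: str,
    unknown_label: Optional[str] = "@",
    seq_postfix: Optional[List[int]] = None,
) -> List[int]:
    """Map every character to its id, using unknown_label for unseen characters."""
    if unknown_label is not None:
        unk_id = vocab[unknown_label]
        ids = [vocab.get(ch, unk_id) for ch in sentence]
    else:
        ids = [vocab[ch] for ch in sentence]  # KeyError on unseen characters
    return ids + list(seq_postfix or [])


def char_get_seq_labels(labels: List[str], seq: Sequence[int]) -> str:
    """Concatenate the character labels for a 1D sequence of ids."""
    return "".join(labels[int(idx)] for idx in seq)


char_labels = ["@", "a", "b", " "]
char_vocab = {label: idx for idx, label in enumerate(char_labels)}
ids = char_get_seq(char_vocab, "ab z")
print(ids)  # [1, 2, 3, 0]
print(char_get_seq_labels(char_labels, ids))  # ab @
```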

class returnn.datasets.util.vocabulary.Utf8ByteTargets(seq_postfix=None)

Uses bytes as target labels from UTF-8-encoded text. All bytes (0-255) are allowed. Also see CharacterTargets.

Parameters:

seq_postfix (list[int]|None) – labels will be added to the seq in self.get_seq

get_seq(sentence)
Parameters:

sentence (str) –

Return type:

list[int]

get_seq_labels(seq)
Parameters:

seq (list[int]|numpy.ndarray) – 1D sequence

Return type:

str
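
Byte-level targets need no vocabulary file, because the id of each label is simply the byte value. The sketch below shows the round trip with plain Python; the helper names are hypothetical, and appending seq_postfix after the encoded bytes is an assumption.

```python
from typing import List, Optional, Sequence


def utf8_get_seq(sentence: str, seq_postfix: Optional[List[int]] = None) -> List[int]:
    """Encode the text as UTF-8 and use the byte values (0-255) as ids."""
    return list(sentence.encode("utf8")) + list(seq_postfix or [])


def utf8_get_seq_labels(seq: Sequence[int]) -> str:
    """Decode a 1D sequence of byte values back into text."""
    return bytes(int(b) for b in seq).decode("utf8")


ids = utf8_get_seq("héllo")
print(ids)  # [104, 195, 169, 108, 108, 111]
print(utf8_get_seq_labels(ids))  # héllo
```

Non-ASCII characters such as "é" expand to several ids, so the sequence length can exceed the number of characters.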