returnn.datasets.util.vocabulary
Vocabulary-related classes for targets such as BPE, SentencePieces, etc.
- class returnn.datasets.util.vocabulary.Vocabulary(vocab_file: str | None, *, special_symbols_via_file: str | None = None, unknown_label: str | int | None = NotSpecified, bos_label: str | int | None = None, eos_label: str | int | None = None, pad_label: str | int | None = None, control_symbols: Dict[str, str | int] | None = None, user_defined_symbols: Dict[str, str | int] | None = None, num_labels: int | None = None, seq_postfix: List[int] | None = None, labels: List[str] | Callable[[], List[str]] | None = None, single_whitespace_split: bool = False)
Represents a vocabulary (set of words, and their ids). Used by BytePairEncoding.
- Parameters:
vocab_file
special_symbols_via_file – if given, the file is supposed to contain a dict with potential keys “unknown_label”, “bos_label”, “eos_label”, “pad_label”, “control_symbols”, “user_defined_symbols”. When labels are specified directly as kwargs, those take precedence over any option in the file.
unknown_label – e.g. “UNK” or “<unk>”
bos_label – e.g. “<s>”
eos_label – e.g. “</s>”
pad_label – e.g. “<pad>”
control_symbols – https://github.com/google/sentencepiece/blob/master/doc/special_symbols.md
user_defined_symbols – https://github.com/google/sentencepiece/blob/master/doc/special_symbols.md
num_labels – just for verification
seq_postfix – labels will be added to the seq in self.get_seq
labels
single_whitespace_split – Assume that the given text is encoded using " ".join(labels[i] for i in seq), and this will undo that. This makes a difference when there is whitespace itself in the vocab (in labels). If not enabled (the default), this will simply use str.split().
- set_random_seed(seed: int)
This can be called for a new epoch or so. Usually it has no effect, as there is no randomness. However, some vocab class could introduce some sampling process.
- Parameters:
seed
- classmethod create_vocab_dict_from_labels(labels)
This is exactly the format which we expect when we read it in self._parse_vocab.
- Parameters:
labels (list[str])
- Return type:
dict[str,int]
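As a rough illustration of the label-to-id format, here is a minimal sketch in plain Python. It assumes that each label is simply mapped to its position in the list, which the dict[str,int] return type suggests but the text above does not spell out. The function name and the example labels are made up and are not part of RETURNN.

```python
from typing import Dict, List


def vocab_dict_from_labels(labels: List[str]) -> Dict[str, int]:
    """Hypothetical stand-in: map each label to its index in the list."""
    return {label: idx for idx, label in enumerate(labels)}


# Example labels, chosen only for illustration.
vocab = vocab_dict_from_labels(["<s>", "</s>", "<unk>", "hello", "world"])
print(vocab["hello"])
```

With this layout, the id of a label is just its list position, so the label list and the dict carry the same information.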
- classmethod create_vocab_from_labels(labels, **kwargs)
Creates a Vocabulary from the given labels. Depending on whether the labels are identified as bytes, characters or words, a Utf8ByteTargets, CharacterTargets or Vocabulary vocab is created.
- Parameters:
labels (list[str])
- Return type:
Vocabulary
- tf_get_init_variable_func(var)
- Parameters:
var (tensorflow.Variable)
- Return type:
(tensorflow.Session)->None
- to_id(label: str | int | None, default: str | Type[KeyError] | None = KeyError, allow_none: bool = False) -> int | None
- Parameters:
label
default
allow_none – whether label can be None; in that case, None is returned
- label_to_id(label: str, default: int | Type[KeyError] | None = KeyError) -> int | None
- Parameters:
label
default
- id_to_label(idx: int, default: str | Type[KeyError] | None = KeyError) -> str | None
- Parameters:
idx
default
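The lookup methods above take the KeyError class itself as the default value of `default`. One plausible reading, sketched below in plain Python, is that KeyError acts as a sentinel meaning "raise on a missing entry", while any other value is returned as the fallback. This is an assumption about the behaviour, not RETURNN's code; the function name and toy vocab are hypothetical.

```python
from typing import Dict, Optional, Type, Union


def label_to_id_sketch(
    vocab: Dict[str, int],
    label: str,
    default: Union[int, Type[KeyError], None] = KeyError,
) -> Optional[int]:
    """Hypothetical lookup: raise if default is the KeyError class, else fall back."""
    if label in vocab:
        return vocab[label]
    if default is KeyError:
        raise KeyError("label %r not in vocab" % (label,))
    return default


toy_vocab = {"hello": 0, "world": 1}
found = label_to_id_sketch(toy_vocab, "world")
missing = label_to_id_sketch(toy_vocab, "there", default=None)
```

Using the exception class as the default keeps None available as a legitimate fallback value.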
- get_seq(sentence: str) -> List[int]
- Parameters:
sentence – assumed to be seq of vocab entries separated by whitespace
- Returns:
seq of label indices
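How the sentence is split matters when a label is itself whitespace, which is what the single_whitespace_split option addresses. The following sketch in plain Python contrasts str.split() with a split on exactly one space. The label list and the helper function are made up for illustration, and the helper assumes that " " is the only whitespace label and that no label is empty; RETURNN's own handling may differ.

```python
# Toy label list in which a single space is itself a label (made up for illustration).
labels = ["a", "b", " "]
seq = [0, 2, 1]  # "a", " ", "b"

# Serialize as described for single_whitespace_split: labels joined by one space.
text = " ".join(labels[i] for i in seq)

# Default: str.split() collapses whitespace runs, so the " " label disappears.
default_split = text.split()


def split_single_whitespace(s):
    """Hypothetical helper: split on exactly one space and restore " " labels.

    Assumes " " is the only whitespace label and that no label is empty.
    """
    parts = s.split(" ")
    out = []
    i = 0
    while i < len(parts):
        if parts[i] == "" and i + 1 < len(parts) and parts[i + 1] == "":
            # A " " label leaves two consecutive empty parts after splitting.
            out.append(" ")
            i += 2
        else:
            out.append(parts[i])
            i += 1
    return out


recovered = split_single_whitespace(text)
```

Here default_split loses the whitespace label, while recovered maps back to the original index sequence.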
- get_seq_indices(seq: List[str]) -> List[int]
- Parameters:
seq – seq of labels (entries in vocab)
- Returns:
seq of label indices; labels not in the vocab map to unknown_label_id if unknown_label is set
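The interplay of get_seq, get_seq_indices, the unknown label and seq_postfix can be sketched in plain Python as follows. This is a simplified stand-in under stated assumptions, not RETURNN's implementation: the vocab contents and function names are hypothetical, and the postfix is assumed to be appended once, in the get_seq step.

```python
from typing import Dict, List, Optional

# Toy vocabulary and settings (hypothetical values).
vocab: Dict[str, int] = {"<unk>": 0, "</s>": 1, "hello": 2, "world": 3}
unknown_label_id: Optional[int] = vocab["<unk>"]
seq_postfix: List[int] = [vocab["</s>"]]


def get_seq_indices_sketch(seq: List[str]) -> List[int]:
    """Map labels to ids; fall back to the unknown id if one is configured."""
    if unknown_label_id is not None:
        return [vocab.get(label, unknown_label_id) for label in seq]
    # Without an unknown label, a missing entry raises KeyError.
    return [vocab[label] for label in seq]


def get_seq_sketch(sentence: str) -> List[int]:
    """Split on whitespace, map to ids, then append the postfix labels."""
    return get_seq_indices_sketch(sentence.split()) + seq_postfix


ids = get_seq_sketch("hello there world")
```

In this toy setup "there" is not in the vocab, so it maps to the id of "<unk>", and the id of "</s>" is appended at the end.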
- get_seq_labels(seq: List[int] | ndarray) -> str
Inverse of get_seq().
- Parameters:
seq – 1D sequence of label indices
- Returns:
serialized sequence string, such that
get_seq(get_seq_labels(seq)) == seq
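The round-trip property can be checked on a toy vocabulary. The sketch below uses plain Python stand-ins with hypothetical names, and assumes no seq_postfix and no whitespace inside labels, since either would break the simple identity shown here.

```python
from typing import Dict, List

# Toy labels (hypothetical); the id of a label is its list position.
labels: List[str] = ["<unk>", "hello", "world"]
vocab: Dict[str, int] = {label: i for i, label in enumerate(labels)}


def get_seq_labels_sketch(seq: List[int]) -> str:
    """Serialize ids as whitespace-separated labels."""
    return " ".join(labels[i] for i in seq)


def get_seq_sketch(sentence: str) -> List[int]:
    """Parse whitespace-separated labels back into ids."""
    return [vocab[label] for label in sentence.split()]


seq = [1, 2, 1]
text = get_seq_labels_sketch(seq)
round_trip = get_seq_sketch(text)
```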
- serialize_labels(data: ndarray) -> str
Like get_seq_labels() but a bit more generic, to not just work on sequences, but any shape. Also like Dataset.serialize_data() but even slightly more generic.
- class returnn.datasets.util.vocabulary.BytePairEncoding(vocab_file, bpe_file, seq_postfix=None, **kwargs)
Vocab based on Byte-Pair-Encoding (BPE). This will encode the text on-the-fly with BPE.
Reference: Rico Sennrich, Barry Haddow and Alexandra Birch (2016). Neural Machine Translation of Rare Words with Subword Units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). Berlin, Germany.
- Parameters:
vocab_file (str)
bpe_file (str)
seq_postfix (list[int]|None) – labels will be added to the seq in self.get_seq
- class returnn.datasets.util.vocabulary.SamplingBytePairEncoding(vocab_file: str, breadth_prob: float, seq_postfix: List[int] | None = None, label_postfix_merge_symbol: str | None = NotSpecified, word_prefix_symbol: str | None = NotSpecified, **kwargs)
Vocab based on Byte-Pair-Encoding (BPE). Like BytePairEncoding, but here we randomly sample from different possible BPE splits. This will encode the text on-the-fly with BPE.
- Parameters:
vocab_file
breadth_prob
seq_postfix – labels will be added to the seq in self.get_seq
label_postfix_merge_symbol – If given, will use this as label postfix merge symbol, i.e. when this occurs at the end of a label, it is supposed to be merged with the next label, i.e. the space between them is removed and is not a word boundary. If None, will not use any postfix merge symbol. If not specified, and also word_prefix_symbol is not specified, will use “@@” by default here, the standard from subword-nmt, and our original behavior.
word_prefix_symbol – If given, every new word starts with this symbol. This also implies that there are no spaces between words and this symbol is a placeholder for the space. If None, will not use this logic. For SentencePiece, you usually would use “▁” here.
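The two marker conventions described for label_postfix_merge_symbol and word_prefix_symbol can be illustrated by how subword labels are joined back into text. The sketch below is plain Python with hypothetical helper names and example tokens; it only shows the conventions and is not how the class itself samples or encodes.

```python
from typing import List


def join_with_postfix_merge(tokens: List[str], merge_symbol: str = "@@") -> str:
    """A token ending in merge_symbol is glued to the next token (subword-nmt style)."""
    return " ".join(tokens).replace(merge_symbol + " ", "")


def join_with_word_prefix(tokens: List[str], word_prefix: str = "▁") -> str:
    """Tokens carry no spaces; word_prefix stands in for the space before each word."""
    return "".join(tokens).replace(word_prefix, " ").strip()


postfix_text = join_with_postfix_merge(["hel@@", "lo", "wor@@", "ld"])
prefix_text = join_with_word_prefix(["▁hel", "lo", "▁wor", "ld"])
```

Both calls yield the same text; the difference is only whether the marker sits at the end of a non-final piece or at the start of each word.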
- class returnn.datasets.util.vocabulary.SentencePieces(**opts)
Uses the SentencePiece software, which supports different kinds of subword units (including BPE, unigram, …).
https://github.com/google/sentencepiece/ https://github.com/google/sentencepiece/tree/master/python
Dependency:
pip3 install --user sentencepiece
- Parameters:
model_file (str) – The sentencepiece model file path.
model_proto (str) – The sentencepiece model serialized proto.
out_type (type) – output type. int or str. (Default = int)
add_bos (bool) – Add <s> to the result (Default = false)
add_eos (bool) – Add </s> to the result (Default = false). <s>/</s> are added after reversing (if enabled).
reverse (bool) – Reverses the tokenized sequence (Default = false)
enable_sampling (bool) – (Default = false)
nbest_size (int) – sampling parameters for unigram. Invalid for BPE-Dropout. nbest_size = {0,1}: no sampling is performed. nbest_size > 1: samples from the nbest_size results. nbest_size < 0 (default): assumes that nbest_size is infinite and samples from all hypotheses (lattice) using the forward-filtering-and-backward-sampling algorithm.
alpha (float) – Smoothing parameter for unigram sampling, and dropout probability of merge operations for BPE-dropout. (Default = 0.1)
control_symbols (dict[str,str|int]|None) – https://github.com/google/sentencepiece/blob/master/doc/special_symbols.md
user_defined_symbols (dict[str,str|int]|None) – https://github.com/google/sentencepiece/blob/master/doc/special_symbols.md
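The note that <s>/</s> are added after reversing fixes the order of the post-processing steps. A minimal sketch of that ordering in plain Python follows; the function name, the ids and the bos_id/eos_id arguments are hypothetical and do not call into the sentencepiece package.

```python
from typing import List


def finalize_sketch(
    ids: List[int],
    *,
    bos_id: int,
    eos_id: int,
    add_bos: bool = False,
    add_eos: bool = False,
    reverse: bool = False,
) -> List[int]:
    """Reverse first (if enabled), then attach BOS/EOS around the result."""
    out = list(ids)
    if reverse:
        out.reverse()
    if add_bos:
        out.insert(0, bos_id)
    if add_eos:
        out.append(eos_id)
    return out


result = finalize_sketch(
    [10, 11, 12], bos_id=1, eos_id=2, add_bos=True, add_eos=True, reverse=True
)
```

Because reversal happens first, the BOS id stays at the front and the EOS id at the end even for a reversed sequence.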
- id_to_label(idx: int, default: str | Type[KeyError] | None = KeyError) -> str | None
- Parameters:
idx
default
- class returnn.datasets.util.vocabulary.CharacterTargets(vocab_file, seq_postfix=None, unknown_label='@', labels=None, **kwargs)
Uses characters as target labels. Also see Utf8ByteTargets.
- Parameters:
vocab_file (str|None)
seq_postfix (list[int]|None) – labels will be added to the seq in self.get_seq
unknown_label (str|None)
labels (list[str]|None)
- class returnn.datasets.util.vocabulary.Utf8ByteTargets(seq_postfix=None, **opts)
Uses bytes as target labels from UTF8 encoded text. All bytes (0-255) are allowed. Also see CharacterTargets.
- Parameters:
seq_postfix (list[int]|None) – labels will be added to the seq in self.get_seq
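The difference between byte targets and character targets shows up as soon as the text contains non-ASCII characters. The following plain Python sketch uses only built-in string and bytes operations, not the classes themselves; the example string is arbitrary.

```python
text = "né"

# Byte targets: one id per UTF-8 byte, always in the range 0-255.
byte_ids = list(text.encode("utf8"))

# Character targets: one label per Unicode character.
chars = list(text)

# The byte sequence decodes back to the original text.
decoded = bytes(byte_ids).decode("utf8")
```

The two-character string yields three byte ids, because "é" takes two bytes in UTF-8. Byte targets need no vocab file but produce longer sequences for such text.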
- class returnn.datasets.util.vocabulary.HuggingFaceTokenizer(*, huggingface_repo_dir: str | None = None, tokenizer: transformers.PreTrainedTokenizerBase | None = None, map_bos_to_eos: bool = False, text_preprocessing: Callable[[str], str] | None = None, bpe_dropout: float = 0.0)
Uses the AutoTokenizer class from the transformers package.
- Parameters:
huggingface_repo_dir – the directory containing the tokenizer_config.json file.
tokenizer – if given, will use this tokenizer directly. Otherwise, will load it from huggingface_repo_dir.
map_bos_to_eos
text_preprocessing – applied in get_seq() (sentence -> ids)
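Any callable that takes a string and returns a string fits the annotated type of text_preprocessing. As a hedged example, the function below lowercases the sentence and collapses whitespace. The function name and the normalization choices are purely illustrative, not something the class prescribes.

```python
def normalize_text(sentence: str) -> str:
    """Hypothetical preprocessing: lowercase and collapse runs of whitespace."""
    return " ".join(sentence.lower().split())


cleaned = normalize_text("  Hello   World ")
```

Such a function would be passed as the text_preprocessing argument, so that it runs on each sentence before tokenization.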
- id_to_label(idx: int, default: str | Type[KeyError] | None = KeyError) -> str | None
- Parameters:
idx
default
- label_to_id(label: str, default: int | Type[KeyError] | None = KeyError) -> int | None
- Parameters:
label
default