`returnn.util.bpe`

Provide basic Byte-Pair-Encoding (BPE) utilities.

class returnn.util.bpe.StandardBytePairEncoder(bpe_codes_file, labels=None)

Code is partly taken from subword-nmt/apply_bpe.py. Author: Rico Sennrich, code under MIT license.

Use operations learned with learn_bpe.py to encode new text. The encoded text is not shorter, but it uses only a fixed vocabulary, with rare words encoded as variable-length sequences of subword units.

Reference: Rico Sennrich, Barry Haddow and Alexandra Birch (2016). Neural Machine Translation of Rare Words with Subword Units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). Berlin, Germany.

Parameters:

bpe_codes_file (str) – path to the BPE codes file
labels (list[str]|None) – vocabulary

segment_sentence(sentence)

Segment a single sentence (whitespace-tokenized string) with BPE encoding.

Parameters: sentence (str)
Return type: list[str]
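
To illustrate what applying learned operations to a sentence means, here is a minimal, self-contained sketch. It is not the RETURNN or subword-nmt implementation: the names `apply_bpe_merges` and `segment_sentence_sketch` are hypothetical, the merge table is passed in as a dict instead of being read from a codes file, and the end-of-word marker used by subword-nmt is omitted for brevity. The `@@` suffix on non-final pieces follows the usual subword-nmt convention.

```python
from typing import Dict, List, Tuple


def apply_bpe_merges(word: str, merge_ranks: Dict[Tuple[str, str], int]) -> List[str]:
    """Greedily apply merges to one word; a lower rank means the merge was learned earlier."""
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Pick the adjacent pair with the lowest rank; unknown pairs rank last.
        best = min(pairs, key=lambda p: merge_ranks.get(p, float("inf")))
        if best not in merge_ranks:
            break  # no learned merge applies any more
        merged: List[str] = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


def segment_sentence_sketch(
    sentence: str,
    merge_ranks: Dict[Tuple[str, str], int],
    merge_symbol: str = "@@",
) -> List[str]:
    """Split on whitespace, segment each word, and mark non-final pieces."""
    out: List[str] = []
    for word in sentence.split():
        pieces = apply_bpe_merges(word, merge_ranks)
        out.extend(piece + merge_symbol for piece in pieces[:-1])
        out.append(pieces[-1])
    return out


# Hypothetical merge table: "l"+"o" was learned first, then "lo"+"w".
example_merges = {("l", "o"): 0, ("lo", "w"): 1}
example_labels = segment_sentence_sketch("low lower", example_merges)
```

With this toy merge table, "low" stays a single label, while "lower" is split into a known stem plus single characters, which is the "rare words as subword sequences" behaviour described above.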

class returnn.util.bpe.BpeOpts(label_postfix_merge_symbol: str | None = None, word_prefix_symbol: str | None = None)

Options intended to cover both subword-nmt BPE and SentencePiece BPE/Unigram.

label_postfix_merge_symbol: str | None = None

word_prefix_symbol: str | None = None
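
The two options correspond to two common ways of marking word boundaries in a label sequence: subword-nmt appends a merge symbol (conventionally `@@`) to every non-final piece of a word, while SentencePiece prefixes the first piece of a word (conventionally with `▁`). The sketch below shows the difference on the label level. `BpeOptsSketch` and `pieces_to_labels` are hypothetical stand-ins written for this illustration, and how RETURNN applies the options internally is an assumption here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BpeOptsSketch:
    """Hypothetical stand-in mirroring the two documented fields."""

    label_postfix_merge_symbol: Optional[str] = None  # e.g. "@@" (subword-nmt style)
    word_prefix_symbol: Optional[str] = None  # e.g. "▁" (SentencePiece style)


def pieces_to_labels(pieces: List[str], opts: BpeOptsSketch) -> List[str]:
    """Turn the raw pieces of ONE word into vocabulary labels."""
    labels = list(pieces)
    if opts.label_postfix_merge_symbol is not None:
        # Every piece except the last one is marked as "continues".
        labels = [p + opts.label_postfix_merge_symbol for p in labels[:-1]] + labels[-1:]
    if opts.word_prefix_symbol is not None and labels:
        # Only the first piece is marked as "starts a word".
        labels[0] = opts.word_prefix_symbol + labels[0]
    return labels


subword_nmt_style = pieces_to_labels(["low", "er"], BpeOptsSketch(label_postfix_merge_symbol="@@"))
sentencepiece_style = pieces_to_labels(["low", "er"], BpeOptsSketch(word_prefix_symbol="▁"))
```

Here `subword_nmt_style` is `["low@@", "er"]` and `sentencepiece_style` is `["▁low", "er"]`.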

class returnn.util.bpe.PrefixTree(*, prefix: str = '', opts: BpeOpts)

Prefix tree / trie. This class represents both a single node and the tree.

Parameters: prefix – if this is not the root, the prefix to get here

add(postfix: str) → PrefixTree

Parameters: postfix
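
For readers unfamiliar with tries, the following is a minimal sketch of a node-is-also-tree prefix tree with an `add` method of the same shape. `TrieNode`, its `arcs` dict, and its `finished` flag are hypothetical names chosen for this illustration; the actual `PrefixTree` additionally takes `opts` and may store its children differently.

```python
from typing import Dict


class TrieNode:
    """One node of a character trie; the root node doubles as the whole tree."""

    def __init__(self, prefix: str = ""):
        self.prefix = prefix  # characters consumed to reach this node
        self.arcs: Dict[str, "TrieNode"] = {}  # next character -> child node
        self.finished = False  # True if `prefix` itself is a vocabulary entry

    def add(self, postfix: str) -> "TrieNode":
        """Insert the remaining characters below this node; return the final node."""
        if not postfix:
            self.finished = True
            return self
        ch = postfix[0]
        child = self.arcs.get(ch)
        if child is None:
            child = TrieNode(self.prefix + ch)
            self.arcs[ch] = child
        return child.add(postfix[1:])


root = TrieNode()
low_node = root.add("low")
lo_node = root.add("lo")
```

After these two calls, `lo_node` is the parent of `low_node`, and both are marked as finished because both strings were added as complete entries.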

class returnn.util.bpe.Hyp(bpe_sym_history: List[str], cur_node: PrefixTree)

Represents a hypothesis in the search.

bpe_sym_history: List[str]

cur_node: PrefixTree

class returnn.util.bpe.CharSyncSearch(bpe: PrefixTree, word: str, word_pos: int = 0)

Covers the search hyps and the search itself.

Parameters:

bpe
word
word_pos

search() → List[List[str]]

Returns: collection of possible BPE symbol seqs
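
The idea of a character-synchronous search can be sketched as follows: all hypotheses advance one input character at a time, and each hypothesis either extends its current partial piece or closes it and starts a new one. This is an illustrative sketch only. The function `all_splits` is hypothetical, it uses plain sets instead of a `PrefixTree`, and it ignores merge and prefix symbols.

```python
from typing import List, Set, Tuple


def all_splits(word: str, vocab: Set[str]) -> List[List[str]]:
    """Return every way to write `word` as a concatenation of vocabulary entries."""
    # All non-empty prefixes of vocabulary entries, used to prune dead ends early.
    prefixes = {v[:i] for v in vocab for i in range(1, len(v) + 1)}
    # A hypothesis is (finished pieces so far, partial piece currently being read).
    hyps: List[Tuple[List[str], str]] = [([], "")]
    for ch in word:
        next_hyps: List[Tuple[List[str], str]] = []
        for history, partial in hyps:
            extended = partial + ch
            if extended in prefixes:
                # Keep reading the same piece.
                next_hyps.append((history, extended))
            if partial in vocab and ch in prefixes:
                # Close the current piece and start a new one with this character.
                next_hyps.append((history + [partial], ch))
        hyps = next_hyps
    # Only hypotheses whose last partial piece is a complete entry survive.
    return [history + [partial] for history, partial in hyps if partial in vocab]


toy_vocab = {"l", "o", "w", "lo", "low"}
toy_splits = all_splits("low", toy_vocab)
```

For the toy vocabulary, the word "low" has three segmentations: as one piece, as "lo" plus "w", and as three single characters. A word containing a character that no vocabulary entry covers yields an empty list.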

class returnn.util.bpe.HypInPos(bpe_sym_history: List[str], cur_node: PrefixTree, pos: int)

Represents a hypothesis in the search.

bpe_sym_history: List[str]

cur_node: PrefixTree

pos: int

class returnn.util.bpe.DepthFirstSearch(bpe: PrefixTree, word: str, sampler: Callable[[], bool] | None = None)

Depth-first search.

Parameters:

bpe
word
sampler

search() → List[str] | None

Returns: BPE symbol seq if one is found
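
In contrast to enumerating all segmentations, a depth-first search stops at the first complete one. The sketch below shows one plausible way a boolean `sampler` can steer which segmentation is found first. The function `dfs_split` is hypothetical, it works on a plain set, and the rule used here (a `True` sample flips the candidate order from longest-first to shortest-first) is an assumption for illustration, not a description of how `DepthFirstSearch` uses its sampler.

```python
from typing import Callable, List, Optional, Set


def dfs_split(
    word: str,
    vocab: Set[str],
    sampler: Optional[Callable[[], bool]] = None,
) -> Optional[List[str]]:
    """Return the first segmentation found by depth-first search, or None."""

    def visit(pos: int) -> Optional[List[str]]:
        if pos == len(word):
            return []  # consumed the whole word
        # Candidate end positions for the next piece, longest piece first.
        ends = [end for end in range(len(word), pos, -1) if word[pos:end] in vocab]
        if sampler is not None and sampler():
            ends.reverse()  # assumption: a True sample prefers shorter pieces
        for end in ends:
            rest = visit(end)
            if rest is not None:
                return [word[pos:end]] + rest
        return None  # dead end, the caller backtracks

    return visit(0)


dfs_vocab = {"l", "o", "w", "lo", "low"}
greedy_split = dfs_split("low", dfs_vocab)
short_split = dfs_split("low", dfs_vocab, sampler=lambda: True)
```

Without a sampler the search returns the single piece "low"; with a sampler that always returns `True` it returns the three single characters. This sketch does no memoization, so it can revisit the same position many times on long inputs.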

class returnn.util.bpe.SamplingBytePairEncoder(*, labels: List[str], breadth_prob: float, rnd: RandomState, unknown_label: str | None = None, opts: BpeOpts)

Will randomly sample from any possible BPE split.

Parameters:

labels – vocab
breadth_prob – 1.0 leads to breadth-first search, 0.0 to depth-first search. Other values give stochastic behaviour.
rnd
unknown_label
opts
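
One way to read `breadth_prob` is as the probability that a boolean sampler, such as the one accepted by `DepthFirstSearch`, returns `True` at each decision point. The sketch below makes that reading concrete. `make_sampler` is a hypothetical helper, the link between `breadth_prob` and the sampler is an assumption, and Python's standard `random.Random` is used in place of the NumPy `RandomState` the class expects.

```python
import random
from typing import Callable


def make_sampler(breadth_prob: float, rnd: random.Random) -> Callable[[], bool]:
    """Build a zero-argument sampler that returns True with probability `breadth_prob`."""

    def sampler() -> bool:
        # rnd.random() lies in [0.0, 1.0), so 0.0 never fires and 1.0 always fires.
        return rnd.random() < breadth_prob

    return sampler


rng = random.Random(42)  # fixed seed so that sampled splits are reproducible
never = make_sampler(0.0, rng)
always = make_sampler(1.0, rng)
sometimes = make_sampler(0.3, rng)
```

With the two extreme values the sampler is deterministic, which matches the documented statement that only intermediate values are stochastic.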

get_bpe_split_for_word(word: str) → List[str] | None

Parameters: word

segment_sentence(sentence: str) → List[str]

Segment a single sentence (whitespace-tokenized string) with BPE encoding.

Parameters: sentence
