returnn.util.bpe
Provide basic Byte-Pair-Encoding (BPE) utilities.
- class returnn.util.bpe.StandardBytePairEncoder(bpe_codes_file, labels=None)
Code is partly taken from subword-nmt/apply_bpe.py. Author: Rico Sennrich, code under MIT license.
Uses the merge operations learned with learn_bpe.py to encode new text. The encoded text is not shorter, but it uses only a fixed vocabulary, with rare words encoded as variable-length sequences of subword units.
Reference: Rico Sennrich, Barry Haddow and Alexandra Birch (2016). Neural Machine Translation of Rare Words with Subword Units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). Berlin, Germany.
- Parameters:
bpe_codes_file (str) – codes file
labels (list[str]|None) – vocab
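As an illustration of what "applying learned operations" means, the following is a minimal sketch of replaying BPE merges on a single word. It is not the RETURNN or subword-nmt implementation: the function name `apply_bpe_merges` and the `merge_ranks` dictionary are hypothetical, and the sketch assumes the subword-nmt conventions of an end-of-word marker `</w>` during merging and `@@` as the continuation marker in the output.

```python
from typing import Dict, List, Tuple

END = "</w>"  # end-of-word marker used while merging (subword-nmt convention)


def apply_bpe_merges(word: str, merge_ranks: Dict[Tuple[str, str], int]) -> List[str]:
    """Split ``word`` into subword units by replaying learned merges.

    ``merge_ranks`` (hypothetical name) maps a symbol pair to its priority;
    a lower rank means the merge was learned earlier and is applied first.
    """
    if not word:
        return []
    # Start from single characters; the last one carries the end-of-word marker.
    symbols = list(word[:-1]) + [word[-1] + END]
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        # Pick the adjacent pair with the best (lowest) rank.
        best = min(pairs, key=lambda p: merge_ranks.get(p, float("inf")))
        if best not in merge_ranks:
            break  # no learned merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    # Drop the end-of-word marker and mark every non-final unit with "@@".
    symbols[-1] = symbols[-1][: -len(END)]
    return [s + "@@" for s in symbols[:-1]] + [symbols[-1]]


ranks = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r" + END): 2}
print(apply_bpe_merges("lower", ranks))  # ['low@@', 'er']
```

A word with no applicable merges falls back to single characters, which is why the output vocabulary stays fixed even for unseen words.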
- class returnn.util.bpe.BpeOpts(label_postfix_merge_symbol: str | None = None, word_prefix_symbol: str | None = None)
Options intended to cover both subword-nmt BPE and SentencePiece BPE/Unigram.
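The two fields plausibly correspond to the two common marker conventions: a postfix on non-final pieces (for example `@@` in subword-nmt) and a prefix on word-initial pieces (for example `▁` in SentencePiece). The sketch below shows how each convention can be undone. `MarkerOpts` and `join_pieces` are hypothetical stand-ins written for this illustration, and the interpretation of the fields is an assumption, not taken from the RETURNN source.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MarkerOpts:
    """Hypothetical stand-in that mirrors the two documented BpeOpts fields."""

    label_postfix_merge_symbol: Optional[str] = None  # e.g. "@@" (assumed)
    word_prefix_symbol: Optional[str] = None  # e.g. "▁" (assumed)


def join_pieces(pieces: List[str], opts: MarkerOpts) -> str:
    """Undo a subword split under either marker convention."""
    if opts.label_postfix_merge_symbol is not None:
        # Postfix style: a marked piece is glued to the piece that follows it.
        sym = opts.label_postfix_merge_symbol
        return " ".join(pieces).replace(sym + " ", "")
    if opts.word_prefix_symbol is not None:
        # Prefix style: the marker starts a new word, everything else is glued.
        sym = opts.word_prefix_symbol
        return "".join(pieces).replace(sym, " ").strip()
    return " ".join(pieces)


print(join_pieces(["low@@", "er", "is", "new@@", "er"],
                  MarkerOpts(label_postfix_merge_symbol="@@")))
print(join_pieces(["▁low", "er", "▁is", "▁new", "er"],
                  MarkerOpts(word_prefix_symbol="▁")))
```

Both calls recover the same text, `lower is newer`, which is the sense in which one options object can describe either tokenizer family.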
- class returnn.util.bpe.PrefixTree(*, prefix: str = '', opts: BpeOpts)
Prefix tree / trie. This class represents both a single node and the tree.
- Parameters:
prefix – if this is not the root, the prefix to get here
- add(postfix: str) → PrefixTree
- Parameters:
postfix
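To make the "node is also the tree" idea concrete, here is a minimal trie sketch in which `add` consumes the remaining characters and returns the node it ends on. The class `TrieNode` and its attributes `arcs` and `finished` are hypothetical names for this illustration. The real `PrefixTree` additionally takes `opts`, which this sketch ignores.

```python
from typing import Dict


class TrieNode:
    """A node that doubles as the (sub)tree rooted at it."""

    def __init__(self, prefix: str = ""):
        self.prefix = prefix  # characters consumed from the root to get here
        self.arcs: Dict[str, "TrieNode"] = {}  # next character -> child node
        self.finished = False  # True if `prefix` itself is a vocabulary entry

    def add(self, postfix: str) -> "TrieNode":
        """Insert the remaining characters below this node; return the last node."""
        if not postfix:
            self.finished = True
            return self
        head, tail = postfix[0], postfix[1:]
        if head not in self.arcs:
            self.arcs[head] = TrieNode(self.prefix + head)
        return self.arcs[head].add(tail)


root = TrieNode()
for label in ["low", "lo", "er"]:
    root.add(label)
print(root.arcs["l"].arcs["o"].finished)  # True: "lo" is in the vocabulary
print(root.arcs["l"].finished)            # False: "l" is only a prefix
```

Storing the accumulated prefix on each node means a search can read off the matched subword directly when it reaches a finished node.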
- class returnn.util.bpe.Hyp(bpe_sym_history: List[str], cur_node: PrefixTree)
Represents a hypothesis in the search.
- cur_node: PrefixTree
- class returnn.util.bpe.CharSyncSearch(bpe: PrefixTree, word: str, word_pos: int = 0)
Covers the search hyps and the search itself.
- Parameters:
bpe
word
word_pos
- class returnn.util.bpe.HypInPos(bpe_sym_history: List[str], cur_node: PrefixTree, pos: int)
Represents a hypothesis in the search.
- cur_node: PrefixTree
- class returnn.util.bpe.DepthFirstSearch(bpe: PrefixTree, word: str, sampler: Callable[[], bool] | None = None)
Depth-first search.
- Parameters:
bpe
word
sampler
- class returnn.util.bpe.SamplingBytePairEncoder(*, labels: List[str], breadth_prob: float, rnd: RandomState, unknown_label: str | None = None, opts: BpeOpts)
Will randomly sample from any possible BPE split.
- Parameters:
labels – vocab
breadth_prob – 1.0 leads to breadth-first search, 0.0 to depth-first search. Other values are stochastic.
rnd
unknown_label
opts
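The idea of sampling from the possible splits of a word can be sketched with a randomized depth-first search with backtracking over a plain vocabulary set. This is a deliberately simpler technique than the documented one: it does not implement the `breadth_prob` mechanism, it uses the standard library's `random.Random` rather than a NumPy `RandomState`, and the function name `sample_split` is hypothetical.

```python
import random
from typing import List, Optional, Set


def sample_split(word: str, vocab: Set[str], rnd: random.Random) -> Optional[List[str]]:
    """Return one randomly chosen split of ``word`` into vocab pieces, or None."""
    if not word:
        return []
    # All vocabulary entries that match a prefix of the remaining characters.
    candidates = [word[:n] for n in range(1, len(word) + 1) if word[:n] in vocab]
    rnd.shuffle(candidates)  # the random order is what makes the result a sample
    for piece in candidates:
        rest = sample_split(word[len(piece):], vocab, rnd)
        if rest is not None:  # otherwise backtrack and try the next candidate
            return [piece] + rest
    return None  # no split exists; a caller could emit an unknown label here


vocab = {"l", "o", "w", "e", "r", "lo", "low", "er"}
print(sample_split("lower", vocab, random.Random(0)))
```

Different seeds can yield different splits of the same word, such as `low er` or `lo w e r`. The sketch does not draw uniformly over all valid splits, since the distribution depends on the branching at each position.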