returnn.util.bpe
Provide basic Byte-Pair-Encoding (BPE) utilities.
- class returnn.util.bpe.StandardBytePairEncoder(bpe_codes_file, labels=None)
Code is partly taken from subword-nmt/apply_bpe.py. Author: Rico Sennrich, code under MIT license.
Uses the merge operations learned with learn_bpe.py to encode new text. The encoded text will not be shorter, but it uses only a fixed vocabulary, with rare words encoded as variable-length sequences of subword units.
Reference: Rico Sennrich, Barry Haddow and Alexandra Birch (2016). Neural Machine Translation of Rare Words with Subword Units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). Berlin, Germany.
- Parameters:
bpe_codes_file (str) – codes file
labels (list[str]|None) – vocab
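
The encoding step described above can be illustrated with a minimal, self-contained sketch. This is not the RETURNN or subword-nmt implementation: `apply_merges` and its `merges` argument are hypothetical names, and the sketch assumes the subword-nmt conventions of an internal `</w>` end-of-word marker and an `@@` suffix on non-final units.

```python
END = "</w>"  # assumed end-of-word marker, following the subword-nmt convention


def apply_merges(word, merges):
    """Greedily apply ranked BPE merges to a single word (hypothetical helper).

    `merges` is an ordered list of symbol pairs; earlier pairs have priority.
    """
    if not word:
        return []
    ranks = {pair: i for i, pair in enumerate(merges)}
    # Start from single characters, with the end marker glued to the last one.
    symbols = list(word[:-1]) + [word[-1] + END]
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no learned merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    # Drop the end marker and flag every non-final unit with "@@".
    symbols[-1] = symbols[-1][: -len(END)]
    return [s + "@@" for s in symbols[:-1]] + [symbols[-1]]


merges = [("l", "o"), ("lo", "w" + END), ("e", "r" + END)]
print(apply_merges("lower", merges))
```

A rare word such as `lower` is not in the merge table as a whole, so it comes out as several subword units, which is why the encoded text gets longer rather than shorter.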
- class returnn.util.bpe.PrefixTree(prefix='', root=None)
Prefix tree / trie. This class represents both a single node and the tree.
- Parameters:
prefix (str) –
root (PrefixTree|None) –
- add(postfix, root=None)
- Parameters:
postfix (str) –
root (None|PrefixTree) –
- Return type:
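
The idea of one class acting as both node and tree can be sketched as follows. `TrieNode`, `arcs`, and `finished` are hypothetical names chosen for illustration; the attributes and return value of the actual `PrefixTree` are not documented here and may differ.

```python
class TrieNode:
    """Minimal trie in which every node is itself a (sub)tree. Illustrative only."""

    def __init__(self, prefix=""):
        self.prefix = prefix    # string spelled out on the path from the root
        self.arcs = {}          # next character -> child TrieNode
        self.finished = False   # True if `prefix` itself was added as an entry

    def add(self, postfix):
        """Insert the remaining characters below this node and return the end node."""
        if not postfix:
            self.finished = True
            return self
        head, tail = postfix[0], postfix[1:]
        if head not in self.arcs:
            self.arcs[head] = TrieNode(self.prefix + head)
        return self.arcs[head].add(tail)


root = TrieNode()
for unit in ["lo", "low", "er"]:
    root.add(unit)
print(sorted(root.arcs))  # first characters of all stored entries
```

Calling `add` on the root inserts a whole string, while the recursive calls on child nodes insert only the remaining postfix, which matches the `prefix` / `postfix` split in the signatures above.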
- class returnn.util.bpe.Hyp(bpe_sym_history, cur_node)
Represents a hypothesis in the search.
- Parameters:
bpe_sym_history (list[str]) –
cur_node (PrefixTree) –
- class returnn.util.bpe.CharSyncSearch(bpe, word, word_pos=0)
Covers the search hyps and the search itself.
- Parameters:
bpe (PrefixTree) –
word (str) –
word_pos (int) –
- class returnn.util.bpe.HypInPos(bpe_sym_history, cur_node, pos)
Represents a hypothesis in the search.
- Parameters:
bpe_sym_history (list[str]) –
cur_node (PrefixTree) –
pos (int) –
- class returnn.util.bpe.DepthFirstSearch(bpe, word, sampler=None)
Depth-first search.
- Parameters:
bpe (PrefixTree) –
word (str) –
sampler ((()->bool)|None) –
- class returnn.util.bpe.SamplingBytePairEncoder(labels, breadth_prob, rnd, unknown_label=None)
Will randomly sample from any possible BPE split.
- Parameters:
labels (list[str]) – vocab
breadth_prob (float) – 1.0 leads to breadth-first search, 0.0 to depth-first search. Other values are stochastic.
rnd (numpy.random.RandomState) –
unknown_label (str|None) –
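
To make "any possible BPE split" concrete, the sketch below enumerates every segmentation of a word over a given vocabulary and draws one of them uniformly. It is a simplification under stated assumptions, not the RETURNN algorithm: it does not model `breadth_prob`, the helper names `all_splits` and `sample_split` are hypothetical, the stdlib `random.Random` stands in for `numpy.random.RandomState`, non-final units are assumed to carry the `@@` suffix, and falling back to `unknown_label` when no split exists is also an assumption.

```python
import random


def all_splits(word, vocab):
    """Enumerate every split of `word` into units from `vocab`, depth-first."""
    results = []
    stack = [(0, [])]  # (position in word, units chosen so far)
    while stack:
        pos, history = stack.pop()
        if pos == len(word):
            results.append(history)
            continue
        for end in range(pos + 1, len(word) + 1):
            piece = word[pos:end]
            # Assumed convention: non-final units end in "@@".
            label = piece if end == len(word) else piece + "@@"
            if label in vocab:
                stack.append((end, history + [label]))
    return results


def sample_split(word, vocab, rnd, unknown_label=None):
    """Pick one split uniformly at random (illustrative, ignores breadth_prob)."""
    splits = sorted(all_splits(word, vocab))  # sorted for reproducibility
    if not splits:
        # Assumed fallback when the vocabulary cannot cover the word.
        return [unknown_label] if unknown_label is not None else []
    return rnd.choice(splits)


vocab = {"l@@", "o@@", "lo@@", "w", "low"}
print(sorted(all_splits("low", vocab)))
print(sample_split("low", vocab, random.Random(0)))
```

Enumerating all splits is exponential in the worst case, which is one reason a real implementation would search and sample incrementally rather than materialize the full list.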