returnn.util.bpe

Provide basic Byte-Pair-Encoding (BPE) utilities.

class returnn.util.bpe.StandardBytePairEncoder(bpe_codes_file, labels=None)

Code is partly taken from subword-nmt/apply_bpe.py. Author: Rico Sennrich, code under MIT license.

Uses operations learned with learn_bpe.py to encode new text. The encoded text will not be shorter, but it will use only a fixed vocabulary, with rare words encoded as variable-length sequences of subword units.

Reference: Rico Sennrich, Barry Haddow and Alexandra Birch (2016). Neural Machine Translation of Rare Words with Subword Units. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016). Berlin, Germany.

Parameters:
  • bpe_codes_file (str) – path to the BPE codes file

  • labels (list[str]|None) – vocabulary

segment_sentence(sentence)

Segment single sentence (whitespace-tokenized string) with BPE encoding.

Parameters:

sentence (str) – whitespace-tokenized sentence

Return type:

list[str]
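The merge-based encoding described above can be illustrated with a minimal, self-contained sketch. It does not use the RETURNN class. The function names, the inline list of merges (standing in for a codes file), the `</w>` end-of-word marker and the `@@` joiner follow subword-nmt conventions and are assumptions here; the optional vocabulary filter (`labels`) is not modelled.

```python
def toy_apply_merges(word, merges, end_marker="</w>"):
    """Apply an ordered list of BPE merge pairs to a single word (toy version)."""
    # Earlier entries in `merges` have higher priority (lower rank).
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    # Start from single characters; the last one carries the end-of-word marker.
    symbols = list(word[:-1]) + [word[-1] + end_marker]
    while len(symbols) > 1:
        pairs = list(zip(symbols, symbols[1:]))
        best = min(pairs, key=lambda pair: ranks.get(pair, float("inf")))
        if best not in ranks:
            break  # no learned merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    # Drop the end-of-word marker again.
    symbols[-1] = symbols[-1][: -len(end_marker)]
    return symbols


def toy_segment_sentence(sentence, merges, joiner="@@"):
    """Segment a whitespace-tokenized sentence; non-final pieces get the joiner."""
    output = []
    for word in sentence.split():
        pieces = toy_apply_merges(word, merges)
        output.extend(piece + joiner for piece in pieces[:-1])
        output.append(pieces[-1])
    return output


# Hypothetical merge operations, in the order they were "learned".
toy_merges = [("l", "o"), ("lo", "w</w>"), ("e", "r</w>")]
print(toy_segment_sentence("low lower", toy_merges))
```

The frequent word stays whole, while the rarer one is broken into several subword units, so the output has more tokens than the input but draws only on a fixed symbol inventory.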

class returnn.util.bpe.PrefixTree(prefix='', root=None)

Prefix tree / trie. This class represents both a single node and the tree.

Parameters:
  • prefix (str) –

  • root (PrefixTree|None) –

add(postfix, root=None)

Parameters:
  • postfix (str) –

  • root (PrefixTree|None) –

Return type:

PrefixTree
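To make the "node and tree in one class" idea concrete, here is a minimal trie sketch. `ToyPrefixTree`, its attributes (`arcs`, `finished`) and the single-argument `add` are hypothetical simplifications for illustration, not the RETURNN implementation.

```python
class ToyPrefixTree:
    """A node that is also usable as the whole tree (illustrative only)."""

    def __init__(self, prefix="", root=None):
        self.prefix = prefix  # string spelled out on the path from the root
        self.root = root if root is not None else self
        self.arcs = {}  # next character -> child node
        self.finished = False  # True if `prefix` is a complete entry

    def add(self, postfix):
        """Insert the remaining characters below this node; return the final node."""
        if not postfix:
            self.finished = True
            return self
        char = postfix[0]
        child = self.arcs.get(char)
        if child is None:
            child = ToyPrefixTree(prefix=self.prefix + char, root=self.root)
            self.arcs[char] = child
        return child.add(postfix[1:])


tree = ToyPrefixTree()
node = tree.add("low")
print(node.prefix, node.finished, node.root is tree)
```

Because every node stores the prefix it represents, a search can read off the matched symbol directly from the node it has reached.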

class returnn.util.bpe.Hyp(bpe_sym_history, cur_node)

Represents a hypothesis in the search.

Parameters:
  • bpe_sym_history (list[str]) –

  • cur_node (PrefixTree) –

class returnn.util.bpe.CharSyncSearch(bpe, word, word_pos=0)

Holds the search hypotheses and runs the search itself.

Parameters:
  • bpe (PrefixTree) –

  • word (str) –

  • word_pos (int) –

search()

Returns:

collection of possible BPE symbol seqs

Return type:

list[list[str]]
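A character-synchronous search of this kind can be sketched as follows: all hypotheses advance through the word one character at a time, and each may either close its current symbol or keep extending it. This is a self-contained illustration under assumptions, not the class above: it takes a plain list of labels instead of a PrefixTree, uses a set of label prefixes for pruning, and assumes that non-final symbols end in the `@@` joiner.

```python
def toy_all_bpe_splits(word, labels, joiner="@@"):
    """Return every way to spell `word` with symbols from `labels`."""
    vocab = set(labels)
    # All prefixes of all labels, used to drop partial symbols that cannot complete.
    prefixes = {label[:i] for label in vocab for i in range(1, len(label) + 1)}
    # A hypothesis is (finished symbols so far, characters of the symbol in progress).
    hyps = [([], "")]
    for pos, char in enumerate(word):
        is_last = pos == len(word) - 1
        next_hyps = []
        for history, partial in hyps:
            partial = partial + char
            if is_last:
                # The final symbol of a word carries no joiner.
                if partial in vocab:
                    next_hyps.append((history + [partial], ""))
                continue
            # Option 1: close the current symbol here (non-final, so with joiner).
            if partial + joiner in vocab:
                next_hyps.append((history + [partial + joiner], ""))
            # Option 2: keep growing the current symbol, if any label starts with it.
            if partial in prefixes:
                next_hyps.append((history, partial))
        hyps = next_hyps
    return [history for history, _ in hyps]


toy_labels = ["l@@", "o@@", "w", "lo@@", "low", "ow"]
for split in toy_all_bpe_splits("low", toy_labels):
    print(split)
```

All surviving hypotheses end at the same character position, which is what distinguishes this scheme from a depth-first search that follows one candidate to the end before trying another.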

class returnn.util.bpe.HypInPos(bpe_sym_history, cur_node, pos)

Represents a hypothesis in the search.

Parameters:
  • bpe_sym_history (list[str]) –

  • cur_node (PrefixTree) –

  • pos (int) –

class returnn.util.bpe.DepthFirstSearch(bpe, word, sampler=None)

Depth-first search.

Parameters:
  • bpe (PrefixTree) –

  • word (str) –

  • sampler ((()->bool)|None) –

search()

Returns:

BPE symbol seq if one is found

Return type:

list[str]|None
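The following sketch shows one way a depth-first search with an optional `() -> bool` sampler could find a single split. It is an illustration under assumptions: it works on a plain label list rather than a PrefixTree, assumes the `@@` joiner on non-final symbols, and the way the sampler flips the order in which candidates are tried is a guess at the intent, not the documented behaviour.

```python
def toy_dfs_bpe_split(word, labels, sampler=None, joiner="@@"):
    """Return the first complete split found by depth-first search, or None."""
    vocab = set(labels)
    # Stack of (position in word, symbols chosen so far).
    stack = [(0, [])]
    while stack:
        pos, history = stack.pop()
        if pos == len(word):
            return history  # the whole word is covered
        candidates = []
        for end in range(pos + 1, len(word) + 1):
            piece = word[pos:end]
            symbol = piece if end == len(word) else piece + joiner
            if symbol in vocab:
                candidates.append((end, history + [symbol]))
        # The longest candidate is pushed last and therefore tried first,
        # unless the sampler asks for the opposite order at this step.
        if sampler is not None and sampler():
            candidates.reverse()
        stack.extend(candidates)
    return None  # no sequence of known symbols spells the word


toy_vocab = ["l@@", "o@@", "w", "lo@@", "low", "ow"]
print(toy_dfs_bpe_split("low", toy_vocab))
print(toy_dfs_bpe_split("low", toy_vocab, sampler=lambda: True))
print(toy_dfs_bpe_split("lox", toy_vocab))
```

Unlike an exhaustive search, this stops at the first complete split, so the order in which candidates are pushed fully determines the result.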

class returnn.util.bpe.SamplingBytePairEncoder(labels, breadth_prob, rnd, unknown_label=None)

Randomly samples one of the possible BPE splits.

Parameters:
  • labels (list[str]) – vocabulary

  • breadth_prob (float) – 1.0 will lead to breadth-first search, 0.0 to depth-first search. Other values are stochastic.

  • rnd (numpy.random.RandomState) –

  • unknown_label (str|None) –

get_bpe_split_for_word(word)

Parameters:

word (str) –

Return type:

list[str]|None

segment_sentence(sentence)

Segment single sentence (whitespace-tokenized string) with BPE encoding.

Parameters:

sentence (str) – whitespace-tokenized sentence

Return type:

list[str]
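Two pieces of this class lend themselves to a short sketch: turning `breadth_prob` and a random generator into a boolean sampler, and handling words for which no split exists. Everything below is hypothetical and self-contained. It uses the standard-library `random.Random` in place of `numpy.random.RandomState`, a lookup table in place of a real per-word search, and it assumes that an unsplittable word is replaced by `unknown_label` when one is given and raises an error otherwise.

```python
import random


def toy_make_sampler(breadth_prob, rnd):
    """Return a () -> bool callable that is True with probability `breadth_prob`."""
    # rnd.random() lies in [0.0, 1.0), so 1.0 is always True and 0.0 never is.
    return lambda: rnd.random() < breadth_prob


def toy_segment_with_fallback(sentence, split_word, unknown_label=None):
    """Segment a sentence with a per-word splitter that may return None."""
    output = []
    for word in sentence.split():
        symbols = split_word(word)
        if symbols is None:
            if unknown_label is None:
                raise ValueError("no BPE split for word %r" % word)
            symbols = [unknown_label]  # assumed fallback behaviour
        output.extend(symbols)
    return output


# Stand-in for a real per-word search: a fixed table of known splits.
toy_splits = {"low": ["low"], "lower": ["lo@@", "w@@", "er"]}

always = toy_make_sampler(1.0, random.Random(42))
never = toy_make_sampler(0.0, random.Random(42))
print(always(), never())
print(toy_segment_with_fallback("low xyz lower", toy_splits.get, unknown_label="<unk>"))
```

Passing the generator in explicitly keeps the sampling reproducible: the same seed yields the same sequence of decisions.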