returnn.datasets.text_dict

TextDictDataset

class returnn.datasets.text_dict.TextDictDataset(*, filename: str, item_format: str = 'list_with_scores', vocab: Vocabulary | Dict[str, Any], **kwargs)

This dataset reads files in the format usually generated by RETURNN search, i.e. with a beam (item_format = “list_with_scores”):

{
    seq_tag: [(score1, text1), (score2, text2), ...],
    ...
}

Or without a beam (item_format = “single”):

{
    seq_tag: text,
    ...
}
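The two layouts above can be illustrated with a small round trip. This is a minimal sketch, assuming the file is a plain Python dict literal that `ast.literal_eval` can parse; the helper names `write_text_dict` and `read_text_dict`, the tags, scores and texts are all made up for illustration, and the dataset itself may parse the file differently.

```python
import ast
import gzip
import os
import tempfile

# Hypothetical example content; tags, scores and texts are made up.
beam_dict = {
    "seq-0": [(-1.5, "hello world"), (-2.0, "hello word")],
    "seq-1": [(-0.7, "good morning")],
}
# The "single" layout keeps only one text per sequence tag.
single_dict = {tag: hyps[0][1] for tag, hyps in beam_dict.items()}


def write_text_dict(path, content):
    """Write a dict as a Python literal, one entry per line (gzipped if path ends in .gz)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "wt", encoding="utf-8") as f:
        f.write("{\n")
        for tag, value in content.items():
            f.write(f"{tag!r}: {value!r},\n")
        f.write("}\n")


def read_text_dict(path):
    """Read the dict literal back (assumption: the file is literal_eval-compatible)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        return ast.literal_eval(f.read())


tmp_dir = tempfile.mkdtemp()
beam_path = os.path.join(tmp_dir, "beam.py.gz")
single_path = os.path.join(tmp_dir, "single.py")
write_text_dict(beam_path, beam_dict)
write_text_dict(single_path, single_dict)
beam_loaded = read_text_dict(beam_path)
single_loaded = read_text_dict(single_path)
```

Writing one entry per line keeps the file readable and easy to diff, which is convenient when inspecting search output by hand.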

The data keys:

  • data – the single (or best) sequence (encoded via vocab).

  • data_flat – for list_with_scores, all sequences concatenated (encoded via vocab), in the given order.

  • data_seq_lens – for list_with_scores, the sequence lengths of each seq in data_flat.

  • scores – for list_with_scores, the scores of each seq in data_flat.
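How the data keys relate to one list_with_scores entry may be easier to see in a small sketch. It is not the RETURNN implementation: the character-level vocabulary and the `make_entry` helper are hypothetical, and picking the highest-scoring hypothesis as the "best" sequence is an assumption (in this example it is also the first entry, so either reading gives the same result).

```python
# Toy character-level vocabulary (hypothetical; a real setup would use a RETURNN Vocabulary).
labels = sorted(set("abcdefghijklmnopqrstuvwxyz "))
label_to_idx = {c: i for i, c in enumerate(labels)}


def encode(text):
    """Map each character to its label index."""
    return [label_to_idx[c] for c in text]


def make_entry(hyps):
    """Build the four data keys from one [(score, text), ...] beam."""
    seqs = [encode(text) for _, text in hyps]
    # Assumption: "best" means the hypothesis with the highest score.
    best_idx = max(range(len(hyps)), key=lambda i: hyps[i][0])
    return {
        "data": seqs[best_idx],
        # All hypotheses concatenated, in the order given in the file.
        "data_flat": [label for seq in seqs for label in seq],
        # One length per hypothesis, so data_flat can be split again.
        "data_seq_lens": [len(seq) for seq in seqs],
        "scores": [score for score, _ in hyps],
    }


entry = make_entry([(-1.5, "ab c"), (-2.0, "ab")])
```

Since sum(data_seq_lens) equals len(data_flat), a consumer can recover the individual hypotheses by slicing data_flat with the cumulative lengths.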

Parameters:
  • filename – text dict file. Can be gzipped.

  • item_format – “list_with_scores” or “single”

  • vocab – to encode the text as a label sequence. See Vocabulary.create_vocab.
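As a rough sketch, the parameters above could be combined in a RETURNN-style dataset dict like the following. The file names are placeholders, and the vocab options shown (`vocab_file`, `unknown_label`) are an assumption about what Vocabulary.create_vocab accepts in a given setup; check them against the vocabulary class actually in use.

```python
# Hypothetical dataset config; paths and vocab options are placeholders.
text_dict_dataset = {
    "class": "TextDictDataset",
    "filename": "search-output.py.gz",  # text dict file, may be gzipped
    "item_format": "list_with_scores",  # or "single"
    "vocab": {
        # Assumed options, passed on to Vocabulary.create_vocab.
        "vocab_file": "vocab.txt",
        "unknown_label": "<unk>",
    },
}
```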

init_seq_order(epoch=None, seq_list=None, seq_order=None)

Initializes the sequence order, e.g. at the start of an epoch.

supports_sharding() → bool
Returns:

whether this dataset supports sharding

supports_seq_order_sorting() → bool
Returns:

whether this dataset supports sorting of the sequence order

get_current_seq_order() → Sequence[int]
Returns:

seq order

have_corpus_seq_idx() → bool
Returns:

whether we can use get_corpus_seq_idx()

get_corpus_seq_idx(seq_idx: int) → int
Parameters:

seq_idx

get_tag(seq_idx: int) → str
Parameters:

seq_idx

Returns:

seq tag

get_all_tags() → List[str]
Returns:

all tags

get_total_num_seqs(*, fast: bool = False) → int
Returns:

total num seqs in dataset (not for (sub)epoch)

get_data_dim(key: str) → int
Returns:

dim of data entry with key

get_data_dtype(key: str) → str
Returns:

dtype of data entry with key

get_data_keys() → List[str]
Returns:

available data keys

get_data_shape(key: str) → List[str]
Returns:

get_data(*, key).shape[1:], i.e. num-frames excluded
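What "num-frames excluded" means for a shape can be shown with plain tuples. This is only an illustration of the `shape[1:]` convention, not RETURNN code; the helper `shape_without_time` and the example shapes are hypothetical.

```python
def shape_without_time(shape):
    """Drop the leading (num-frames) axis from a full array shape."""
    return list(shape[1:])


# A label sequence of 7 frames stored as indices is 1-D: shape (7,).
label_seq_shape = (7,)
# A hypothetical dense sequence: 7 frames of 40-dim vectors.
feature_seq_shape = (7, 40)

label_rest = shape_without_time(label_seq_shape)
feature_rest = shape_without_time(feature_seq_shape)
print(label_rest)    # []
print(feature_rest)  # [40]
```

For sequences of label indices, the remaining shape is therefore empty, because the only axis is the frame axis.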

is_data_sparse(key: str) → bool
Returns:

whether data entry with key is sparse