returnn.datasets.text_dict
- class returnn.datasets.text_dict.TextDictDataset(*, filename: str, item_format: str = 'list_with_scores', vocab: Vocabulary | Dict[str, Any], **kwargs)
This dataset reads files in the format usually generated by RETURNN search, i.e. with a beam (item_format = "list_with_scores"):
{ seq_tag: [(score1, text1), (score2, text2), ...], ... }
Or without a beam (item_format = "single"):
{ seq_tag: text, ... }
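To make the two layouts concrete, here is a minimal sketch that writes and reads back a toy file of each kind. It assumes the file is a plain Python dict literal (optionally gzipped); the helper names `write_text_dict` and `read_text_dict`, the sequence tags, the scores, and the file names are all made up for illustration and are not part of RETURNN.

```python
import ast
import gzip
import os
import tempfile

# Toy search output with a beam of two hypotheses per sequence.
# Tags, scores and texts are invented for this example.
beam_results = {
    "corpus/seq-0": [(-1.5, "hello world"), (-2.75, "hello word")],
    "corpus/seq-1": [(-0.5, "good morning"), (-3.0, "good mourning")],
}

# The same content without a beam: keep only the text of the first hypothesis.
single_results = {tag: hyps[0][1] for tag, hyps in beam_results.items()}


def write_text_dict(path, content):
    """Hypothetical helper: write a dict as a Python literal, gzipped if the path ends in .gz."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "wt", encoding="utf-8") as f:
        f.write("{\n")
        for seq_tag, item in content.items():
            f.write(f"{seq_tag!r}: {item!r},\n")
        f.write("}\n")


def read_text_dict(path):
    """Hypothetical helper: parse the file back into a dict (assumes a plain Python literal)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        return ast.literal_eval(f.read())


tmp_dir = tempfile.mkdtemp()
beam_path = os.path.join(tmp_dir, "search_out_beam.py.gz")
single_path = os.path.join(tmp_dir, "search_out_single.py")

write_text_dict(beam_path, beam_results)      # item_format = "list_with_scores"
write_text_dict(single_path, single_results)  # item_format = "single"

loaded_beam = read_text_dict(beam_path)
loaded_single = read_text_dict(single_path)
```

Using `ast.literal_eval` here is a choice made for this sketch so that only literals are parsed; it says nothing about how the dataset itself loads the file.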
The data keys:
data: the single (or best) sequence (encoded via vocab).
data_flat: for list_with_scores, all sequences concatenated (encoded via vocab), in the given order.
data_seq_lens: for list_with_scores, the sequence length of each seq in data_flat.
scores: for list_with_scores, the score of each seq in data_flat.
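As an illustration of how these keys relate to one another, the following sketch builds them from a single list_with_scores entry. It is not the dataset's implementation: `toy_encode` is a hypothetical stand-in for the vocab (UTF-8 bytes as labels), plain lists stand in for arrays, and treating the highest-scoring hypothesis as the "best" one is an assumption of this example.

```python
def toy_encode(text):
    """Hypothetical stand-in for a vocab: use the UTF-8 byte values as labels."""
    return list(text.encode("utf8"))


def make_data_keys(hyps, encode=toy_encode):
    """Build the four data keys from one [(score, text), ...] entry (illustrative only)."""
    encoded = [encode(text) for _, text in hyps]
    scores = [float(score) for score, _ in hyps]
    # Assumption of this sketch: "best" means the hypothesis with the highest score.
    best_idx = max(range(len(scores)), key=lambda i: scores[i])
    return {
        "data": encoded[best_idx],
        # All hypotheses concatenated, in the order they appear in the file.
        "data_flat": [label for seq in encoded for label in seq],
        # One length per hypothesis, so data_flat can be split again.
        "data_seq_lens": [len(seq) for seq in encoded],
        "scores": scores,
    }


keys = make_data_keys([(-1.5, "ab"), (-0.5, "abc")])

# Recover the individual hypotheses from data_flat via data_seq_lens.
recovered, offset = [], 0
for seq_len in keys["data_seq_lens"]:
    recovered.append(keys["data_flat"][offset:offset + seq_len])
    offset += seq_len
```

The point of `data_seq_lens` is visible in the last loop: without it, the boundaries between hypotheses in `data_flat` could not be reconstructed.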
- Parameters:
filename – path to the text dict file. Can be gzipped.
item_format – "list_with_scores" or "single"
vocab – to encode the text as a label sequence. See Vocabulary.create_vocab.
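A dataset entry in a RETURNN config using these parameters might look like the sketch below. The file path is hypothetical, and the vocab options are an assumption: the example selects a byte-level vocabulary by class name, but the options that Vocabulary.create_vocab accepts should be checked against its own documentation.

```python
# Hypothetical dataset entry for a RETURNN config (a sketch, not a verified setup).
search_output_dataset = {
    "class": "TextDictDataset",
    "filename": "search_out.py.gz",      # hypothetical path; may be gzipped
    "item_format": "list_with_scores",   # or "single" for files without a beam
    # Assumption: vocab options are forwarded to Vocabulary.create_vocab,
    # here selecting a byte-level vocabulary by class name.
    "vocab": {"class": "Utf8ByteTargets"},
}
```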
- have_corpus_seq_idx() → bool
- Returns:
whether we can use get_corpus_seq_idx()
- get_total_num_seqs(*, fast: bool = False) → int
- Returns:
total number of seqs in the dataset (not per (sub)epoch)