`returnn.datasets.text_dict`

TextDictDataset

class returnn.datasets.text_dict.TextDictDataset(*, filename: str, item_format: str = 'list_with_scores', vocab: Vocabulary | Dict[str, Any], **kwargs)

This dataset reads files in the format usually generated by RETURNN search. With a beam (item_format = "list_with_scores"), the file looks like:

{
    seq_tag: [(score1, text1), (score2, text2), ...],
    ...
}

Without a beam (item_format = "single"), it looks like:

{
    seq_tag: text,
    ...
}
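
The two layouts above can be tried out with a small round trip. This is a minimal sketch that does not use RETURNN itself: it assumes the file is a plain Python dict literal (as the tuple syntax above suggests) and parses it with `ast.literal_eval`. The helper names `write_text_dict` and `read_text_dict`, the seq tags, and the scores are all made up for illustration.

```python
import ast
import gzip
import os
import tempfile

# Hypothetical search output: seq tag -> list of (score, text) hypotheses.
beam_results = {
    "corpus/seq-0": [(-1.2, "hello world"), (-3.5, "hello word")],
    "corpus/seq-1": [(-0.7, "good morning")],
}


def write_text_dict(path, content):
    """Write a dict as a Python literal, gzipped if the path ends in .gz."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "wt", encoding="utf-8") as f:
        f.write("{\n")
        for seq_tag, item in content.items():
            f.write(f"{seq_tag!r}: {item!r},\n")
        f.write("}\n")


def read_text_dict(path):
    """Parse the file back into a dict (assumes a Python dict literal)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        return ast.literal_eval(f.read())


tmp_dir = tempfile.mkdtemp()

# item_format = "list_with_scores"
beam_path = os.path.join(tmp_dir, "search_out.py.gz")
write_text_dict(beam_path, beam_results)
loaded_beam = read_text_dict(beam_path)

# item_format = "single": keep only the text of the first hypothesis per seq.
single_results = {tag: hyps[0][1] for tag, hyps in beam_results.items()}
single_path = os.path.join(tmp_dir, "search_out_single.py")
write_text_dict(single_path, single_results)
loaded_single = read_text_dict(single_path)
```

Files written this way have the same layout as the two examples above, so they may be useful as small test fixtures.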

The data keys:

data: the single (or best) sequence (encoded via vocab)
data_flat: for list_with_scores, all sequences concatenated (encoded via vocab), in the given order
data_seq_lens: for list_with_scores, the sequence length of each seq in data_flat
scores: for list_with_scores, the score of each seq in data_flat
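
To illustrate how the data keys relate to each other, here is a rough sketch in plain Python for one list_with_scores entry. It is not the dataset's implementation: `toy_vocab`, `encode`, and `make_data_keys` are hypothetical, the whitespace tokenization stands in for the real vocab, and picking the highest score as the "best" sequence is an assumption.

```python
# Toy whitespace vocabulary (hypothetical; a real setup uses the configured vocab).
toy_vocab = {"<unk>": 0, "hello": 1, "world": 2, "word": 3}


def encode(text):
    """Map each whitespace-separated token to a label index."""
    return [toy_vocab.get(tok, toy_vocab["<unk>"]) for tok in text.split()]


def make_data_keys(item):
    """item: list of (score, text), as in item_format="list_with_scores"."""
    encoded = [encode(text) for _, text in item]
    scores = [score for score, _ in item]
    # Assumption: "best" means the highest-scoring hypothesis.
    best_idx = max(range(len(scores)), key=scores.__getitem__)
    return {
        "data": encoded[best_idx],
        "data_flat": [label for seq in encoded for label in seq],
        "data_seq_lens": [len(seq) for seq in encoded],
        "scores": scores,
    }


example = make_data_keys(
    [(-1.2, "hello world"), (-3.5, "hello word"), (-4.0, "hello")]
)

# data_seq_lens allows splitting data_flat back into the individual hypotheses.
hyps, pos = [], 0
for seq_len in example["data_seq_lens"]:
    hyps.append(example["data_flat"][pos : pos + seq_len])
    pos += seq_len
```

The final loop shows why data_seq_lens is provided alongside data_flat: without the lengths, the concatenated label sequence could not be separated into hypotheses again.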

Parameters:

filename – text dict file. Can be gzipped.
item_format – "list_with_scores" or "single"
vocab – to encode the text as a label sequence. See Vocabulary.create_vocab.
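
As a sketch of how these parameters might be put together, the following dataset dict follows the usual RETURNN convention of selecting the dataset via a "class" entry. The variable name and both paths are placeholders, and the vocab options shown (`vocab_file`, `unknown_label`) are an assumption about what Vocabulary.create_vocab accepts; check that function for the options of your vocabulary type.

```python
# Hypothetical dataset dict for a RETURNN config. Paths are placeholders.
text_dict_dataset = {
    "class": "TextDictDataset",
    "filename": "/path/to/search_out.py.gz",  # may be gzipped
    "item_format": "list_with_scores",  # or "single"
    # Assumed vocab options, passed on to Vocabulary.create_vocab.
    "vocab": {
        "vocab_file": "/path/to/vocab.txt",
        "unknown_label": "<unk>",
    },
}
```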

init_seq_order(epoch=None, seq_list=None, seq_order=None)

Initialize the sequence order.

supports_sharding() → bool

Returns: whether this dataset supports sharding

supports_seq_order_sorting() → bool

Returns: whether this dataset supports sorting of the sequence order

get_current_seq_order() → Sequence[int]

Returns: the current sequence order

have_corpus_seq_idx() → bool

Returns: whether we can use get_corpus_seq_idx()

get_corpus_seq_idx(seq_idx: int) → int

Parameters: seq_idx

get_tag(seq_idx: int) → str

Parameters: seq_idx
Returns: seq tag

get_all_tags() → List[str]

Returns: all seq tags

get_total_num_seqs(*, fast: bool = False) → int

Returns: total number of sequences in the dataset (not just in the current (sub)epoch)

get_data_dim(key: str) → int

Returns: dim of the data entry with the given key

get_data_dtype(key: str) → str

Returns: dtype of the data entry with the given key

get_data_keys() → List[str]

Returns: available data keys

get_data_shape(key: str) → List[str]

Returns: get_data(*, key).shape[1:], i.e. num-frames excluded

is_data_sparse(key: str) → bool

Returns: whether the data entry with the given key is sparse
