returnn.datasets.cached
#
- class returnn.datasets.cached.CachedDataset(cache_byte_size=0, **kwargs)[source]#
Base class for datasets with caching. This is only used for the HDFDataset. Also see CachedDataset2.
- Parameters:
cache_byte_size (int) –
- init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]#
- Parameters:
seq_list (list[str]|None) – List of sequence tags, to set a predefined order.
seq_order (list[int]|None) – List of corpus sequence indices, to set a predefined order.
- Initialize lists:
self.seq_index # sorted seq idx
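The idea behind init_seq_order() can be sketched as follows. This is a hypothetical, minimal illustration of how a predefined seq_order maps to the sorted sequence index list; the function name and layout are assumptions for illustration, not RETURNN's actual implementation.

```python
# Hypothetical sketch (not the actual RETURNN code): build the sorted
# seq index list for one epoch, optionally from a predefined order.

def init_seq_index(num_seqs, seq_order=None):
    """Return the list of corpus seq indices in the order they are visited."""
    if seq_order is not None:
        # A predefined order: corpus seq indices for this epoch.
        return list(seq_order)
    # Default: identity order over all sequences.
    return list(range(num_seqs))

print(init_seq_index(5))                       # [0, 1, 2, 3, 4]
print(init_seq_index(5, seq_order=[3, 1, 4]))  # [3, 1, 4]
```

After this initialization, get_current_seq_order() (below) would report the resulting order for the current epoch.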
- get_current_seq_order()[source]#
- Returns:
Many datasets use self.get_seq_order_for_epoch(). This function returns the current seq order for the current epoch, after self.init_seq_order() was called. Not all datasets implement this.
- Return type:
Sequence[int]
- batch_set_generator_cache_whole_epoch()[source]#
The BatchSetGenerator can cache the list of batches which we generated across epochs. See self.generate_batches() and self._generate_batches(). In many cases, the dataset does not support this; then there is no need to enable this cache and waste memory. Caching it together with the option shuffle_batches could also mean that there will be self.load_seqs() calls with non-monotonic seq idxs. The only dataset which currently enables this is CachedDataset, and thus HDFDataset.
- Returns:
whether we should enable this cache
- Return type:
bool
- load_seqs(start, end)[source]#
Load data sequences. As a side effect, this will modify / fill up:
self.alloc_intervals
self.targets
This does some extra logic for the cache and calls self._load_seqs() for the real loading.
- Parameters:
start (int) – start sorted seq idx
end (int) – end sorted seq idx
- alloc_interval_index(ids)[source]#
- Parameters:
ids (int) – sorted seq idx
- Returns:
index in self.alloc_intervals
- Return type:
int
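The lookup above can be illustrated with a small sketch, assuming alloc_intervals is a sorted list of non-overlapping (start, end) pairs of sorted seq idxs with exclusive ends. This layout and the helper below are assumptions for illustration, not RETURNN's exact data structure.

```python
# Illustrative sketch: find which allocation interval contains a sorted
# seq idx, assuming a sorted list of (start, end) pairs (end exclusive).
import bisect

def alloc_interval_index(alloc_intervals, seq_idx):
    starts = [start for start, _end in alloc_intervals]
    # Rightmost interval whose start is <= seq_idx.
    i = bisect.bisect_right(starts, seq_idx) - 1
    if i >= 0 and alloc_intervals[i][0] <= seq_idx < alloc_intervals[i][1]:
        return i
    return -1  # seq idx is not in any allocated interval

intervals = [(0, 4), (10, 15)]
print(alloc_interval_index(intervals, 2))   # 0
print(alloc_interval_index(intervals, 12))  # 1
print(alloc_interval_index(intervals, 5))   # -1
```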
- delete(nframes)[source]#
- Parameters:
nframes (int|None) – maximum number of frames to delete. Note that this limit is not strict; we can end up deleting more than nframes.
- Returns:
number of frames deleted
- Return type:
int
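The non-strict limit can be understood with a hedged sketch of the eviction idea: cached entries are dropped whole, so the total freed can exceed nframes. The cache layout and function below are illustrative assumptions, not RETURNN's implementation.

```python
# Hypothetical sketch of cache eviction: drop cached entries until at
# least `nframes` frames are freed. Whole entries are dropped at once,
# so the limit is not strict (we may free more than requested).

def delete_frames(cached, nframes):
    """cached: list of (seq_idx, num_frames). Returns frames deleted."""
    deleted = 0
    while cached and (nframes is None or deleted < nframes):
        _seq_idx, frames = cached.pop(0)
        deleted += frames
    return deleted

cache = [(0, 100), (1, 50), (2, 75)]
print(delete_frames(cache, 120))  # 150: drops (0, 100) and (1, 50)
```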
- is_cached(start, end, blocking=False)[source]#
- Parameters:
start (int) – like in load_seqs(), sorted seq idx
end (int) – like in load_seqs(), sorted seq idx
- Returns:
whether we have the full range (start, end) of sorted seq idx cached in self.alloc_intervals (end is exclusive)
- Return type:
bool
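The coverage check can be sketched under the same hypothetical interval layout as above (sorted, non-overlapping (start, end) pairs, end exclusive): the range [start, end) is cached iff the intervals cover it without gaps. This is an illustrative assumption, not the actual RETURNN code.

```python
# Sketch: is the full range [start, end) of sorted seq idxs covered by
# the cached (start, end) intervals (sorted, non-overlapping, end excl.)?

def is_cached(alloc_intervals, start, end):
    pos = start  # first index not yet known to be cached
    for ival_start, ival_end in alloc_intervals:
        if ival_start <= pos < ival_end:
            pos = ival_end  # covered up to this interval's end
        if pos >= end:
            return True
    return pos >= end

intervals = [(0, 4), (4, 8)]
print(is_cached(intervals, 1, 6))   # True
print(is_cached(intervals, 1, 10))  # False
```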
- get_input_data(sorted_seq_idx)[source]#
- Return type:
numpy.ndarray
- Returns features:
format 2d (time,feature) (float)
- get_data_dim(key)[source]#
- Parameters:
key (str) – e.g. “data” or “classes”
- Returns:
number of classes, no matter if sparse or not
- Return type:
int
- get_targets(target, sorted_seq_idx)[source]#
- Parameters:
target (str) – data key
- Return type:
numpy.ndarray
- Returns targets:
format 1d (time) (int: idx of output-feature)
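The shapes documented for get_input_data() and get_targets() fit together as follows. The arrays here are dummy placeholders (shapes and dtypes assumed from the descriptions above, not real dataset output).

```python
# Illustrative shapes only: get_input_data() yields a 2-D float array
# (time, feature); get_targets() yields a 1-D int array (time,) of
# output-feature indices. Dummy data, assumed dtypes.
import numpy as np

features = np.zeros((120, 40), dtype="float32")  # 120 frames, 40 dims
targets = np.zeros((120,), dtype="int32")        # one class idx per frame

assert features.ndim == 2 and targets.ndim == 1
assert features.shape[0] == targets.shape[0]  # same time axis
```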
- get_target_list()[source]#
- Returns:
subset of get_data_keys(). Target keys are usually not available during inference.
- Return type:
list[str]