returnn.datasets.cached
#
- class returnn.datasets.cached.CachedDataset(cache_byte_size=0, **kwargs)[source]#
Base class for datasets with caching. This is only used for the HDFDataset. Also see CachedDataset2.
- Parameters:
cache_byte_size (int) –
- init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]#
- Parameters:
seq_list (list[str]|None) – List of sequence tags, to set a predefined order.
seq_order (list[int]|None) – List of corpus sequence indices, to set a predefined order.
- Initialize lists:
self.seq_index # sorted seq idx
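The idea behind init_seq_order() can be sketched as follows. This is a hypothetical, minimal illustration of how a predefined seq_order maps to the sorted sequence index list; the function name and layout are assumptions for illustration, not RETURNN's actual implementation.

```python
# Hypothetical sketch (not the actual RETURNN code): build the sorted
# seq index list for one epoch, optionally from a predefined order.

def init_seq_index(num_seqs, seq_order=None):
    """Return the list of corpus seq indices in the order they are visited."""
    if seq_order is not None:
        # A predefined order: corpus seq indices for this epoch.
        return list(seq_order)
    # Default: identity order over all sequences.
    return list(range(num_seqs))

print(init_seq_index(5))                       # [0, 1, 2, 3, 4]
print(init_seq_index(5, seq_order=[3, 1, 4]))  # [3, 1, 4]
```

After this initialization, get_current_seq_order() (below) would report the resulting order for the current epoch.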
- get_current_seq_order()[source]#
- Returns:
Many datasets use self.get_seq_order_for_epoch(). This function returns the current seq order for the current epoch, after self.init_seq_order() was called. Not all datasets implement this.
- Return type:
Sequence[int]
- batch_set_generator_cache_whole_epoch()[source]#
The BatchSetGenerator can cache the list of batches which we generated across epochs. See self.generate_batches() and self._generate_batches(). In many cases, the dataset does not support this; then there is no need to enable this cache and waste memory. Caching it together with the option shuffle_batches could also mean that there will be self.load_seqs() calls with non-monotonic seq idxs. The only dataset which currently enables this is CachedDataset, and thus HDFDataset.
- Returns:
whether we should enable this cache
- Return type:
bool
- load_seqs(start, end)[source]#
Load data sequences. As a side effect, this will modify / fill up:
self.alloc_intervals
self.targets
This does some extra logic for the cache and calls self._load_seqs() for the real loading.
- Parameters:
start (int) – start sorted seq idx
end (int) – end sorted seq idx
- alloc_interval_index(ids)[source]#
- Parameters:
ids (int) – sorted seq idx
- Returns:
index in self.alloc_intervals
- Return type:
int
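The lookup above can be illustrated with a small sketch, assuming alloc_intervals is a sorted list of non-overlapping (start, end) pairs of sorted seq idxs with exclusive ends. This layout and the helper below are assumptions for illustration, not RETURNN's exact data structure.

```python
# Illustrative sketch: find which allocation interval contains a sorted
# seq idx, assuming a sorted list of (start, end) pairs (end exclusive).
import bisect

def alloc_interval_index(alloc_intervals, seq_idx):
    starts = [start for start, _end in alloc_intervals]
    # Rightmost interval whose start is <= seq_idx.
    i = bisect.bisect_right(starts, seq_idx) - 1
    if i >= 0 and alloc_intervals[i][0] <= seq_idx < alloc_intervals[i][1]:
        return i
    return -1  # seq idx is not in any allocated interval

intervals = [(0, 4), (10, 15)]
print(alloc_interval_index(intervals, 2))   # 0
print(alloc_interval_index(intervals, 12))  # 1
print(alloc_interval_index(intervals, 5))   # -1
```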
- delete(nframes)[source]#
- Parameters:
nframes (int|None) – maximum number of frames to delete. Note that this limit is not strict; we can end up deleting more than nframes.
- Returns:
number of frames deleted
- Return type:
int
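The non-strict limit can be understood with a hedged sketch of the eviction idea: cached entries are dropped whole, so the total freed can exceed nframes. The cache layout and function below are illustrative assumptions, not RETURNN's implementation.

```python
# Hypothetical sketch of cache eviction: drop cached entries until at
# least `nframes` frames are freed. Whole entries are dropped at once,
# so the limit is not strict (we may free more than requested).

def delete_frames(cached, nframes):
    """cached: list of (seq_idx, num_frames). Returns frames deleted."""
    deleted = 0
    while cached and (nframes is None or deleted < nframes):
        _seq_idx, frames = cached.pop(0)
        deleted += frames
    return deleted

cache = [(0, 100), (1, 50), (2, 75)]
print(delete_frames(cache, 120))  # 150: drops (0, 100) and (1, 50)
```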
- is_cached(start, end, blocking=False)[source]#
- Parameters:
start (int) – like in load_seqs(), sorted seq idx
end (int) – like in load_seqs(), sorted seq idx
- Returns:
whether we have the full range (start, end) of sorted seq idx cached in self.alloc_intervals (end is exclusive)
- Return type:
bool
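The coverage check can be sketched under the same hypothetical interval layout as above (sorted, non-overlapping (start, end) pairs, end exclusive): the range [start, end) is cached iff the intervals cover it without gaps. This is an illustrative assumption, not the actual RETURNN code.

```python
# Sketch: is the full range [start, end) of sorted seq idxs covered by
# the cached (start, end) intervals (sorted, non-overlapping, end excl.)?

def is_cached(alloc_intervals, start, end):
    pos = start  # first index not yet known to be cached
    for ival_start, ival_end in alloc_intervals:
        if ival_start <= pos < ival_end:
            pos = ival_end  # covered up to this interval's end
        if pos >= end:
            return True
    return pos >= end

intervals = [(0, 4), (4, 8)]
print(is_cached(intervals, 1, 6))   # True
print(is_cached(intervals, 1, 10))  # False
```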
- get_input_data(sorted_seq_idx)[source]#
- Return type:
numpy.ndarray
- Returns features:
format 2d (time,feature) (float)
- get_data_dim(key)[source]#
- Parameters:
key (str) – e.g. “data” or “classes”
- Returns:
number of classes, no matter if sparse or not
- Return type:
int
- get_targets(target, sorted_seq_idx)[source]#
- Parameters:
target (str) – data key
- Return type:
numpy.ndarray
- Returns targets:
format 1d (time) (int: idx of output-feature)
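The shapes documented for get_input_data() and get_targets() fit together as follows. The arrays here are dummy placeholders (shapes and dtypes assumed from the descriptions above, not real dataset output).

```python
# Illustrative shapes only: get_input_data() yields a 2-D float array
# (time, feature); get_targets() yields a 1-D int array (time,) of
# output-feature indices. Dummy data, assumed dtypes.
import numpy as np

features = np.zeros((120, 40), dtype="float32")  # 120 frames, 40 dims
targets = np.zeros((120,), dtype="int32")        # one class idx per frame

assert features.ndim == 2 and targets.ndim == 1
assert features.shape[0] == targets.shape[0]  # same time axis
```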
- get_target_list()[source]#
- Returns:
subset of get_data_keys(). Target keys are usually not available during inference.
- Return type:
list[str]