class CachedDataset.CachedDataset(cache_byte_size=0, **kwargs)[source]
Parameters:cache_byte_size (int) –

initialize(self)[source]

Does the main initialization before it can be used. This needs to be called before self.load_seqs() can be used.

init_seq_order(self, epoch=None, seq_list=None)[source]
Parameters:seq_list (list[str]|None) – In case we want to set a predefined order.
Initialize lists:
self.seq_index # sorted seq idx
get_current_seq_order(self)[source]
Returns:many datasets use self.get_seq_order_for_epoch. This function would return the current seq order for the current epoch, after self.init_seq_order was called. Not all datasets implement this.
Return type:list[int]
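As a rough illustration of the seq order above, the following sketch (the helper and its data are invented for illustration, not part of the API) builds a seq_index like the one init_seq_order initializes, ordering sequences by length, which is one ordering get_seq_order_for_epoch can produce:

```python
# Hypothetical sketch: a "sorted" seq order, as one possible result of
# self.get_seq_order_for_epoch(). The lengths below are made-up data.
seq_lengths = [7, 3, 9, 3, 5]
# sorted seq idx -> corpus seq idx, ordered by sequence length (stable)
seq_index = sorted(range(len(seq_lengths)), key=lambda i: seq_lengths[i])
print(seq_index)  # [1, 3, 4, 0, 2]
```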

batch_set_generator_cache_whole_epoch(self)[source]

The BatchSetGenerator can cache the list of batches which we generated across epochs. See self.generate_batches() and self._generate_batches(). In many cases, the dataset does not support this; in that case, enabling this cache would only waste memory. Caching it together with the option shuffle_batches could also mean that there will be self.load_seqs() calls with non-monotonic seq-idxs. The only dataset which currently enables this is CachedDataset, and thus HDFDataset.

Returns:whether we should enable this cache
Return type:bool
load_seqs(self, start, end)[source]

Load data sequences. As a side effect, will modify / fill-up:

self.alloc_intervals
self.targets

This does some extra logic for the cache and calls self._load_seqs() for the real loading.

Parameters:
  • start (int) – start sorted seq idx
  • end (int) – end sorted seq idx
alloc_interval_index(self, ids)[source]
Parameters:ids (int) – sorted seq idx

Returns:index in self.alloc_intervals
Return type:int
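One way such a lookup can work is sketched below; it assumes alloc_intervals is a sorted list of half-open (start, end) ranges and returns -1 when the seq idx is not allocated (both the representation and the -1 convention are assumptions for illustration):

```python
import bisect

def alloc_interval_index_sketch(alloc_intervals, seq_idx):
    # Hypothetical sketch: alloc_intervals is assumed to be a sorted list of
    # half-open (start, end) ranges of sorted seq idxs. Return the index of
    # the interval containing seq_idx, or -1 if it is not allocated.
    starts = [s for (s, _e) in alloc_intervals]
    i = bisect.bisect_right(starts, seq_idx) - 1
    if i >= 0 and seq_idx < alloc_intervals[i][1]:
        return i
    return -1

print(alloc_interval_index_sketch([(0, 3), (5, 8)], 6))  # 1
print(alloc_interval_index_sketch([(0, 3), (5, 8)], 4))  # -1
```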

insert_alloc_interval(self, start, end=None)[source]
remove_alloc_interval(self, start, end=None)[source]
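The two interval-maintenance methods above are not documented here; as a hedged sketch (the interval representation and the meaning of end=None are assumptions, not from the docs), insertion with merging could look like:

```python
def insert_alloc_interval_sketch(intervals, start, end=None):
    # Hypothetical sketch: merge a new half-open [start, end) range into a
    # sorted, non-overlapping list of (start, end) intervals.
    # end=None meaning the single seq idx `start` is an assumption.
    if end is None:
        end = start + 1
    out = []
    for s, e in sorted(intervals + [(start, end)]):
        if out and s <= out[-1][1]:  # overlaps or touches the previous range
            out[-1] = (out[-1][0], max(out[-1][1], e))
        else:
            out.append((s, e))
    return out

print(insert_alloc_interval_sketch([(0, 3), (5, 8)], 3, 5))  # [(0, 8)]
```

Removal would be the dual operation: splitting or trimming any interval that overlaps the removed range.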
delete(self, nframes)[source]
Parameters:nframes (int|None) – maximum number of frames to delete. Note that this limit is not strict; we can end up deleting more than nframes.
Returns:number of frames deleted
Return type:int
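Why the limit is not strict can be shown with a small sketch (the cached-sequence list is an invented stand-in; only the documented contract is modeled): whole sequences are freed, so the total can exceed nframes.

```python
def delete_sketch(cached_seqs, nframes):
    # Hypothetical sketch of delete(nframes). cached_seqs is an invented
    # list of (seq_idx, n_frames) pairs standing in for the cached data.
    # Whole sequences are freed until at least nframes frames are released,
    # so the limit is not strict: we can delete more than nframes.
    # nframes=None is taken here to mean "delete everything".
    deleted = 0
    while cached_seqs and (nframes is None or deleted < nframes):
        _seq_idx, frames = cached_seqs.pop(0)
        deleted += frames
    return deleted

print(delete_sketch([(0, 40), (1, 70)], 50))  # 110, i.e. more than nframes
```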
is_cached(self, start, end, blocking=False)[source]
Parameters:
  • start (int) – like in load_seqs(), sorted seq idx
  • end (int) – like in load_seqs(), sorted seq idx
  • blocking (bool) –
Returns:whether we have the full range (start,end) of sorted seq idx cached in self.alloc_intervals (end is exclusive)
Return type:bool
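The coverage check can be sketched as follows, assuming (as above, an assumption for illustration) that alloc_intervals is a sorted list of non-overlapping half-open ranges:

```python
def is_cached_sketch(alloc_intervals, start, end):
    # Hypothetical sketch: report whether the full range [start, end) of
    # sorted seq idxs is covered by the sorted, non-overlapping half-open
    # (start, end) intervals in alloc_intervals (end is exclusive).
    pos = start
    for s, e in alloc_intervals:
        if s > pos:
            break  # gap at pos: the range is not fully cached
        pos = max(pos, e)
        if pos >= end:
            return True
    return pos >= end

print(is_cached_sketch([(0, 3), (3, 7)], 1, 6))  # True
print(is_cached_sketch([(0, 3), (4, 7)], 1, 6))  # False
```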
get_seq_length_nd(self, sorted_seq_idx)[source]
Return type:numpy.ndarray
get_seq_length(self, seq_idx)[source]
Return type:NumbersDict
get_seq_start(self, sorted_seq_idx)[source]
Return type:(int,int)
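A sequence's start offset in a contiguous cache is just the cumulative sum of the preceding lengths; a sketch for a single stream (the helper is hypothetical, and the real method returns a pair (int,int), presumably one offset per stream):

```python
def seq_starts_sketch(seq_lengths):
    # Hypothetical helper: the frame offset at which each sequence starts
    # in a contiguous cache is the cumulative sum of preceding lengths.
    starts = [0]
    for n in seq_lengths:
        starts.append(starts[-1] + n)
    return starts[:-1]

print(seq_starts_sketch([4, 2, 5]))  # [0, 4, 6]
```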
get_times(self, sorted_seq_idx)[source]
Parameters:sorted_seq_idx (int) –
get_input_data(self, sorted_seq_idx)[source]
Return type:numpy.ndarray
Returns:features in format 2d (time,feature) (float)
get_data_dim(self, key)[source]
Parameters:key (str) – e.g. “data” or “classes”
Returns:number of classes, no matter if sparse or not
Return type:int
get_targets(self, target, sorted_seq_idx)[source]
Parameters:target (str) – data key
Return type:numpy.ndarray
Returns:targets in format 1d (time) (int: idx of output-feature)
get_target_list(self)[source]
Returns:subset of get_data_keys(). Target keys are usually not available during inference.
Return type:list[str]
get_ctc_targets(self, sorted_seq_idx)[source]

Warning: This is deprecated/obsolete.

Parameters:sorted_seq_idx (int) –
Return type:numpy.ndarray|None
has_ctc_targets(self)[source]
Returns:whether we have get_ctc_targets implemented
Return type:bool
get_tag(self, sorted_seq_idx)[source]
Parameters:sorted_seq_idx (int) –
Return type:str
have_corpus_seq_idx(self)[source]
Returns:whether you can call self.get_corpus_seq_idx()
Return type:bool
get_corpus_seq_idx(self, seq_idx)[source]
Parameters:seq_idx (int) – sorted sequence index from the current epoch, depending on seq_ordering
Returns:the sequence index as-is in the original corpus. only defined if self.have_corpus_seq_idx()
Return type:int
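Given the seq_index list that init_seq_order initializes (sorted seq idx -> corpus seq idx), the mapping back to the corpus can be sketched as a plain lookup (the order below is made-up data for illustration):

```python
# Hypothetical sketch: after init_seq_order(), self.seq_index maps
# sorted seq idx -> corpus seq idx, so get_corpus_seq_idx() is a lookup.
seq_index = [1, 3, 4, 0, 2]  # made-up order from some seq_ordering

def get_corpus_seq_idx_sketch(seq_idx):
    return seq_index[seq_idx]

print(get_corpus_seq_idx_sketch(2))  # 4
```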