Dataset

This defines the base dataset class Dataset.

class Dataset.Dataset(name=None, window=1, context_window=None, chunking=None, seq_ordering='default', partition_epoch=None, repeat_epoch=None, seq_list_filter_file=None, unique_seq_tags=False, seq_order_seq_lens_file=None, shuffle_frames_of_nseqs=0, min_chunk_size=0, chunking_variance=0, estimated_num_seqs=None)[source]

Base class for any dataset. This defines the dataset API.

Parameters:
  • name (str) – e.g. “train” or “eval”
  • window (int) – features will be of dimension window * feature_dim, as we add a context window around each frame. Not all datasets support this option.
  • context_window (None|int|dict|NumbersDict|(dict,dict)) – will add this context for each chunk
  • chunking (None|str|int|(int,int)|dict|(dict,dict)) – “chunk_size:chunk_step”
  • seq_ordering (str) – “batching”-option in config. e.g. “default”, “sorted” or “random”. See self.get_seq_order_for_epoch() for more details.
  • partition_epoch (int|None) –
  • repeat_epoch (int|None) – Repeat the sequences in an epoch this many times. Useful to scale the dataset relative to other datasets, e.g. when used in CombinedDataset. Not allowed to be used in combination with partition_epoch.
  • seq_list_filter_file (str|None) – defines a subset of sequences (by tag) to use
  • unique_seq_tags (bool) – uniquify seqs with same seq tags in seq order
  • seq_order_seq_lens_file (str|None) – for seq order, use the seq length given by this file
  • shuffle_frames_of_nseqs (int) – shuffles the frames. Not always supported.
  • estimated_num_seqs (None|int) – for progress reporting in case the real num_seqs is unknown
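
As a sketch of how these options combine, using the HDFDataset subclass (mentioned further below for batch caching) and a made-up file name; the keyword arguments are the constructor parameters documented above:

    from HDFDataset import HDFDataset  # module path assumed, as in these docs

    dataset = HDFDataset(
        files=["train.hdf"],       # hypothetical input file
        name="train",
        window=1,                  # no extra context window around each frame
        chunking="100:50",         # chunk_size:chunk_step, i.e. 50% overlap
        seq_ordering="random",     # shuffle seqs; the epoch determines the seed
        partition_epoch=2)         # split the corpus into 2 sub-epochs
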
static kwargs_update_from_config(config, kwargs)[source]
static get_default_kwargs_eval(config)[source]
Parameters:config (Config.Config) –
Return type:dict[str]
classmethod from_config(config, **kwargs)[source]
Parameters:kwargs (dict[str]) – passed on to __init__
Return type:Dataset
sliding_window(self, xr)[source]
Return type:numpy.ndarray
preprocess(self, seq)[source]
Return type:numpy.ndarray
is_cached(self, start, end)[source]
Parameters:
  • start (int) – like in load_seqs(), sorted seq idx
  • end (int) – like in load_seqs(), sorted seq idx
Returns:whether we have the full range (start,end) of sorted seq idx
Return type:bool

get_seq_length_nd(self, sorted_seq_idx)[source]
Returns:the len of the input features and the len of each target sequence. Note: This is deprecated, better use get_seq_length(). Attention: Either this method or get_seq_length() needs to be redefined in any subclass of Dataset! However, in new code, just override get_seq_length().
Return type:numpy.ndarray

get_seq_length(self, seq_idx)[source]
Parameters:seq_idx (int) –
Returns:the len of the input features and the len of the target sequence.
Return type:NumbersDict

get_num_timesteps(self)[source]
Return type:int
get_num_codesteps(self)[source]
Return type:int|list[int]
load_seqs(self, start, end)[source]

Load data sequences, such that self.get_data() & friends can return the data.

Parameters:
  • start (int) – start sorted seq idx, inclusive
  • end (int) – end sorted seq idx, exclusive
get_seq_order_for_epoch(self, epoch, num_seqs, get_seq_len=None)[source]

Returns the order of the given epoch. This is mostly a static method, except that it depends on the configured type of ordering, such as ‘default’ (= as-is), ‘sorted’ or ‘random’. ‘sorted’ also uses the sequence length.

Parameters:
  • epoch (int) – for ‘random’, this determines the random seed
  • num_seqs (int) –
  • get_seq_len (((int)->int)|None) – function (originalSeqIdx: int) -> int
Returns:the order for the given epoch, such that seq_idx -> underlying idx
Return type:list[int]
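
A sketch of how a subclass typically uses this helper inside init_seq_order() (documented next); the toy corpus, the _seq_lens list, and the _seq_order attribute are assumptions for illustration:

    from Dataset import Dataset  # module path as in these docs

    class MyDataset(Dataset):  # hypothetical subclass
        def __init__(self, seq_lens, **kwargs):
            super(MyDataset, self).__init__(**kwargs)
            self._seq_lens = seq_lens  # one length (int) per corpus seq

        def init_seq_order(self, epoch=None, seq_list=None):
            super(MyDataset, self).init_seq_order(epoch=epoch, seq_list=seq_list)
            # 'sorted' consults get_seq_len; 'random' uses the epoch as seed.
            self._seq_order = self.get_seq_order_for_epoch(
                epoch=epoch, num_seqs=len(self._seq_lens),
                get_seq_len=lambda i: self._seq_lens[i])
            return True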

init_seq_order(self, epoch=None, seq_list=None)[source]

This is called when we start a new epoch, or at initialization. Call this when you reset the seq list.

Parameters:
  • epoch (int|None) –
  • seq_list (list[str]|None) – In case we want to set a predefined order.
Returns:whether the order changed (True is always safe to return)
Return type:bool

finish_epoch(self)[source]

This would get called at the end of the epoch (currently optional only). After this, further calls to get_data() or load_seqs() are invalid, until a new call to init_seq_order() follows.
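
Putting the lifecycle together, a minimal consumer loop over one epoch might look like this, for any already constructed dataset instance:

    dataset.initialize()                 # one-time setup, see initialize() below
    dataset.init_seq_order(epoch=1)      # fix the seq order for this epoch
    seq_idx = 0
    # is_less_than_num_seqs() also works if num_seqs is not known in advance.
    while dataset.is_less_than_num_seqs(seq_idx):
        dataset.load_seqs(seq_idx, seq_idx + 1)  # start inclusive, end exclusive
        features = dataset.get_data(seq_idx, "data")
        seq_idx += 1
    dataset.finish_epoch()               # get_data()/load_seqs() now invalid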

get_current_seq_order(self)[source]
Returns:many datasets use self.get_seq_order_for_epoch. this function would return the current seq order for the current epoch, after self.init_seq_order was called. Not all datasets implement this.
Return type:list[int]
initialize(self)[source]

Does the main initialization before it can be used. This needs to be called before self.load_seqs() can be used.

get_times(self, sorted_seq_idx)[source]
Parameters:sorted_seq_idx (int) –
get_data(self, seq_idx, key)[source]
Parameters:
  • seq_idx (int) – sorted seq idx
  • key (str) – data-key, e.g. “data” or “classes”
Return type:numpy.ndarray
Returns features or targets:
 format 2d (time,feature) (float)

get_input_data(self, sorted_seq_idx)[source]
Return type:numpy.ndarray
Returns features:
 format 2d (time,feature) (float)
get_targets(self, target, sorted_seq_idx)[source]
Parameters:target (str) – data key
Return type:numpy.ndarray
Returns targets:
 format 1d (time) (int: idx of output-feature)
get_ctc_targets(self, sorted_seq_idx)[source]

Warning: This is deprecated/obsolete.

Parameters:sorted_seq_idx (int) –
Return type:numpy.ndarray|None
get_data_slice(self, seq_idx, key, start_frame, end_frame)[source]
Parameters:
  • seq_idx (int) –
  • key (str) –
  • start_frame (int) –
  • end_frame (int) –
Returns:x[start_frame:end_frame], with x = get_data(seq_idx, key)
Return type:numpy.ndarray
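
Per the Returns description, this is equivalent to slicing the full array yourself; for any loaded seq_idx, a check like the following should hold:

    import numpy

    a = dataset.get_data_slice(seq_idx, "data", start_frame=10, end_frame=20)
    b = dataset.get_data(seq_idx, "data")[10:20]
    assert numpy.array_equal(a, b)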

get_tag(self, sorted_seq_idx)[source]
Parameters:sorted_seq_idx (int) –
Return type:str
get_all_tags(self)[source]
Returns:list of all seq tags, of the whole dataset, without partition epoch. Note that this is not possible with all datasets.
Return type:list[str]
get_total_num_seqs(self)[source]
Returns:total number of seqs, without partition epoch. Should be the same as len(self.get_all_tags()). Note that this is not possible with all datasets.
Return type:int
have_corpus_seq_idx(self)[source]
Return type:bool
Returns:whether you can call self.get_corpus_seq_idx()
get_corpus_seq_idx(self, seq_idx)[source]
Parameters:seq_idx (int) – sorted sequence index from the current epoch, depending on seq_ordering
Returns:the sequence index as-is in the original corpus (as if you would have seq_ordering=”default”). Only defined if self.have_corpus_seq_idx().
Return type:int
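
A small sketch of mapping the shuffled epoch order back to corpus positions, e.g. to align model outputs with an external corpus list:

    if dataset.have_corpus_seq_idx():
        # With seq_ordering="default" this mapping is the identity.
        corpus_idx = dataset.get_corpus_seq_idx(seq_idx)
        tag = dataset.get_tag(seq_idx)  # tag of the same seq
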
has_ctc_targets(self)[source]
Returns:whether we have get_ctc_targets implemented
Return type:bool
get_max_ctc_length(self)[source]
Return type:int
classmethod generic_complete_frac(seq_idx, num_seqs)[source]
Parameters:
  • seq_idx (int) – idx
  • num_seqs (int|None) – None if not available
Returns:a fraction (float in [0,1], always > 0) of how far we have advanced for this seq in the dataset. This does not have to be exact. This is only for the user.
Return type:float

get_complete_frac(self, seq_idx)[source]
Parameters:seq_idx (int) –
Returns:a fraction (float in [0,1], always > 0) of how far we have advanced for this seq in the dataset. This does not have to be exact. This is only for the user.
Return type:float
num_seqs[source]
Return type:int
estimated_num_seqs[source]
Returns:estimated num seqs. does not have to be exact
Return type:int|None
get_data_keys(self)[source]
Returns:all available data keys (for get_data and all other functions)
Return type:list[str]
get_target_list(self)[source]
Returns:subset of get_data_keys(). target keys are usually not available during inference
Return type:list[str]
get_data_dim(self, key)[source]
Parameters:key (str) – e.g. “data” or “classes”
Returns:number of classes, no matter if sparse or not
Return type:int
get_data_dtype(self, key)[source]
Parameters:key (str) – e.g. “data” or “classes”
Returns:dtype as str, e.g. “int32” or “float32”
Return type:str
is_data_sparse(self, key)[source]
Parameters:key (str) – e.g. “data” or “classes”
Returns:whether the data is sparse
Return type:bool
get_data_shape(self, key)[source]
Returns:get_data(*, key).shape[1:], i.e. num-frames excluded
Return type:list[int]

have_seqs(self)[source]
Returns:whether num_seqs > 0
Return type:bool
len_info(self)[source]
Returns:a string to present to the user as information about our len. Depending on our implementation, we can give some more or some less information.
Return type:str

is_less_than_num_seqs(self, n)[source]
Returns:whether n < num_seqs. In case num_seqs is not known in advance, it will wait until it knows that n is behind the end or that we have the seq.
Return type:bool

can_serialize_data(self, key)[source]
Parameters:key (str) – e.g. “classes”
Return type:bool
serialize_data(self, key, data)[source]
Parameters:
  • key (str) – e.g. “classes”. self.labels[key] should be set
  • data (numpy.ndarray) – 1D
Return type:str

calculate_priori(self, target='classes')[source]
Parameters:target (str) –
Return type:numpy.ndarray
iterate_seqs(self, chunk_size=None, chunk_step=None, used_data_keys=None)[source]

Takes chunking into consideration.

Parameters:
  • chunk_size (int|NumbersDict) –
  • chunk_step (int|NumbersDict) –
  • used_data_keys (set(str)|None) –
Returns:generator which yields tuples (seq index, seq start, seq end)
Return type:list[(int,NumbersDict,NumbersDict)]
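
A sketch of consuming the generator; with chunk_size=100 and chunk_step=50, consecutive chunks of one seq overlap by 50 frames:

    for seq_idx, start, end in dataset.iterate_seqs(chunk_size=100, chunk_step=50):
        # start and end are NumbersDict, one entry per data key, e.g. start["data"].
        print(seq_idx, start, end)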

get_start_end_frames_full_seq(self, seq_idx)[source]
Parameters:seq_idx (int) –
Returns:(start,end) frame, taking context_window into account
Return type:(NumbersDict,NumbersDict)
sample(self, seq_idx)[source]
Parameters:seq_idx (int) –
Return type:bool
update_weights(self, seqs, weights)[source]
Parameters:
batch_set_generator_cache_whole_epoch(self)[source]

The BatchSetGenerator can cache the list of batches which we generated across epochs. See self.generate_batches() and self._generate_batches(). In many cases, the dataset does not support this, and in that case, it is not needed to enable this cache and waste memory. Caching it together with option shuffle_batches could also mean that there will be self.load_seqs() calls with non-monotonic seq-idxs. The only dataset currently which enables this is CachedDataset and thus HDFDataset.

Returns:whether we should enable this cache
Return type:bool
generate_batches(self, shuffle_batches=False, **kwargs)[source]
Parameters:
  • shuffle_batches (bool) –
  • kwargs – will be passed to _generate_batches()
Return type:BatchSetGenerator

classmethod index_shape_for_batches(batches, data_key='data')[source]
Parameters:
Returns:shape as (time, batch)
Return type:(int, int)

class Dataset.DatasetSeq(seq_idx, features, targets=None, ctc_targets=None, seq_tag=None)[source]

Encapsulates all data for one sequence.

Parameters:
  • seq_idx (int) – sorted seq idx in the Dataset
  • features (numpy.ndarray|dict[str,numpy.ndarray]) – format 2d (time,feature) (float)
  • targets (dict[str,numpy.ndarray]|numpy.ndarray|None) – name -> format 1d (time) (idx of output-feature)
  • ctc_targets (numpy.ndarray|None) – format 1d (time) (idx of output-feature)
  • seq_tag (str) – sequence name / tag
num_frames[source]
Return type:NumbersDict
get_data(self, key)[source]
Parameters:key (str) –
Return type:numpy.ndarray
get_data_keys(self)[source]
Return type:set[str]
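
For illustration, a custom dataset could wrap one sequence like this; the shapes follow the formats documented above, while the array contents and the tag are made up:

    import numpy
    from Dataset import DatasetSeq  # module path as in these docs

    seq = DatasetSeq(
        seq_idx=0,
        features=numpy.zeros((50, 40), dtype="float32"),         # (time, feature)
        targets={"classes": numpy.zeros((50,), dtype="int32")},  # (time,), sparse
        seq_tag="seq-0000")
    # Per get_data()/get_data_keys() above, the target key should be retrievable:
    assert seq.get_data("classes").shape == (50,)
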
Dataset.get_dataset_class(name)[source]
Parameters:name (str) –
Return type:type[Dataset]
Dataset.init_dataset(kwargs, extra_kwargs=None, default_kwargs=None)[source]
Parameters:
  • kwargs (dict[str]|str|(()->dict[str])|Dataset) –
  • extra_kwargs (dict[str]|None) –
  • default_kwargs (dict[str]|None) –
Return type:Dataset
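
Typical usage passes a kwargs dict whose “class” entry names the dataset class (resolved via get_dataset_class() above), with the remaining entries going to that class’s constructor; the file name here is a placeholder:

    from Dataset import init_dataset  # module path as in these docs

    dataset = init_dataset({"class": "HDFDataset", "files": ["train.hdf"]})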

Dataset.init_dataset_via_str(config_str, config=None, cache_byte_size=None, **kwargs)[source]
Parameters:
  • config_str (str) – hdf-files, or “LmDataset:…” or so
  • config (Config.Config|None) – optional, only for “sprint:…”
  • cache_byte_size (int|None) – optional, only for HDFDataset
Return type:Dataset

Dataset.convert_data_dims(data_dims, leave_dict_as_is=False)[source]

This converts what we called num_outputs originally, from the various formats which were allowed in the past (just an int, or dict[str,int]) into the format which we currently expect. In all cases, the output will be a new copy of the dict.

Parameters:
  • data_dims (int|dict[str,int|(int,int)|dict]) – what we called num_outputs originally
  • leave_dict_as_is (bool) –
Returns:dict data-key -> (data-dimension, len(shape) (1 ==> sparse)) (or potentially data-key -> dict, if leave_dict_as_is is True; for TensorFlow)
Return type:dict[str,(int,int)|dict]
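
A sketch of the normalization; the dict below already uses the (dim, ndim) tuple form, so the call essentially validates and copies it:

    from Dataset import convert_data_dims

    num_outputs = {"data": (40, 2),     # dense features: dim 40, 2 axes
                   "classes": (10, 1)}  # sparse targets: 10 classes, 1 axis
    print(convert_data_dims(num_outputs))
    # expected: {'data': (40, 2), 'classes': (10, 1)}, per the Returns description
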
Dataset.random()[source]
Dataset.shapes_for_batches(batches, data_keys, dataset=None, extern_data=None, enforce_min_len1=False)[source]
Parameters:
Return type:dict[str,list[int]] | None

Dataset.set_config_num_inputs_outputs_from_dataset(config, dataset)[source]
Parameters:
  • config (Config.Config) –
  • dataset (Dataset) –