Dataset

class Dataset.Dataset(name='dataset', window=1, context_window=None, chunking=None, seq_ordering='default', shuffle_frames_of_nseqs=0, min_chunk_size=0, estimated_num_seqs=None)[source]
Parameters:
  • name (str) – e.g. “train” or “eval”
  • window (int) – features will be of dimension window * feature_dim, as we add a context window around each frame. Not all datasets support this option.
  • context_window (None|int|dict|NumbersDict) – will add this context for each chunk
  • chunking (None|str|int|(int,int)|dict|(dict,dict)) – “chunk_size:chunk_step”
  • seq_ordering (str) – “batching”-option in config. e.g. “default”, “sorted” or “random”. See self.get_seq_order_for_epoch() for more details.
  • shuffle_frames_of_nseqs (int) – shuffles the frames. Not always supported.
  • estimated_num_seqs (None|int) – for progress reporting in case the real num_seqs is unknown
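For illustration, a minimal sketch of how these options are typically passed as a kwargs dict (the "class" key and the subclass name here are assumptions; such a dict would be resolved via get_dataset_class() when going through init_dataset(), both documented further below):

    # Hypothetical options dict; "class" selects the Dataset subclass.
    dataset_opts = {
        "class": "HDFDataset",     # assumed subclass, resolved via get_dataset_class()
        "window": 3,               # features become window * feature_dim
        "chunking": "100:50",      # "chunk_size:chunk_step"
        "seq_ordering": "random",  # cf. get_seq_order_for_epoch()
    }
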
static kwargs_update_from_config(config, kwargs)[source]
classmethod from_config(config, **kwargs)[source]
Parameters:kwargs (dict[str]) – passed on to __init__
Return type:Dataset
num_outputs = None[source]
Type:dict[str,(int,int)]
labels = None[source]
Type:dict[str,list[str]]
sliding_window(xr)[source]
Return type:numpy.ndarray
preprocess(seq)[source]
Return type:numpy.ndarray
is_cached(start, end)[source]
Parameters:
  • start (int) – like in load_seqs(), sorted seq idx
  • end (int) – like in load_seqs(), sorted seq idx
Return type:bool
Returns:whether we have the full range (start,end) of sorted seq idx

get_seq_length_2d(sorted_seq_idx)[source]
Return type:numpy.array[int,int]
Returns:the length of the input features and the length of the target sequence. For multiple target seqs, it is expected that they all have the same length. We support different input/target lengths for seq2seq/CTC and other models. Note: this is deprecated; better use get_seq_length().

get_seq_length(seq_idx)[source]
Return type:NumbersDict
get_num_timesteps()[source]
get_num_codesteps()[source]
load_seqs(start, end)[source]

Load data sequences, such that self.get_data() & friends can return the data.

Parameters:
  • start (int) – start sorted seq idx, inclusive
  • end (int) – end sorted seq idx, exclusive
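A minimal sketch of the usual access pattern, using only methods documented on this page (assuming dataset is a Dataset instance):

    dataset.initialize()             # main initialization, see initialize()
    dataset.init_seq_order(epoch=1)  # set the seq order for this epoch
    seq_idx = 0
    while dataset.is_less_than_num_seqs(seq_idx):
        dataset.load_seqs(seq_idx, seq_idx + 1)  # load range [seq_idx, seq_idx + 1)
        features = dataset.get_data(seq_idx, "data")
        seq_idx += 1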

get_seq_order_for_epoch(epoch, num_seqs, get_seq_len=None)[source]

Returns the order of the given epoch. This is mostly a static method, except that it depends on the configured type of ordering, such as ‘default’ (= as-is), ‘sorted’ or ‘random’. ‘sorted’ also uses the sequence length.

Parameters:
  • epoch (int) – for ‘random’, this determines the random seed
  • num_seqs (int) –
  • get_seq_len (((int) -> int)|None) – function (originalSeqIdx: int) -> int
Returns:the order for the given epoch, such that seq_idx -> underlying idx
Return type:list[int]
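For example, with ‘sorted’ ordering configured, a call like the following sketch (the lengths are hypothetical) would yield the indices sorted by sequence length:

    lengths = [7, 3, 5]  # hypothetical seq lengths, indexed by original seq idx
    order = dataset.get_seq_order_for_epoch(
        epoch=1, num_seqs=len(lengths), get_seq_len=lambda i: lengths[i])
    # with seq_ordering='sorted', order would be [1, 2, 0], i.e. seq_idx -> underlying idx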

init_seq_order(epoch=None, seq_list=None)[source]
Parameters:seq_list (list[str]|None) – In case we want to set a predefined order.
Return type:bool
Returns:whether the order changed (True is always safe to return)

This is called when we start a new epoch, or at initialization. Call this when you reset the seq list.

initialize()[source]

Does the main initialization before it can be used. This needs to be called before self.load_seqs() can be used.

get_times(sorted_seq_idx)[source]
get_data(seq_idx, key)[source]
Parameters:
  • seq_idx (int) – sorted seq idx
  • key (str) – data-key, e.g. “data” or “classes”
Return type:numpy.ndarray
Returns features or targets:
 format 2d (time,feature) (float)

get_input_data(sorted_seq_idx)[source]
Return type:numpy.ndarray
Returns features:
 format 2d (time,feature) (float)
get_targets(target, sorted_seq_idx)[source]
Return type:numpy.ndarray
Returns targets:
 format 1d (time) (int: idx of output-feature)
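As a sketch, these accessors correspond to get_data() with the respective data keys (assuming the conventional "data" and "classes" keys from above):

    x = dataset.get_input_data(i)          # like get_data(i, "data"), 2d (time, feature)
    y = dataset.get_targets("classes", i)  # like get_data(i, "classes"), 1d (time,)
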
get_ctc_targets(sorted_seq_idx)[source]
get_data_slice(seq_idx, key, start_frame, end_frame)[source]
get_tag(sorted_seq_idx)[source]
Parameters:sorted_seq_idx (int) –
Return type:str
have_corpus_seq_idx()[source]
Return type:bool
Returns:whether you can call self.get_corpus_seq_idx()
get_corpus_seq_idx(seq_idx)[source]
Parameters:seq_idx (int) – sorted sequence index from the current epoch, depending on seq_ordering
Returns:the sequence index as-is in the original corpus. only defined if self.have_corpus_seq_idx()
Return type:int
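A short sketch of mapping a sorted seq idx back to the original corpus index:

    if dataset.have_corpus_seq_idx():
        corpus_idx = dataset.get_corpus_seq_idx(seq_idx)  # index as-is in the original corpus
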
has_ctc_targets()[source]
get_max_ctc_length()[source]
classmethod generic_complete_frac(seq_idx, num_seqs)[source]
Parameters:
  • seq_idx (int) – idx
  • num_seqs (int|None) – None if not available
Returns:a fraction (float in [0,1], always > 0) of how far we have advanced for this seq in the dataset. This does not have to be exact. This is only for the user.

get_complete_frac(seq_idx)[source]
Returns:a fraction (float in [0,1], always > 0) of how far we have advanced for this seq in the dataset. This does not have to be exact. This is only for the user.
num_seqs[source]
estimated_num_seqs[source]
get_data_keys()[source]
get_target_list()[source]
get_data_dim(key)[source]
Parameters:key (str) – e.g. “data” or “classes”
Returns:number of classes, no matter if sparse or not
Return type:int
get_data_dtype(key)[source]
Parameters:key (str) – e.g. “data” or “classes”
Returns:dtype as str, e.g. “int32” or “float32”
Return type:str
is_data_sparse(key)[source]
Parameters:key (str) – e.g. “data” or “classes”
Returns:whether the data is sparse
Return type:bool
get_data_shape(key)[source]

Returns:get_data(*, key).shape[1:], i.e. num-frames excluded
Return type:list[int]
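Taken together, these allow inspecting all data keys of a dataset, e.g. (a sketch, assuming an initialized dataset):

    for key in dataset.get_data_keys():
        print(key,
              dataset.get_data_dim(key),    # number of classes
              dataset.get_data_dtype(key),  # e.g. "int32" or "float32"
              dataset.is_data_sparse(key),  # sparse or dense
              dataset.get_data_shape(key))  # shape without the time axis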

have_seqs()[source]
Returns:whether num_seqs > 0
Return type:bool
len_info()[source]
Return type:str

Returns:a string to present to the user as information about our length. Depending on the implementation, this can be more or less informative.

is_less_than_num_seqs(n)[source]
Return type:bool

Returns:whether n < num_seqs. In case num_seqs is not known in advance, this will wait until it knows that n is past the end or that we have the seq.

can_serialize_data(key)[source]
Parameters:key (str) – e.g. “classes”
Return type:bool
serialize_data(key, data)[source]
Parameters:
  • key (str) – e.g. “classes”. self.labels[key] should be set
  • data (numpy.ndarray) – 1D
calculate_priori(target='classes')[source]
iterate_seqs(chunk_size=None, chunk_step=None, used_data_keys=None)[source]

Takes chunking into consideration.

Parameters:
  • chunk_size (int|NumbersDict) –
  • chunk_step (int|NumbersDict) –
  • used_data_keys (set(str)|None) –
Returns:generator which yields tuples (seq index, seq start, seq end)
Return type:list[(int,NumbersDict,NumbersDict)]
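A sketch of chunked iteration (the chunk sizes are arbitrary examples; this assumes NumbersDict supports dict-style indexing by data key):

    for seq_idx, start, end in dataset.iterate_seqs(chunk_size=100, chunk_step=50):
        # start/end are NumbersDicts with per-data-key frame positions
        chunk = dataset.get_data_slice(seq_idx, "data", start["data"], end["data"])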

get_start_end_frames_full_seq(seq_idx)[source]
Parameters:seq_idx (int) –
Returns:(start,end) frame, taking context_window into account
Return type:(NumbersDict,NumbersDict)
batch_set_generator_cache_whole_epoch()[source]

The BatchSetGenerator can cache the list of batches which we generated across epochs. See self.generate_batches() and self._generate_batches(). In many cases, the dataset does not support this, and in that case, it is not needed to enable this cache and waste memory. Caching it together with option shuffle_batches could also mean that there will be self.load_seqs() calls with non-monotonic seq-idxs. The only dataset currently which enables this is CachedDataset and thus HDFDataset.

Returns:whether we should enable this cache
Return type:bool
generate_batches(shuffle_batches=False, **kwargs)[source]
Parameters:
  • shuffle_batches (bool) –
  • kwargs – will be passed to _generate_batches()
Return type:BatchSetGenerator

classmethod index_shape_for_batches(batches, data_key='data')[source]
class Dataset.DatasetSeq(seq_idx, features, targets=None, ctc_targets=None, seq_tag=None)[source]
Parameters:
  • seq_idx (int) – sorted seq idx in the Dataset
  • features (numpy.ndarray|dict[str,numpy.ndarray]) – format 2d (time,feature) (float)
  • targets (dict[str,numpy.ndarray]|numpy.ndarray|None) – name -> format 1d (time) (idx of output-feature)
  • ctc_targets (numpy.ndarray|None) – format 1d (time) (idx of output-feature)
  • seq_tag (str) – sequence name / tag
num_frames[source]
Return type:NumbersDict
get_data(key)[source]
get_data_keys()[source]
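A minimal sketch of constructing a DatasetSeq (the shapes, dtypes and tag are arbitrary examples):

    import numpy
    seq = DatasetSeq(
        seq_idx=0,
        features=numpy.zeros((100, 40), dtype="float32"),         # 2d (time, feature)
        targets={"classes": numpy.zeros((100,), dtype="int32")},  # name -> 1d (time)
        seq_tag="my-seq-0")
    print(seq.num_frames)  # NumbersDict
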
Dataset.get_dataset_class(name)[source]
Dataset.init_dataset(kwargs)[source]
Parameters:kwargs (dict[str]|str|(()->dict[str])) –
Return type:Dataset
Dataset.init_dataset_via_str(config_str, config=None, cache_byte_size=None, **kwargs)[source]
Parameters:
  • config_str (str) – hdf-files, or “LmDataset:…” or so
  • config (Config.Config|None) – optional, only for “sprint:…”
  • cache_byte_size (int|None) – optional, only for HDFDataset
Return type:Dataset
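For illustration, both initializers side by side (the file name and the HDFDataset options are placeholders/assumptions):

    from Dataset import init_dataset, init_dataset_via_str
    d1 = init_dataset({"class": "HDFDataset", "files": ["train.hdf"]})  # assumed options
    d2 = init_dataset_via_str("train.hdf")  # hdf-files form of config_str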

Dataset.convert_data_dims(data_dims, leave_dict_as_is=False)[source]

This converts what we called num_outputs originally, from the various formats which were allowed in the past (just an int, or dict[str,int]) into the format which we currently expect. In all cases, the output will be a new copy of the dict.

Parameters:
  • data_dims (int|dict[str,int|(int,int)|dict]) – what we called num_outputs originally
  • leave_dict_as_is (bool) –
Return type:dict[str,(int,int)|dict]

Returns:dict data-key -> (data-dimension, len(shape) (1 ==> sparse)) (or potentially data-key -> dict, if leave_dict_as_is is True; for TensorFlow)
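A sketch of the expected conversion (the exact handling of plain ints is an assumption; (dim, ndim) tuples should pass through unchanged, as a new copy):

    from Dataset import convert_data_dims
    out = convert_data_dims({"data": (40, 2), "classes": (10, 1)})
    # expected: {"data": (40, 2), "classes": (10, 1)}, dim 40 dense, dim 10 sparse
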
Dataset.random() → x in the interval [0, 1).[source]
Dataset.shapes_for_batches(batches, data_keys, dataset=None, extern_data=None, enforce_min_len1=False)[source]
Parameters:
Return type:dict[str,list[int]] | None

Dataset.set_config_num_inputs_outputs_from_dataset(config, dataset)[source]
Parameters: