Dataset

class Dataset.Dataset(name='dataset', window=1, context_window=None, chunking='0', seq_ordering='default', shuffle_frames_of_nseqs=0, estimated_num_seqs=None)[source]
Parameters:
  • name (str) – e.g. “train” or “eval”
  • window (int) – features will have dimension window * feature_dim, as we add a context window around each frame. Not all datasets support this option.
  • context_window (None|int|dict|NumbersDict) – will add this context for each chunk
  • chunking (str) – “chunk_size:chunk_step”
  • seq_ordering (str) – “batching”-option in config. e.g. “default”, “sorted” or “random”. See self.get_seq_order_for_epoch() for more details.
  • shuffle_frames_of_nseqs (int) – shuffles the frames. Not always supported.
  • estimated_num_seqs (None|int) – for progress reporting in case the real num_seqs is unknown
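Dataset itself is abstract; these options are passed through to a concrete subclass. A minimal sketch, assuming the flat module layout used on this page and assuming that HDFDataset (mentioned under batch_set_generator_cache_whole_epoch()) accepts a files list in addition to the base options:

    from HDFDataset import HDFDataset  # assumed import path (flat module layout)

    dataset = HDFDataset(
        files=["train.hdf"],     # assumed subclass-specific option; placeholder path
        name="train",
        window=1,                # no extra context window around each frame
        chunking="50:25",        # chunk_size:chunk_step
        seq_ordering="random",   # see get_seq_order_for_epoch() below
    )
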
batch_set_generator_cache_whole_epoch()[source]

The BatchSetGenerator can cache the list of batches which we generated across epochs. See self.generate_batches() and self._generate_batches(). In many cases the dataset does not support this, and then there is no need to enable this cache and waste memory. Caching it together with the option shuffle_batches could also mean that there will be self.load_seqs() calls with non-monotonic seq indices. The only dataset which currently enables this is CachedDataset, and thus HDFDataset.

Returns: whether we should enable this cache
Return type: bool
calculate_priori(target='classes')[source]
estimated_num_seqs[source]
classmethod from_config(config, **kwargs)[source]
Parameters: kwargs (dict[str]) – passed on to __init__
Return type: Dataset
generate_batches(recurrent_net, batch_size, max_seqs=-1, seq_drop=0.0, max_seq_length=9223372036854775807, shuffle_batches=False, used_data_keys=None)[source]
Parameters: used_data_keys (set(str)|None) –
Return type: BatchSetGenerator
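A sketch of consuming the result; the BatchSetGenerator method names used below (has_more, peek_next_n, advance) are assumptions and not documented on this page:

    batch_gen = dataset.generate_batches(
        recurrent_net=True,
        batch_size=5000,        # commonly the maximum number of frames per batch
        max_seqs=40,            # maximum number of sequences per batch
        shuffle_batches=False,
    )
    while batch_gen.has_more():              # assumed method name
        (batch,) = batch_gen.peek_next_n(1)  # assumed method name
        # ... hand the batch over to the engine ...
        batch_gen.advance(1)                 # assumed method name
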
classmethod generic_complete_frac(seq_idx, num_seqs)[source]
Parameters:
  • seq_idx (int) – idx
  • num_seqs (int|None) – None if not available
Returns: a fraction (float in [0,1], always > 0) of how far we have advanced for this seq in the dataset. This does not have to be exact. This is only for the user.

get_complete_frac(seq_idx)[source]
Returns: a fraction (float in [0,1], always > 0) of how far we have advanced for this seq in the dataset. This does not have to be exact. This is only for the user.
get_ctc_targets(sorted_seq_idx)[source]
get_data(seq_idx, key)[source]
Parameters:
  • seq_idx (int) – sorted seq idx
  • key (str) – data-key, e.g. “data” or “classes”
Return type: numpy.ndarray
Returns: features or targets, format 2d (time,feature) (float)
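A minimal sketch of the intended access pattern, following the initialize(), init_seq_order(), is_less_than_num_seqs() and load_seqs() entries below; the keys "data" and "classes" are the usual defaults but depend on the concrete dataset:

    dataset.initialize()
    dataset.init_seq_order(epoch=1)
    seq_idx = 0
    while dataset.is_less_than_num_seqs(seq_idx):
        dataset.load_seqs(seq_idx, seq_idx + 1)         # start inclusive, end exclusive
        feat = dataset.get_data(seq_idx, "data")        # 2d (time,feature), float
        classes = dataset.get_data(seq_idx, "classes")  # target key, if provided by the dataset
        seq_idx += 1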

get_data_dim(key)[source]
Returns: number of classes, no matter if sparse or not
get_data_dtype(key)[source]
get_data_keys()[source]
get_data_shape(key)[source]

Returns: get_data(*, key).shape[1:], i.e. num-frames excluded

get_data_slice(seq_idx, key, start_frame, end_frame)[source]
get_input_data(sorted_seq_idx)[source]
Return type: numpy.ndarray
Returns: features, format 2d (time,feature) (float)
get_max_ctc_length()[source]
get_num_codesteps()[source]
get_num_timesteps()[source]
get_seq_length(seq_idx)[source]
Return type: NumbersDict
get_seq_length_2d(sorted_seq_idx)[source]
Return type: numpy.array[int,int]
Returns: the len of the input features and the len of the target sequence. For multiple target seqs, it is expected that they all have the same len. We support different input/target len for seq2seq/ctc and other models. Note: This is deprecated, better use get_seq_length().

get_seq_order_for_epoch(epoch, num_seqs, get_seq_len=None)[source]

Returns: the order for the given epoch. This is mostly a static method, except that it depends on the configured type of ordering, such as ‘default’ (= as-is), ‘sorted’ or ‘random’. ‘sorted’ also uses the sequence length.
Parameters:
  • epoch (int) – for ‘random’, this determines the random seed
  • get_seq_len – function (originalSeqIdx: int) -> int
Return type: list[int]
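A sketch with placeholder lengths; only ‘sorted’ actually needs the get_seq_len callback:

    lengths = [12, 3, 7, 25]  # placeholder seq lengths, indexed by original seq idx

    order = dataset.get_seq_order_for_epoch(
        epoch=1,
        num_seqs=len(lengths),
        get_seq_len=lambda orig_seq_idx: lengths[orig_seq_idx],
    )
    # 'default' keeps 0..num_seqs-1 as-is, 'random' shuffles with the epoch as seed,
    # 'sorted' orders by the lengths above; in each case it is a permutation of the indices.
    assert sorted(order) == sorted(range(len(lengths)))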

get_start_end_frames_full_seq(seq_idx)[source]
Parameters: seq_idx (int) –
Returns: (start,end) frame, taking context_window into account
Return type: (NumbersDict,NumbersDict)
get_tag(sorted_seq_idx)[source]
get_target_list()[source]
get_targets(target, sorted_seq_idx)[source]
Return type: numpy.ndarray
Returns: targets, format 1d (time) (int: idx of output-feature)
get_times(sorted_seq_idx)[source]
has_ctc_targets()[source]
have_seqs()[source]
classmethod index_shape_for_batches(batches, data_key='data')[source]
init_seq_order(epoch=None, seq_list=None)[source]
Parameters: seq_list (list[str]|None) – In case we want to set a predefined order.
Return type: bool
Returns: whether the order changed

This is called when we start a new epoch, or at initialization. Call this when you reset the seq list.

initialize()[source]

Does the main initialization before it can be used. This needs to be called before self.load_seqs() can be used.

is_cached(start, end)[source]
Parameters:
  • start (int) – like in load_seqs(), sorted seq idx
  • end (int) – like in load_seqs(), sorted seq idx
Return type: bool
Returns: whether we have the full range (start,end) of sorted seq idx

is_data_sparse(key)[source]
is_less_than_num_seqs(n)[source]
Return type: bool
Returns: whether n < num_seqs. In case num_seqs is not known in advance, it will wait until it knows that n is beyond the end or that we have the seq.

iterate_seqs(chunk_size=None, chunk_step=None, used_data_keys=None)[source]

Takes chunking into consideration.
Parameters:
  • chunk_size (int) –
  • chunk_step (int) –
  • used_data_keys (set(str)|None) –
Returns: generator which yields tuples (seq index, seq start, seq end)
Return type: list[(int,NumbersDict,NumbersDict)]
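A sketch of consuming the chunk boundaries together with get_data_slice(); indexing the NumbersDict start/end values by data key is an assumption here:

    for seq_idx, start, end in dataset.iterate_seqs(chunk_size=50, chunk_step=25):
        dataset.load_seqs(seq_idx, seq_idx + 1)
        # start/end hold one frame index per data key (assumed NumbersDict access)
        chunk = dataset.get_data_slice(seq_idx, "data", start["data"], end["data"])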

static kwargs_update_from_config(config, kwargs)[source]
len_info()[source]
Return type: str
Returns: a string to present the user as information about our len. Depending on our implementation, we can give some more or some less information.

load_seqs(start, end)[source]

Load data sequences, such that self.get_data() & friends can return the data.
Parameters:
  • start (int) – start sorted seq idx, inclusive
  • end (int) – end sorted seq idx, exclusive

num_seqs[source]
preprocess(seq)[source]
Return type: numpy.ndarray
sliding_window(xr)[source]
Return type: numpy.ndarray
class Dataset.DatasetSeq(seq_idx, features, targets, ctc_targets=None, seq_tag=None)[source]
Parameters:
  • seq_idx (int) – sorted seq idx in the Dataset
  • features (numpy.ndarray) – format 2d (time,feature) (float)
  • targets (dict[str,numpy.ndarray]|numpy.ndarray|None) – name -> format 1d (time) (idx of output-feature)
  • ctc_targets (numpy.ndarray|None) – format 1d (time) (idx of output-feature)
  • seq_tag (str) – sequence name / tag
get_data(key)[source]
get_data_keys()[source]
num_frames[source]
Return type: NumbersDict
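A minimal sketch of constructing a single sequence; shapes, dtypes and the "classes" key are placeholders:

    import numpy
    from Dataset import DatasetSeq  # assumed import for the flat module layout of this page

    seq = DatasetSeq(
        seq_idx=0,
        features=numpy.zeros((100, 40), dtype="float32"),         # 2d (time,feature)
        targets={"classes": numpy.zeros((100,), dtype="int32")},  # name -> 1d (time)
        seq_tag="seq-0000",
    )
    print(seq.num_frames)  # NumbersDict with the frame count per data key
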
Dataset.convert_data_dims(data_dims, leave_dict_as_is=False)[source]

This converts what we called num_outputs originally, from the various formats which were allowed in the past (just an int, or dict[str,int]), into the format which we currently expect.
Parameters:
  • data_dims (int | dict[str,int|(int,int)|dict]) – what we called num_outputs originally
  • leave_dict_as_is (bool) –
Returns: dict data-key -> (data-dimension, len(shape) (1 ==> sparse)) (or potentially data-key -> dict, if leave_dict_as_is is True; for TensorFlow)
Return type: dict[str,(int,int)|dict]
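A sketch of the normalization described above; the plain-int handling is not spelled out here, so only the dict form is shown and the expected result is hedged in the comment:

    from Dataset import convert_data_dims  # assumed import for the flat module layout of this page

    dims = convert_data_dims({"data": (40, 2), "classes": 10})
    # Expected layout per the description above: data-key -> (data-dimension, len(shape)),
    # with 1 meaning sparse, e.g. roughly {"data": (40, 2), "classes": (10, 1)}.
    assert set(dims) == {"data", "classes"}
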
Dataset.get_dataset_class(name)[source]
Dataset.init_dataset(kwargs)[source]
Return type: Dataset
Dataset.init_dataset_via_str(config_str, config=None, cache_byte_size=None, **kwargs)[source]
Parameters:
  • config_str (str) – hdf-files, or “LmDataset:...” or so
  • config (Config.Config|None) – optional, only for “sprint:...”
  • cache_byte_size (int|None) – optional, only for HDFDataset
Return type: Dataset
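A sketch of the HDF-file form; the file name is a placeholder:

    from Dataset import init_dataset_via_str  # assumed import for the flat module layout of this page

    dataset = init_dataset_via_str("train.hdf")  # placeholder file name
    # afterwards the usual initialize()/init_seq_order()/load_seqs() cycle applies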

Dataset.random() → x in the interval [0, 1).[source]
Dataset.shapes_for_batches(batches, data_keys, dataset=None, extern_data=None)[source]
Return type: dict[str,list[int]] | None