returnn.datasets.cached2

Provides CachedDataset2.

class returnn.datasets.cached2.CachedDataset2(**kwargs)[source]

Similar in purpose to CachedDataset, but simpler and more generic. Its caching might be less efficient.

If you derive from this class:

  • you must override _collect_single_seq

  • you must set num_inputs (dense-dim of “data” key) and num_outputs (dict key -> dim, ndim-1)

  • you should set labels

  • handle seq ordering by overriding init_seq_order

  • you can set _estimated_num_seqs

  • you can set _num_seqs or _num_timesteps if you know them in advance
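The subclassing contract above can be pictured with a toy model. This is a minimal sketch that does not import RETURNN: ToySeq and ToyCachedBase are hypothetical stand-ins written for this illustration, and the assumption that _collect_single_seq returns None past the last sequence should be checked against the RETURNN source. num_outputs is left out to keep the toy short.

```python
from typing import Dict, List, Optional


class ToySeq:
    """Hypothetical stand-in for one loaded sequence; not a RETURNN class."""

    def __init__(self, seq_idx: int, seq_tag: str, features: Dict[str, List]):
        self.seq_idx = seq_idx
        self.seq_tag = seq_tag
        self.features = features


class ToyCachedBase:
    """Hypothetical stand-in base class; it only mimics the shape of the contract."""

    def __init__(self) -> None:
        self.num_inputs = 0  # dense dim of the "data" key
        self.labels: Dict[str, List[str]] = {}
        self._estimated_num_seqs: Optional[int] = None
        self._cache: Dict[int, ToySeq] = {}

    def _collect_single_seq(self, seq_idx: int) -> Optional[ToySeq]:
        raise NotImplementedError  # subclasses must override this

    def init_seq_order(self, epoch: Optional[int] = None) -> bool:
        self._cache.clear()  # in this toy, a new epoch drops all cached sequences
        return True  # True is always safe to return

    def get_data(self, seq_idx: int, key: str) -> List:
        # Load lazily via the subclass hook, then serve from the cache.
        if seq_idx not in self._cache:
            seq = self._collect_single_seq(seq_idx)
            if seq is None:  # assumption: None signals "past the end"
                raise IndexError(f"seq_idx {seq_idx} is past the end")
            self._cache[seq_idx] = seq
        return self._cache[seq_idx].features[key]


class ToyDataset(ToyCachedBase):
    """Three fixed sequences of 2-dim frames with sparse class targets."""

    _frames = [[[0.0, 1.0]], [[1.0, 0.0], [0.5, 0.5]], [[0.2, 0.8]]]
    _classes = [[0], [1, 1], [0]]

    def __init__(self) -> None:
        super().__init__()
        self.num_inputs = 2
        self.labels = {"classes": ["a", "b"]}
        self._estimated_num_seqs = len(self._frames)
        self.collect_calls = 0  # counter kept only for the demonstration

    def _collect_single_seq(self, seq_idx: int) -> Optional[ToySeq]:
        self.collect_calls += 1
        if seq_idx >= len(self._frames):
            return None
        features = {"data": self._frames[seq_idx], "classes": self._classes[seq_idx]}
        return ToySeq(seq_idx, f"seq-{seq_idx}", features)


dataset = ToyDataset()
dataset.init_seq_order(epoch=1)
first = dataset.get_data(1, "data")
again = dataset.get_data(1, "classes")  # served from the cache, no second collect
```

The point of the design is that the subclass only describes how to produce one sequence; ordering, caching and data access stay in the base class.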

Parameters:
  • name (str) – e.g. “train” or “eval”

  • window (int) – features will be of dimension window * feature_dim, because a context window is added around them. Not all datasets support this option.

  • context_window (None|int|dict|NumbersDict|(dict,dict)) – will add this context for each chunk

  • chunking (None|str|int|(int,int)|dict|(dict,dict)|function) – “chunk_size:chunk_step”

  • seq_ordering (str) – “batching”-option in config. e.g. “default”, “sorted” or “random”. See self.get_seq_order_for_epoch() for more details.

  • fixed_random_seed (int|None) – seed for the shuffling, e.g. for seq_ordering=“random”. If not set, the epoch is used instead. Useful when used as eval dataset.

  • random_seed_offset (int|None) – offset for the shuffling seed, e.g. for seq_ordering=“random”. Ignored when fixed_random_seed is set.

  • partition_epoch (int|None)

  • repeat_epoch (int|None) – Repeat the sequences in an epoch this many times. Useful to scale the dataset relative to other datasets, e.g. when used in CombinedDataset. Not allowed to be used in combination with partition_epoch.

  • seq_list_filter_file (str|None) – defines a subset of sequences (by tag) to use

  • unique_seq_tags (bool) – uniquify seqs with same seq tags in seq order

  • seq_order_seq_lens_file (str|None) – for seq order, use the seq length given by this file

  • shuffle_frames_of_nseqs (int) – shuffles the frames. not always supported

  • estimated_num_seqs (None|int) – for progress reporting in case the real num_seqs is unknown
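The “chunk_size:chunk_step” form of the chunking option can be read as a window length plus a hop. The sketch below is one plausible reading, not RETURNN's implementation: parse_chunking and iter_chunks are hypothetical helpers, reading a bare number as size equal to step is an assumption, and RETURNN's own handling of the other accepted types (dict, tuple, function) is not modelled here.

```python
from typing import Iterator, Tuple


def parse_chunking(spec: str) -> Tuple[int, int]:
    """Parse "chunk_size:chunk_step"; a bare "N" is read as size == step (assumption)."""
    if ":" in spec:
        size_str, step_str = spec.split(":", 1)
        size, step = int(size_str), int(step_str)
    else:
        size = step = int(spec)
    if size <= 0 or step <= 0:
        raise ValueError(f"chunk size and step must be positive, got {spec!r}")
    return size, step


def iter_chunks(seq_len: int, chunk_size: int, chunk_step: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) frame ranges; the last chunks may be shorter than chunk_size."""
    start = 0
    while start < seq_len:
        yield start, min(start + chunk_size, seq_len)
        start += chunk_step


size, step = parse_chunking("4:2")
chunks = list(iter_chunks(seq_len=7, chunk_size=size, chunk_step=step))
```

With a step smaller than the size, consecutive chunks overlap, which is the usual reason to give the two values separately.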

init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]
Parameters:
  • epoch (int|None)

  • seq_list (list[str]|None) – List of sequence tags, to set a predefined order.

  • seq_order (list[int]|None) – List of corpus sequence indices, to set a predefined order. Only possible if the dataset has such indices (see self.have_corpus_seq_idx()).

Return type:

bool

Returns:

whether the order changed (True is always safe to return)

This is called when we start a new epoch, or at initialization. Call this when you reset the seq list.
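How the epoch and the seed parameters could interact when a new order is computed can be sketched as follows. This is an illustration only: toy_seq_order is a hypothetical function, only the “default” and “random” orderings are modelled, and adding random_seed_offset to the epoch is an assumption about how the offset is combined. The actual logic lives in get_seq_order_for_epoch().

```python
import random
from typing import List, Optional


def toy_seq_order(
    num_seqs: int,
    epoch: int,
    seq_ordering: str = "default",
    fixed_random_seed: Optional[int] = None,
    random_seed_offset: Optional[int] = None,
) -> List[int]:
    """Return a list of sequence indices for one epoch (toy model)."""
    if seq_ordering == "default":
        return list(range(num_seqs))
    if seq_ordering == "random":
        if fixed_random_seed is not None:
            seed = fixed_random_seed  # the offset is ignored in this case
        else:
            seed = epoch + (random_seed_offset or 0)  # assumption: simple sum
        order = list(range(num_seqs))
        random.Random(seed).shuffle(order)  # local RNG, global state untouched
        return order
    raise ValueError(f"ordering {seq_ordering!r} is not modelled in this sketch")


# With a fixed seed, every epoch sees the same shuffled order.
eval_epoch1 = toy_seq_order(5, epoch=1, seq_ordering="random", fixed_random_seed=42)
eval_epoch2 = toy_seq_order(5, epoch=2, seq_ordering="random", fixed_random_seed=42)
```

A fixed seed makes evaluation runs comparable across epochs, while seeding from the epoch gives training a fresh but reproducible shuffle each time.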

is_cached(start, end)[source]
Parameters:
  • start (int)

  • end (int)

Return type:

bool

property num_seqs[source]
Return type:

int

is_less_than_num_seqs(n)[source]
Parameters:

n (int)

Return type:

bool

get_num_timesteps()[source]
Return type:

int

get_seq_length(sorted_seq_idx)[source]
Return type:

returnn.util.NumbersDict

get_data(seq_idx, key)[source]
Parameters:
  • seq_idx (int)

  • key (str)

Return type:

numpy.ndarray

get_input_data(seq_idx)[source]
Parameters:

seq_idx (int)

Return type:

numpy.ndarray

get_targets(target, seq_idx)[source]
Parameters:
  • target (str)

  • seq_idx (int)

Return type:

numpy.ndarray

get_tag(sorted_seq_idx)[source]
Parameters:

sorted_seq_idx (int)

Return type:

str

get_data_keys()[source]
Return type:

list[str]

get_target_list()[source]

Target data keys are usually not available during inference. Override this if your dataset is more custom.

is_data_sparse(key)[source]
Parameters:

key (str) – e.g. “data” or “classes”

Return type:

bool

get_data_dim(key)[source]
Parameters:

key (str) – e.g. “data” or “classes”

Return type:

int

Returns:

number of classes, no matter if sparse or not

get_data_dtype(key)[source]
Parameters:

key (str)

Return type:

str

class returnn.datasets.cached2.SingleStreamPipeDataset(dim, ndim, sparse=False, dtype='float32')[source]

Producer: gets data from somewhere (an external source), running in some thread.

Consumer: the thread or code which calls load_seqs and get_data here.

Parameters:
  • dim (int)

  • ndim (int)

  • sparse (bool)

  • dtype (str)

is_data_sparse(key)[source]
Parameters:

key (str)

Return type:

bool

get_data_dtype(key)[source]
Parameters:

key (str)

Return type:

str

init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]
Parameters:
  • epoch (int|None)

  • seq_list (list[str]|None)

  • seq_order (list[int]|None)

Return type:

bool

producer_add_data(data, seq_tag=None)[source]
Parameters:
  • data (numpy.ndarray)

  • seq_tag (str|None)

producer_set_finished()[source]

Mark finished.
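The producer/consumer split can be pictured with a small thread-safe pipe. The sketch below does not use RETURNN: ToyPipe is a hypothetical stand-in that borrows the method names producer_add_data and producer_set_finished, while consumer_get and the default tag format are assumptions made for the illustration.

```python
import threading
from collections import deque
from typing import Deque, List, Optional, Tuple


class ToyPipe:
    """Hypothetical single-stream pipe; not the RETURNN implementation."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._queue: Deque[Tuple[str, List[float]]] = deque()
        self._finished = False
        self._count = 0

    def producer_add_data(self, data: List[float], seq_tag: Optional[str] = None) -> None:
        with self._cond:
            if seq_tag is None:
                seq_tag = f"seq-{self._count}"  # assumption: default tag format
            self._count += 1
            self._queue.append((seq_tag, data))
            self._cond.notify_all()  # wake up a waiting consumer

    def producer_set_finished(self) -> None:
        with self._cond:
            self._finished = True
            self._cond.notify_all()

    def consumer_get(self) -> Optional[Tuple[str, List[float]]]:
        """Block until data arrives; return None once the producer has finished."""
        with self._cond:
            while not self._queue and not self._finished:
                self._cond.wait()
            if self._queue:
                return self._queue.popleft()
            return None


pipe = ToyPipe()


def produce() -> None:
    for i in range(3):
        pipe.producer_add_data([float(i)] * 2)
    pipe.producer_set_finished()


thread = threading.Thread(target=produce)
thread.start()

received = []
while True:
    item = pipe.consumer_get()
    if item is None:
        break
    received.append(item)
thread.join()
```

The finished flag is what lets the consumer tell "no data yet" apart from "no more data", so it can stop waiting instead of blocking forever.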