returnn.datasets.numpy_dump

Provides NumpyDumpDataset.

class returnn.datasets.numpy_dump.NumpyDumpDataset(prefix, postfix='.txt.gz', start_seq=0, end_seq=None, num_inputs=None, num_outputs=None, **kwargs)

Dataset for the files written by tools/dump-dataset.py --type=numpy.
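As a rough illustration, a dataset entry in a RETURNN config might look like the sketch below. The "class" key follows the usual RETURNN convention for dataset dicts; the prefix path and the dimension values are hypothetical placeholders. This page does not state the expected types of num_inputs and num_outputs, so plain integers are an assumption here.

```python
# Minimal sketch of a config entry; all concrete values are placeholders.
train = {
    "class": "NumpyDumpDataset",
    # Hypothetical path prefix; it should match the files the dump tool wrote.
    "prefix": "/tmp/my-dump.",
    "postfix": ".txt.gz",  # same as the documented default
    # Assumed to be plain integers; the documentation does not state the types.
    "num_inputs": 40,
    "num_outputs": 10,
    # Generic dataset option passed on via **kwargs.
    "seq_ordering": "random",
}
```

The remaining keyword arguments listed under Parameters are generic dataset options and would go into the same dict.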

Parameters:
  • name (str) – e.g. “train” or “eval”

  • window (int) – Features will have dimension window * feature_dim, because a context window is added around each frame. Not all datasets support this option.

  • context_window (None|int|dict|NumbersDict|(dict,dict)) – Context that is added to each chunk.

  • chunking (None|str|int|(int,int)|dict|(dict,dict)|function) – Chunking specification, e.g. a string of the form “chunk_size:chunk_step”.

  • seq_ordering (str) – The “batching” option in the config, e.g. “default”, “sorted” or “random”. See self.get_seq_order_for_epoch() for more details.

  • fixed_random_seed (int|None) – Seed for the shuffling, e.g. for seq_ordering=“random”. If not set, the epoch is used instead. Useful when the dataset is used as an eval dataset.

  • random_seed_offset (int|None) – Offset for the shuffling seed, e.g. for seq_ordering=“random”. Ignored when fixed_random_seed is set.

  • partition_epoch (int|None) –

  • repeat_epoch (int|None) – Repeat the sequences in an epoch this many times. Useful to scale the dataset relative to other datasets, e.g. when used in CombinedDataset. Cannot be combined with partition_epoch.

  • seq_list_filter_file (str|None) – Defines a subset of sequences (by tag) to use.

  • unique_seq_tags (bool) – Uniquify sequences that share the same seq tag in the seq order.

  • seq_order_seq_lens_file (str|None) – For the seq order, use the seq lengths given by this file.

  • shuffle_frames_of_nseqs (int) – Shuffles the frames. Not always supported.

  • estimated_num_seqs (None|int) – Used for progress reporting in case the real num_seqs is unknown.
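To make the “chunk_size:chunk_step” notation of the chunking parameter concrete, here is a small standalone sketch. It is not RETURNN's implementation; parse_chunking and chunk_ranges are hypothetical helpers. It assumes that a chunk starts every chunk_step frames and that the last chunks are cut off at the sequence end.

```python
def parse_chunking(spec):
    """Parse a "chunk_size:chunk_step" string into two ints."""
    size_str, step_str = spec.split(":")
    return int(size_str), int(step_str)


def chunk_ranges(seq_len, chunk_size, chunk_step):
    """Return (start, end) frame ranges for a sequence with seq_len frames.

    Assumption: a new chunk starts every chunk_step frames, and chunks
    that would run past the sequence end are truncated.
    """
    ranges = []
    start = 0
    while start < seq_len:
        ranges.append((start, min(start + chunk_size, seq_len)))
        start += chunk_step
    return ranges


size, step = parse_chunking("50:25")
ranges = chunk_ranges(120, size, step)
print(ranges)
```

With chunk_step smaller than chunk_size, as in this example, consecutive chunks overlap.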

file_format_data = '%i.data'
file_format_targets = '%i.targets'
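The two format strings take a sequence index. One plausible reading is that the per-sequence file names are built from prefix, the format string, and postfix, as sketched below. This composition is an assumption and is not confirmed by this page; dump_file_names and the example prefix are hypothetical.

```python
# Class attributes as documented above.
file_format_data = "%i.data"
file_format_targets = "%i.targets"


def dump_file_names(prefix, seq_idx, postfix=".txt.gz"):
    """Build the (data, targets) file names for one sequence index.

    Assumption: name = prefix + format string filled with seq_idx + postfix.
    """
    data_name = prefix + (file_format_data % seq_idx) + postfix
    targets_name = prefix + (file_format_targets % seq_idx) + postfix
    return data_name, targets_name


names = dump_file_names("/tmp/my-dump.", 3)
print(names)
```

Under this assumption, start_seq and end_seq would select which sequence indices, and therefore which files, are read.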
init_seq_order(epoch=None, seq_list=None, seq_order=None)
Parameters:
  • epoch (int|None) –

  • seq_list (list[str]|None) –

  • seq_order (list[int]|None) –

Return type:

bool

get_input_data(seq_idx)
Parameters:

seq_idx (int) –

Return type:

numpy.ndarray

get_targets(target, seq_idx)
Parameters:
  • target (str) –

  • seq_idx (int) –

Return type:

numpy.ndarray

get_seq_length(seq_idx)
Parameters:

seq_idx (int) –

Return type:

Util.NumbersDict

property num_seqs
Return type:

int

len_info()
Return type:

str