returnn.datasets.numpy_dump
¶
Provides NumpyDumpDataset
.
- class returnn.datasets.numpy_dump.NumpyDumpDataset(prefix, postfix='.txt.gz', start_seq=0, end_seq=None, num_inputs=None, num_outputs=None, **kwargs)[source]¶
For
tools/dump-dataset.py --type=numpy
.- Parameters:
name (str) – e.g. “train” or “eval”
window (int) – features will be of dimension window * feature_dim, as we add a context-window around. not all datasets support this option.
context_window (None|int|dict|NumbersDict|(dict,dict)) – will add this context for each chunk
chunking (None|str|int|(int,int)|dict|(dict,dict)|function) – “chunk_size:chunk_step”
seq_ordering (str|function) – “batching”-option in config. e.g. “default”, “sorted” or “random”. See self.get_seq_order_for_epoch() for more details.
fixed_random_seed (int|None) – for the shuffling, e.g. for seq_ordering=’random’. otherwise epoch will be used. useful when used as eval dataset.
random_seed_offset (int|None) – for shuffling, e.g. for seq_ordering=’random’. ignored when fixed_random_seed is set.
partition_epoch (int|None)
repeat_epoch (int|None) – Repeat the sequences in an epoch this many times. Useful to scale the dataset relative to other datasets, e.g. when used in CombinedDataset. Not allowed to be used in combination with partition_epoch.
seq_list_filter_file (str|None) – defines a subset of sequences (by tag) to use
unique_seq_tags (bool) – uniquify seqs with same seq tags in seq order
seq_order_seq_lens_file (str|None) – for seq order, use the seq length given by this file
shuffle_frames_of_nseqs (int) – shuffles the frames. not always supported
estimated_num_seqs (None|int) – for progress reporting in case the real num_seqs is unknown
_num_shards (int) – number of shards the data is split into
_shard_index (int) – local shard index, when sharding is enabled
- init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]¶
- Parameters:
epoch (int|None)
seq_list (list[str]|None)
seq_order (list[int]|None)
- Return type:
bool