returnn.datasets.hdf¶
Provides HDFDataset.
- class returnn.datasets.hdf.HDFDataset(files=None, use_cache_manager=False, **kwargs)[source]¶
Dataset based on HDF files. This was the main original dataset format of RETURNN.
- Parameters:
files (None|list[str])
use_cache_manager (bool) – uses Util.cf() for files
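Example (a minimal read-loop sketch; the file names are hypothetical, and initialize(), init_seq_order(), load_seqs(), get_data() and get_tag() are the usual methods from the common Dataset base class):

    from returnn.datasets.hdf import HDFDataset

    # Hypothetical file names, for illustration only.
    dataset = HDFDataset(files=["train-part1.hdf", "train-part2.hdf"])
    dataset.initialize()
    dataset.init_seq_order(epoch=1)

    dataset.load_seqs(0, 10)  # load the actual data for seqs [0, 10)
    for seq_idx in range(10):  # assumes the files contain at least 10 seqs
        features = dataset.get_data(seq_idx, "data")  # numpy.ndarray, shape (time, dim)
        tag = dataset.get_tag(seq_idx)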
- add_file(filename)[source]¶
- Sets up the data structures self.file_start and self.file_seq_start. Use load_seqs() to load the actual data.
- Parameters:
filename (str)
- get_data_by_seq_tag(seq_tag, key)[source]¶
- Parameters:
seq_tag (str)
key (str)
- Return type:
numpy.ndarray
- get_targets(target, sorted_seq_idx)[source]¶
- Parameters:
target (str)
sorted_seq_idx (int)
- Return type:
numpy.ndarray
- get_estimated_seq_length(seq_idx)[source]¶
- Parameters:
seq_idx (int) – for current epoch, not the corpus seq idx
- Return type:
int
- Returns:
sequence length of “data”, used for sequence sorting
- have_get_corpus_seq() → bool[source]¶
- Returns:
whether this dataset supports get_corpus_seq()
- get_corpus_seq(corpus_seq_idx: int) → DatasetSeq[source]¶
- Parameters:
corpus_seq_idx (int) – corpus seq idx
- Returns:
the seq with the given corpus seq idx
- Return type:
DatasetSeq
- class returnn.datasets.hdf.FeatureSequenceStreamParser(*args, **kwargs)[source]¶
Feature sequence stream parser.
- class returnn.datasets.hdf.SegmentAlignmentStreamParser(*args, **kwargs)[source]¶
Segment alignment stream parser.
- class returnn.datasets.hdf.NextGenHDFDataset(input_stream_name, files=None, **kwargs)[source]¶
Another separate dataset which uses HDF files to store the data.
- Parameters:
input_stream_name (str)
files (None|list[str])
- parsers = {'feature_sequence': <class 'returnn.datasets.hdf.FeatureSequenceStreamParser'>, 'segment_alignment': <class 'returnn.datasets.hdf.SegmentAlignmentStreamParser'>, 'sparse': <class 'returnn.datasets.hdf.SparseStreamParser'>}[source]¶
- class returnn.datasets.hdf.SiameseHDFDataset(input_stream_name, seq_label_stream='words', class_distribution=None, files=None, **kwargs)[source]¶
SiameseHDFDataset supports sequence sampling for weakly-supervised training. It accepts data in the format of NextGenHDFDataset and samples sequence triplets before each epoch. A triplet is a tuple of the form (anchor seq, random seq with the same label, random seq with a different label). Here we assume that each sequence in the input .hdf has a single label.
In the config, the streams can be accessed via e.g. ["data:features_0"], ["data:features_1"], ["data:features_2"]. The stream names depend on the stream names in the input data, e.g. "features", "data", "classes", etc.
The method _collect_single_seq(self, seq_idx) returns a DatasetSeq with an extended dictionary of targets. The key "data:features_0" stands for the features of the anchor sequence from the input data; in NextGenHDFDataset it would correspond to "data:features" or "data". The key "data:features_1" denotes the pair of "data:features_0": for each anchor sequence, SiameseHDFDataset randomly samples a sequence with the same label. The key "data:features_2" denotes the third element of the triplet: for each anchor sequence, SiameseHDFDataset randomly samples a sequence with a different label. Targets are split into separate streams in the same way, e.g. "data:classes_0", "data:classes_1", "data:classes_2".
SiameseHDFDataset also supports non-uniform sampling and accepts a path to a .npz matrix. Each row of this matrix should hold the probabilities for each of the classes to be sampled; this distribution might reflect class similarities.
This dataset can be useful for metric learning, where we want to learn representations of input sequences such that sequences of the same class lie close together, while sequences with different labels lie far apart.
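A minimal config sketch (hedged; the file name, stream names, and the .npz path are assumptions, and the dataset dict follows the usual RETURNN "class" convention):

    # Hypothetical values throughout; only the keys follow the documented parameters.
    train = {
        "class": "SiameseHDFDataset",
        "input_stream_name": "features",          # name of the feature stream
        "seq_label_stream": "words",              # name of the label stream
        "class_distribution": "class_probs.npz",  # optional n x n sampling matrix
        "files": ["train.hdf"],
    }
    # The triplet streams are then accessible in the network as "data:features_0"
    # (anchor), "data:features_1" (same label), "data:features_2" (different label),
    # and likewise "data:classes_0", "data:classes_1", "data:classes_2".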
- Parameters:
input_stream_name (str) – name of a feature stream
seq_label_stream (str) – name of a stream with labels
class_distribution (str) – path to a .npz file of size n x n (n is the number of classes), where row i contains the probabilities of the other classes to be picked in triplets when sampling a pair for an element of class i
files (list[str]) – list of paths to .hdf files
- parsers = {'feature_sequence': <class 'returnn.datasets.hdf.FeatureSequenceStreamParser'>, 'segment_alignment': <class 'returnn.datasets.hdf.SegmentAlignmentStreamParser'>, 'sparse': <class 'returnn.datasets.hdf.SparseStreamParser'>}[source]¶
- add_file(path)[source]¶
Register an input file and its sequences.
- Parameters:
path (str) – path to single .hdf file
- init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]¶
- Parameters:
epoch (int|None) – current epoch id
seq_list (list[str]|None) – List of sequence tags, to set a predefined order.
seq_order (list[int]|None) – List of corpus sequence indices, to set a predefined order.
- class returnn.datasets.hdf.SimpleHDFWriter(filename, dim, labels=None, ndim=None, extra_type=None, swmr=False, extend_existing_file=False)[source]¶
Intended as a simple interface to dump data on-the-fly into an HDF file, which can later be read by HDFDataset.
Note that we dump to a temp file first, and only at close() do we move it over to the real destination.
- Parameters:
filename (str) – Create file, truncate if exists
dim (int|None)
ndim (int) – counted without batch
labels (list[str]|None)
extra_type (dict[str,(int,int,str)]|None) – key -> (dim,ndim,dtype)
swmr (bool) – see https://docs.h5py.org/en/stable/swmr.html
extend_existing_file (bool) – True also means we expect that it exists
- insert_batch(inputs, seq_len, seq_tag, extra=None)[source]¶
- Parameters:
inputs (numpy.ndarray) – shape=(n_batch,time,data) (or (n_batch,time), or (n_batch,time1,time2), …)
seq_len (list[int]|dict[int,list[int]|numpy.ndarray]) – sequence lengths (per axis, excluding batch axis)
seq_tag (list[str|bytes]) – sequence tags of length n_batch
extra (dict[str,numpy.ndarray]|None) – one or multiple possible targets data. The key can be "classes" or anything else. The dtype and dim are inferred automatically from the numpy array. If there are multiple items, the seq length currently must be the same. Arrays must be batch-major, followed by the time axis, then the feature axis.
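Example (a minimal sketch following the documented signature; the file name, dimensions, and tags are made up):

    import numpy as np
    from returnn.datasets.hdf import SimpleHDFWriter

    # Hypothetical file name and dims, for illustration only.
    writer = SimpleHDFWriter(filename="features.hdf", dim=40, ndim=2)

    inputs = np.zeros((2, 100, 40), dtype="float32")  # (n_batch, time, dim), zero-padded
    writer.insert_batch(
        inputs=inputs,
        seq_len=[100, 80],           # one length per sequence in the batch
        seq_tag=["seq-0", "seq-1"],  # one tag per sequence in the batch
    )
    writer.close()  # moves the temp file over to the real destination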
- class returnn.datasets.hdf.HDFDatasetWriter(filename)[source]¶
Similar to SimpleHDFWriter, but mostly intended to copy an existing dataset, see dump_from_dataset(). The resulting HDF file can later be read by HDFDataset.
- Parameters:
filename (str) – for the HDF to write
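Example (a hedged sketch of copying an existing dataset; the source dataset and file names here are placeholders):

    from returnn.datasets.hdf import HDFDataset, HDFDatasetWriter

    # Hypothetical source dataset, for illustration only.
    dataset = HDFDataset(files=["train-part1.hdf"])
    dataset.initialize()
    dataset.init_seq_order(epoch=1)

    writer = HDFDatasetWriter("train-copy.hdf")
    writer.dump_from_dataset(dataset)  # copy the seqs of the dataset into the HDF
    writer.close()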