returnn.datasets.hdf#

Provides HDFDataset.

class returnn.datasets.hdf.HDFDataset(files=None, use_cache_manager=False, **kwargs)[source]#

Dataset based on HDF files. This was the main original dataset format of RETURNN.

Parameters:
  • files (None|list[str]) –

  • use_cache_manager (bool) – uses Util.cf() for files
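
For illustration, a hedged sketch of the typical way to use HDFDataset, via a dataset dict in a RETURNN config; the file names are placeholders for your own HDF files:

    train = {
        "class": "HDFDataset",
        "files": ["train-1.hdf", "train-2.hdf"],  # placeholder file names
        "use_cache_manager": False,
    }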

add_file(filename)[source]#
Sets up the internal data structures self.file_start and self.file_seq_start.

Use load_seqs() to load the actual data.

Parameters:

filename (str) –

get_data(seq_idx, key)[source]#
Parameters:
  • seq_idx (int) –

  • key (str) –

Return type:

numpy.ndarray
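
A hedged sketch of per-sequence access within an epoch, assuming dataset is an already initialized HDFDataset and that "data" and "classes" are valid keys in the HDF files; load_seqs() must be called before get_data():

    dataset.init_seq_order(epoch=1)
    seq_idx = 0
    dataset.load_seqs(seq_idx, seq_idx + 1)  # load the requested seq range first
    feats = dataset.get_data(seq_idx, "data")       # numpy.ndarray
    targets = dataset.get_data(seq_idx, "classes")  # only if "classes" exists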

get_data_by_seq_tag(seq_tag, key)[source]#
Parameters:
  • seq_tag (str) –

  • key (str) –

Return type:

numpy.ndarray

get_input_data(sorted_seq_idx)[source]#
Parameters:

sorted_seq_idx (int) –

Return type:

numpy.ndarray

get_targets(target, sorted_seq_idx)[source]#
Parameters:
  • target (str) –

  • sorted_seq_idx (int) –

Return type:

numpy.ndarray

get_estimated_seq_length(seq_idx)[source]#
Parameters:

seq_idx (int) – for current epoch, not the corpus seq idx

Returns:

sequence length of "data", used for sequence sorting

Return type:

int

get_tag(sorted_seq_idx)[source]#
Parameters:

sorted_seq_idx (int) –

Return type:

str

have_get_corpus_seq() → bool[source]#
Returns:

whether this dataset supports get_corpus_seq()

get_corpus_seq(corpus_seq_idx: int) → DatasetSeq[source]#
Parameters:

corpus_seq_idx (int) – corpus seq idx

Returns:

the seq with the given corpus seq idx

Return type:

DatasetSeq
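
For illustration, a hedged sketch of reading sequences via this corpus-level API; it assumes dataset is an already initialized HDFDataset, that "data" is a valid key, and that DatasetSeq exposes seq_tag and a features dict (an assumption for this sketch):

    if dataset.have_get_corpus_seq():
        for corpus_seq_idx in range(dataset.get_total_num_seqs()):
            seq = dataset.get_corpus_seq(corpus_seq_idx)  # DatasetSeq
            print(seq.seq_tag, seq.features["data"].shape)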

get_all_tags()[source]#
Return type:

list[str]

get_total_num_seqs()[source]#
Return type:

int

is_data_sparse(key)[source]#
Parameters:

key (str) –

Return type:

bool

get_data_dtype(key)[source]#
Parameters:

key (str) –

Return type:

str

len_info()[source]#
Return type:

str

class returnn.datasets.hdf.StreamParser(seq_names, stream)[source]#

Stream parser.

get_data(seq_name)[source]#
Parameters:

seq_name (str) –

Return type:

numpy.ndarray

get_seq_length(seq_name)[source]#
Parameters:

seq_name (str) –

Return type:

int

get_dtype()[source]#
Return type:

str

class returnn.datasets.hdf.FeatureSequenceStreamParser(*args, **kwargs)[source]#

Feature sequence stream parser.

get_data(seq_name)[source]#
Parameters:

seq_name (str) –

Return type:

numpy.ndarray

get_seq_length(seq_name)[source]#
Parameters:

seq_name (str) –

Return type:

int

class returnn.datasets.hdf.SparseStreamParser(*args, **kwargs)[source]#

Sparse stream parser.

get_data(seq_name)[source]#
Parameters:

seq_name (str) –

Return type:

numpy.ndarray

get_seq_length(seq_name)[source]#
Parameters:

seq_name (str) –

Return type:

int

class returnn.datasets.hdf.SegmentAlignmentStreamParser(*args, **kwargs)[source]#

Segment alignment stream parser.

get_data(seq_name)[source]#
Parameters:

seq_name (str) –

Returns:

flattened two-dimensional data where the 2nd dimension is 2: [class, segment end]

Return type:

numpy.ndarray

get_seq_length(seq_name)[source]#
Parameters:

seq_name (str) –

Return type:

int

class returnn.datasets.hdf.NextGenHDFDataset(input_stream_name, files=None, **kwargs)[source]#

Another separate dataset which uses HDF files to store the data.

Parameters:
  • input_stream_name (str) –

  • files (None|list[str]) –
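
A minimal, hedged config sketch; the stream name "features" and the file name are illustrative assumptions:

    train = {
        "class": "NextGenHDFDataset",
        "input_stream_name": "features",  # assumed stream name
        "files": ["train-next-gen.hdf"],  # placeholder file name
    }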

parsers = {'feature_sequence': <class 'returnn.datasets.hdf.FeatureSequenceStreamParser'>, 'segment_alignment': <class 'returnn.datasets.hdf.SegmentAlignmentStreamParser'>, 'sparse': <class 'returnn.datasets.hdf.SparseStreamParser'>}[source]#
add_file(path)[source]#
Parameters:

path (str) –

initialize()[source]#

Initialization.

init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]#
Parameters:
  • seq_list (list[str]|None) – List of sequence tags, to set a predefined order.

  • seq_order (list[int]|None) – List of corpus sequence indices, to set a predefined order.

supports_seq_order_sorting() → bool[source]#

supports sorting

get_data_dtype(key)[source]#
Parameters:

key (str) – e.g. “data”

Return type:

str

class returnn.datasets.hdf.SiameseHDFDataset(input_stream_name, seq_label_stream='words', class_distribution=None, files=None, **kwargs)[source]#

SiameseHDFDataset allows sequence sampling for weakly supervised training. It accepts data in the format of NextGenHDFDataset and samples sequence triplets before each epoch. A triplet is a tuple of the form (anchor seq, random seq with the same label, random seq with a different label). Here we assume that each sequence in the input .hdf has a single label.

In the config, streams can be accessed via e.g. ["data:features_0"], ["data:features_1"], ["data:features_2"]. Stream names depend on the stream names in the input data, e.g. "features", "data", "classes", etc. The method _collect_single_seq(self, seq_idx) returns a DatasetSeq with an extended dictionary of targets. The key "data:features_0" stands for the features of the anchor sequences from the input data; in NextGenHDFDataset it would correspond to "data:features" or "data". "data:features_1" denotes the paired element of "data:features_0": for each anchor sequence, SiameseHDFDataset randomly samples a sequence with the same label. "data:features_2" denotes the third element of the triplet: for each anchor sequence, SiameseHDFDataset randomly samples a sequence with a different label. Targets are split into separate streams as well, e.g. "data:classes_0", "data:classes_1", "data:classes_2".

SiameseHDFDataset also supports non-uniform sampling and accepts a path to a .npz matrix. Each row of this matrix should contain the probabilities with which each of the classes is sampled. This probability distribution might reflect class similarities.

This dataset might be useful for metric learning, where we want to learn representations of input sequences such that sequences belonging to the same class are close together, while sequences with different labels are far apart.

Parameters:
  • input_stream_name (str) – name of a feature stream

  • seq_label_stream (str) – name of a stream with labels

  • class_distribution (str) – path to .npz file of size n x n (n is a number of classes), where each line i contains probs of other classes to be picked in triplets when sampling a pair for element from class i

  • files (list[str]) – list of paths to .hdf files
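
A hedged config sketch; the stream names, the class-distribution file, and the HDF file name are illustrative assumptions:

    train = {
        "class": "SiameseHDFDataset",
        "input_stream_name": "features",           # assumed feature stream
        "seq_label_stream": "words",               # assumed label stream
        "class_distribution": "class_probs.npz",   # optional n x n sampling matrix
        "files": ["siamese-train.hdf"],            # placeholder file name
    }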

parsers = {'feature_sequence': <class 'returnn.datasets.hdf.FeatureSequenceStreamParser'>, 'segment_alignment': <class 'returnn.datasets.hdf.SegmentAlignmentStreamParser'>, 'sparse': <class 'returnn.datasets.hdf.SparseStreamParser'>}[source]#
add_file(path)[source]#

register input files and sequences

Parameters:

path (str) – path to single .hdf file

initialize()[source]#

initialize target_to_seqs and seq_to_target dicts

init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]#
Parameters:
  • epoch (int|None) – current epoch id

  • seq_list (list[str]|None) – List of sequence tags, to set a predefined order.

  • seq_order (list[int]|None) – List of corpus sequence indices, to set a predefined order.

is_data_sparse(key)[source]#
Parameters:

key (str) – e.g. “features_0” or “orth_features_0” or “words_0”

Returns:

whether the data is sparse

Return type:

bool

get_data_dim(key)[source]#
Parameters:

key (str) – e.g. “features_0”, “features_1”, “classes_0”, etc.

Returns:

number of classes, no matter if sparse or not

Return type:

int

class returnn.datasets.hdf.SimpleHDFWriter(filename, dim, labels=None, ndim=None, extra_type=None, swmr=False, extend_existing_file=False)[source]#

Intended as a simple interface to dump data on-the-fly into an HDF file, which can be read later by HDFDataset.

Note that we dump to a temp file first, and only at close() do we move it over to the real destination.

Parameters:
  • filename (str) – Create file, truncate if exists

  • dim (int|None) –

  • ndim (int) – counted without batch

  • labels (list[str]|None) –

  • extra_type (dict[str,(int,int,str)]|None) – key -> (dim,ndim,dtype)

  • swmr (bool) – see https://docs.h5py.org/en/stable/swmr.html

  • extend_existing_file (bool) – True also means we expect that it exists

insert_batch(inputs, seq_len, seq_tag, extra=None)[source]#
Parameters:
  • inputs (numpy.ndarray) – shape=(n_batch,time,data) (or (n_batch,time), or (n_batch,time1,time2), …)

  • seq_len (list[int]|dict[int,list[int]|numpy.ndarray]) – sequence lengths (per axis, excluding batch axis)

  • seq_tag (list[str|bytes]) – sequence tags of length n_batch

  • extra (dict[str,numpy.ndarray]|None) – one or multiple possible target data arrays. The key can be "classes" or anything else. The dtype and dim are inferred automatically from the NumPy array. If there are multiple items, their seq lengths must currently be the same. Arrays must be batch-major, followed by the time axis and then the feature axis.

close()[source]#

Closes the file.
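
A minimal usage sketch, assuming 40-dimensional dense input features and sparse "classes" targets with 10 classes; the file name, shapes, and random data are illustrative only:

    import numpy
    from returnn.datasets.hdf import SimpleHDFWriter

    writer = SimpleHDFWriter(
        filename="out.hdf", dim=40, ndim=2,         # inputs: (batch, time, 40)
        extra_type={"classes": (10, 1, "int32")})   # sparse targets: (batch, time)
    feats = numpy.random.randn(1, 7, 40).astype("float32")
    classes = numpy.random.randint(0, 10, size=(1, 7)).astype("int32")
    writer.insert_batch(
        inputs=feats,
        seq_len=[7],          # time length per sequence in the batch
        seq_tag=["seq-0"],
        extra={"classes": classes})
    writer.close()  # moves the temp file over to the real destination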

class returnn.datasets.hdf.HDFDatasetWriter(filename)[source]#

Similar to SimpleHDFWriter, but mostly intended to copy an existing dataset; see dump_from_dataset(). The resulting HDF file can be read later by HDFDataset.

Parameters:

filename (str) – for the HDF to write

close()[source]#

Close the HDF file.

dump_from_dataset(dataset, epoch=1, start_seq=0, end_seq=inf, use_progress_bar=True)[source]#
Parameters:
  • dataset (Dataset) – could be any dataset implemented as child of Dataset

  • epoch (int) – for dataset

  • start_seq (int) –

  • end_seq (int|float) –

  • use_progress_bar (bool) –
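
For illustration, a hedged sketch of copying an existing dataset into a new HDF file; the source dataset options and file names are placeholders, and init_dataset is assumed to return an initialized Dataset:

    from returnn.datasets.basic import init_dataset
    from returnn.datasets.hdf import HDFDatasetWriter

    source = init_dataset({"class": "HDFDataset", "files": ["existing.hdf"]})
    writer = HDFDatasetWriter(filename="copy.hdf")
    writer.dump_from_dataset(source, epoch=1)
    writer.close()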