returnn.datasets.hdf#

Provides HDFDataset.

class returnn.datasets.hdf.HDFDataset(files=None, use_cache_manager=False, **kwargs)[source]#

Dataset based on HDF files. This was the main original dataset format of RETURNN.

Parameters:
  • files (None|list[str]) –

  • use_cache_manager (bool) – uses Util.cf() for files
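
For illustration, a hedged sketch of the typical way to use HDFDataset, via a dataset dict in a RETURNN config; the file names are placeholders for your own HDF files:

    train = {
        "class": "HDFDataset",
        "files": ["train-1.hdf", "train-2.hdf"],  # placeholder file names
        "use_cache_manager": False,
    }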

add_file(filename)[source]#
Sets up the internal data structures self.file_start and self.file_seq_start.

Use load_seqs() to load the actual data.

Parameters:

filename (str) –

get_data(seq_idx, key)[source]#
Parameters:
  • seq_idx (int) –

  • key (str) –

Return type:

numpy.ndarray
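
A hedged sketch of per-sequence access within an epoch, assuming dataset is an already initialized HDFDataset and that "data" and "classes" are valid keys in the HDF files; load_seqs() must be called before get_data():

    dataset.init_seq_order(epoch=1)
    seq_idx = 0
    dataset.load_seqs(seq_idx, seq_idx + 1)  # load the requested seq range first
    feats = dataset.get_data(seq_idx, "data")       # numpy.ndarray
    targets = dataset.get_data(seq_idx, "classes")  # only if "classes" exists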

get_data_by_seq_tag(seq_tag, key)[source]#
Parameters:
  • seq_tag (str) –

  • key (str) –

Return type:

numpy.ndarray

get_input_data(sorted_seq_idx)[source]#
Parameters:

sorted_seq_idx (int) –

Return type:

numpy.ndarray

get_targets(target, sorted_seq_idx)[source]#
Parameters:
  • target (str) –

  • sorted_seq_idx (int) –

Return type:

numpy.ndarray

get_estimated_seq_length(seq_idx)[source]#
Parameters:

seq_idx (int) – for current epoch, not the corpus seq idx

Returns:

sequence length of "data", used for sequence sorting

Return type:

int

get_tag(sorted_seq_idx)[source]#
Parameters:

sorted_seq_idx (int) –

Return type:

str

have_get_corpus_seq() → bool[source]#
Returns:

whether this dataset supports get_corpus_seq()

get_corpus_seq(corpus_seq_idx: int) → DatasetSeq[source]#
Parameters:

corpus_seq_idx (int) – corpus seq idx

Returns:

the seq with the given corpus seq idx

Return type:

DatasetSeq
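
For illustration, a hedged sketch of reading sequences via this corpus-level API; it assumes dataset is an already initialized HDFDataset, that "data" is a valid key, and that DatasetSeq exposes seq_tag and a features dict (an assumption for this sketch):

    if dataset.have_get_corpus_seq():
        for corpus_seq_idx in range(dataset.get_total_num_seqs()):
            seq = dataset.get_corpus_seq(corpus_seq_idx)  # DatasetSeq
            print(seq.seq_tag, seq.features["data"].shape)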

get_all_tags()[source]#
Return type:

list[str]

get_total_num_seqs()[source]#
Return type:

int

is_data_sparse(key)[source]#
Parameters:

key (str) –

Return type:

bool

get_data_dtype(key)[source]#
Parameters:

key (str) –

Return type:

str

len_info()[source]#
Return type:

str

class returnn.datasets.hdf.StreamParser(seq_names, stream)[source]#

Stream parser.

get_data(seq_name)[source]#
Parameters:

seq_name (str) –

Return type:

numpy.ndarray

get_seq_length(seq_name)[source]#
Parameters:

seq_name (str) –

Return type:

int

get_dtype()[source]#
Return type:

str

class returnn.datasets.hdf.FeatureSequenceStreamParser(*args, **kwargs)[source]#

Feature sequence stream parser.

get_data(seq_name)[source]#
Parameters:

seq_name (str) –

Return type:

numpy.ndarray

get_seq_length(seq_name)[source]#
Parameters:

seq_name (str) –

Return type:

int

class returnn.datasets.hdf.SparseStreamParser(*args, **kwargs)[source]#

Sparse stream parser.

get_data(seq_name)[source]#
Parameters:

seq_name (str) –

Return type:

numpy.ndarray

get_seq_length(seq_name)[source]#
Parameters:

seq_name (str) –

Return type:

int

class returnn.datasets.hdf.SegmentAlignmentStreamParser(*args, **kwargs)[source]#

Segment alignment stream parser.

get_data(seq_name)[source]#
Parameters:

seq_name (str) –

Returns:

flattened two-dimensional data where the 2nd dimension is 2: [class, segment end]

Return type:

numpy.ndarray

get_seq_length(seq_name)[source]#
Parameters:

seq_name (str) –

Return type:

int

class returnn.datasets.hdf.NextGenHDFDataset(input_stream_name, files=None, **kwargs)[source]#

Another separate dataset which uses HDF files to store the data.

Parameters:
  • input_stream_name (str) –

  • files (None|list[str]) –
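
A minimal, hedged config sketch; the stream name "features" and the file name are illustrative assumptions:

    train = {
        "class": "NextGenHDFDataset",
        "input_stream_name": "features",  # assumed stream name
        "files": ["train-next-gen.hdf"],  # placeholder file name
    }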

parsers = {'feature_sequence': <class 'returnn.datasets.hdf.FeatureSequenceStreamParser'>, 'segment_alignment': <class 'returnn.datasets.hdf.SegmentAlignmentStreamParser'>, 'sparse': <class 'returnn.datasets.hdf.SparseStreamParser'>}[source]#
add_file(path)[source]#
Parameters:

path (str) –

initialize()[source]#

Initialization.

init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]#
Parameters:
  • seq_list (list[str]|None) – List of sequence tags, to set a predefined order.

  • seq_order (list[int]|None) – List of corpus sequence indices, to set a predefined order.

supports_seq_order_sorting() → bool[source]#

supports sorting

get_data_dtype(key)[source]#
Parameters:

key (str) – e.g. “data”

Return type:

str

class returnn.datasets.hdf.SiameseHDFDataset(input_stream_name, seq_label_stream='words', class_distribution=None, files=None, **kwargs)[source]#

SiameseHDFDataset allows sequence sampling for weakly supervised training. It accepts data in the format of NextGenHDFDataset and samples sequence triplets before each epoch. A triplet is a tuple of the form (anchor seq, random seq with the same label, random seq with a different label). Here we assume that each sequence in the input .hdf has a single label.

In the config, streams can be accessed via e.g. ["data:features_0"], ["data:features_1"], ["data:features_2"]. Stream names depend on the stream names in the input data, e.g. "features", "data", "classes", etc. The method _collect_single_seq(self, seq_idx) returns a DatasetSeq with an extended dictionary of targets. The key "data:features_0" stands for the features of the anchor sequences from the input data; in NextGenHDFDataset it would correspond to "data:features" or "data". "data:features_1" denotes the paired element of "data:features_0": for each anchor sequence, SiameseHDFDataset randomly samples a sequence with the same label. "data:features_2" denotes the third element of the triplet: for each anchor sequence, SiameseHDFDataset randomly samples a sequence with a different label. Targets are split into separate streams as well, e.g. "data:classes_0", "data:classes_1", "data:classes_2".

SiameseHDFDataset also supports non-uniform sampling and accepts a path to a .npz matrix. Each row of this matrix should contain the probabilities with which each of the classes is sampled. This probability distribution might reflect class similarities.

This dataset might be useful for metric learning, where we want to learn representations of input sequences such that sequences belonging to the same class are close together, while sequences with different labels are far apart.

Parameters:
  • input_stream_name (str) – name of a feature stream

  • seq_label_stream (str) – name of a stream with labels

  • class_distribution (str) – path to .npz file of size n x n (n is a number of classes), where each line i contains probs of other classes to be picked in triplets when sampling a pair for element from class i

  • files (list[str]) – list of paths to .hdf files
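
A hedged config sketch; the stream names, the class-distribution file, and the HDF file name are illustrative assumptions:

    train = {
        "class": "SiameseHDFDataset",
        "input_stream_name": "features",           # assumed feature stream
        "seq_label_stream": "words",               # assumed label stream
        "class_distribution": "class_probs.npz",   # optional n x n sampling matrix
        "files": ["siamese-train.hdf"],            # placeholder file name
    }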

parsers = {'feature_sequence': <class 'returnn.datasets.hdf.FeatureSequenceStreamParser'>, 'segment_alignment': <class 'returnn.datasets.hdf.SegmentAlignmentStreamParser'>, 'sparse': <class 'returnn.datasets.hdf.SparseStreamParser'>}[source]#
add_file(path)[source]#

register input files and sequences

Parameters:

path (str) – path to single .hdf file

initialize()[source]#

initialize target_to_seqs and seq_to_target dicts

init_seq_order(epoch=None, seq_list=None, seq_order=None)[source]#
Parameters:
  • epoch (int|None) – current epoch id

  • seq_list (list[str]|None) – List of sequence tags, to set a predefined order.

  • seq_order (list[int]|None) – List of corpus sequence indices, to set a predefined order.

is_data_sparse(key)[source]#
Parameters:

key (str) – e.g. “features_0” or “orth_features_0” or “words_0”

Returns:

whether the data is sparse

Return type:

bool

get_data_dim(key)[source]#
Parameters:

key (str) – e.g. “features_0”, “features_1”, “classes_0”, etc.

Returns:

number of classes, no matter if sparse or not

Return type:

int

class returnn.datasets.hdf.SimpleHDFWriter(filename, dim, labels=None, ndim=None, extra_type=None, swmr=False, extend_existing_file=False)[source]#

Intended as a simple interface to dump data on-the-fly into an HDF file, which can be read later by HDFDataset.

Note that we dump to a temp file first, and only at close() do we move it over to the real destination.

Parameters:
  • filename (str) – Create file, truncate if exists

  • dim (int|None) –

  • ndim (int) – counted without batch

  • labels (list[str]|None) –

  • extra_type (dict[str,(int,int,str)]|None) – key -> (dim,ndim,dtype)

  • swmr (bool) – see https://docs.h5py.org/en/stable/swmr.html

  • extend_existing_file (bool) – True also means we expect that it exists

insert_batch(inputs, seq_len, seq_tag, extra=None)[source]#
Parameters:
  • inputs (numpy.ndarray) – shape=(n_batch,time,data) (or (n_batch,time), or (n_batch,time1,time2), …)

  • seq_len (list[int]|dict[int,list[int]|numpy.ndarray]) – sequence lengths (per axis, excluding batch axis)

  • seq_tag (list[str|bytes]) – sequence tags of length n_batch

  • extra (dict[str,numpy.ndarray]|None) – one or multiple possible target data arrays. The key can be "classes" or anything else. The dtype and dim are inferred automatically from the NumPy array. If there are multiple items, their seq lengths must currently be the same. Arrays must be batch-major, followed by the time axis and then the feature axis.

close()[source]#

Closes the file.
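
A minimal usage sketch, assuming 40-dimensional dense input features and sparse "classes" targets with 10 classes; the file name, shapes, and random data are illustrative only:

    import numpy
    from returnn.datasets.hdf import SimpleHDFWriter

    writer = SimpleHDFWriter(
        filename="out.hdf", dim=40, ndim=2,         # inputs: (batch, time, 40)
        extra_type={"classes": (10, 1, "int32")})   # sparse targets: (batch, time)
    feats = numpy.random.randn(1, 7, 40).astype("float32")
    classes = numpy.random.randint(0, 10, size=(1, 7)).astype("int32")
    writer.insert_batch(
        inputs=feats,
        seq_len=[7],          # time length per sequence in the batch
        seq_tag=["seq-0"],
        extra={"classes": classes})
    writer.close()  # moves the temp file over to the real destination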

class returnn.datasets.hdf.HDFDatasetWriter(filename)[source]#

Similar to SimpleHDFWriter, but mostly intended to copy an existing dataset; see dump_from_dataset(). The resulting HDF file can be read later by HDFDataset.

Parameters:

filename (str) – for the HDF to write

close()[source]#

Close the HDF file.

dump_from_dataset(dataset, epoch=1, start_seq=0, end_seq=inf, use_progress_bar=True)[source]#
Parameters:
  • dataset (Dataset) – could be any dataset implemented as child of Dataset

  • epoch (int) – for dataset

  • start_seq (int) –

  • end_seq (int|float) –

  • use_progress_bar (bool) –
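
For illustration, a hedged sketch of copying an existing dataset into a new HDF file; the source dataset options and file names are placeholders, and init_dataset is assumed to return an initialized Dataset:

    from returnn.datasets.basic import init_dataset
    from returnn.datasets.hdf import HDFDatasetWriter

    source = init_dataset({"class": "HDFDataset", "files": ["existing.hdf"]})
    writer = HDFDatasetWriter(filename="copy.hdf")
    writer.dump_from_dataset(source, epoch=1)
    writer.close()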