HDFDataset

Provides HDFDataset.

class HDFDataset.HDFDataset(files=None, use_cache_manager=False, **kwargs)[source]

Dataset based on HDF files. This was the main original dataset format of RETURNN.

Parameters:
  • files (None|list[str]) – list of paths to HDF files
  • use_cache_manager (bool) – uses Util.cf() (cache manager) for the files
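A minimal usage sketch (hypothetical; the file path and the data keys “data”/“classes” are placeholders and depend on how the HDF file was created):

    from HDFDataset import HDFDataset

    dataset = HDFDataset(files=["my_data.hdf"])
    dataset.initialize()                  # generic dataset setup
    dataset.init_seq_order(epoch=1)       # define the sequence order for this epoch
    dataset.load_seqs(0, 10)              # load the first 10 sequences into memory
    feat = dataset.get_data(0, "data")    # numpy.ndarray, shape (time, dim)
    print(dataset.get_tag(0), feat.shape, feat.dtype)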
add_file(self, filename)[source]
Sets up the data structures self.file_start and self.file_seq_start.

Use load_seqs() to load the actual data.

Parameters:filename (str) –

get_data(self, seq_idx, key)[source]
Parameters:
  • seq_idx (int) –
  • key (str) –
Return type:numpy.ndarray
get_input_data(self, sorted_seq_idx)[source]
Parameters:sorted_seq_idx (int) –
Return type:numpy.ndarray
get_targets(self, target, sorted_seq_idx)[source]
Parameters:
  • target (str) –
  • sorted_seq_idx (int) –
Return type:numpy.ndarray
get_tag(self, sorted_seq_idx)[source]
Parameters:sorted_seq_idx (int) –
Return type:str
get_all_tags(self)[source]
Return type:list[str]
get_total_num_seqs(self)[source]
Return type:int
is_data_sparse(self, key)[source]
Parameters:key (str) –
Return type:bool
get_data_dtype(self, key)[source]
Parameters:key (str) –
Return type:str
len_info(self)[source]
Return type:str
class HDFDataset.StreamParser(seq_names, stream)[source]
get_data(self, seq_name)[source]
get_seq_length(self, seq_name)[source]
get_dtype(self)[source]
class HDFDataset.FeatureSequenceStreamParser(*args, **kwargs)[source]
get_data(self, seq_name)[source]
get_seq_length(self, seq_name)[source]
class HDFDataset.SparseStreamParser(*args, **kwargs)[source]
get_data(self, seq_name)[source]
get_seq_length(self, seq_name)[source]
class HDFDataset.SegmentAlignmentStreamParser(*args, **kwargs)[source]
get_data(self, seq_name)[source]
get_seq_length(self, seq_name)[source]
class HDFDataset.NextGenHDFDataset(input_stream_name, files=None, **kwargs)[source]

Another separate dataset which uses HDF files to store the data.

Parameters:
  • input_stream_name (str) –
  • files (None|list[str]) –
parsers = {'feature_sequence': <class 'HDFDataset.FeatureSequenceStreamParser'>, 'segment_alignment': <class 'HDFDataset.SegmentAlignmentStreamParser'>, 'sparse': <class 'HDFDataset.SparseStreamParser'>}[source]
add_file(self, path)[source]
initialize(self)[source]

Does the main initialization before it can be used. This needs to be called before self.load_seqs() can be used.

init_seq_order(self, epoch=None, seq_list=None)[source]
Parameters:seq_list (list[str]|None) – In case we want to set a predefined order.
get_data_dtype(self, key)[source]
Parameters:key (str) –
Return type:str
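A minimal construction sketch for NextGenHDFDataset (hypothetical; “train.hdf” and the stream name “features” are placeholders that must match the streams stored in the HDF file):

    from HDFDataset import NextGenHDFDataset

    dataset = NextGenHDFDataset(input_stream_name="features", files=["train.hdf"])
    dataset.initialize()                  # must be called before load_seqs()
    dataset.init_seq_order(epoch=1)
    dataset.load_seqs(0, 1)
    print(dataset.get_data_dtype("data"))  # "data" is assumed to map to the input stream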
class HDFDataset.SiameseHDFDataset(input_stream_name, seq_label_stream='words', class_distribution=None, files=None, **kwargs)[source]

SiameseHDFDataset supports sequence sampling for weakly supervised training. It accepts data in the format of NextGenHDFDataset and samples sequence triplets before each epoch. A triplet is a tuple of the form (anchor seq, random seq with the same label, random seq with a different label). Here we assume that each dataset from the input .hdf has a single label.

In the config the streams can be accessed e.g. via [“data:features_0”], [“data:features_1”], [“data:features_2”] (see the config sketch below). The split names depend on the stream names in the input data, e.g. “features”, “data”, “classes”, etc. The method _collect_single_seq(self, seq_idx) returns a DatasetSeq with an extended dictionary of targets. The key “data:features_0” stands for the features of the anchor sequence from the input data; in NextGenHDFDataset it would correspond to “data:features” or “data”. “data:features_1” denotes the pair of “data:features_0”: for each anchor sequence, SiameseHDFDataset randomly samples a sequence with the same label. “data:features_2” denotes the third element of the triplet: for each anchor sequence, SiameseHDFDataset randomly samples a sequence with a different label. Targets are split into separate streams as well, e.g. “data:classes_0”, “data:classes_1”, “data:classes_2”.

SiameseHDFDataset also supports non-uniform sampling and accepts a path to a .npz matrix. The rows of this matrix should contain the probabilities for each of the classes to be sampled. This probability distribution might reflect class similarities.

This dataset can be useful for metric learning, where we want to learn representations of input sequences such that sequences belonging to the same class are close together, while sequences with different labels are far apart.

Parameters:
  • input_stream_name (str) – name of a feature stream
  • seq_label_stream (str) – name of a stream with labels
  • class_distribution (str) – path to a .npz file of size n x n (n is the number of classes), where row i contains the probabilities of the other classes being picked in triplets when sampling a pair for an element of class i
  • files (list[str]) – list of paths to .hdf files
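A hypothetical config sketch (the stream names “features”/“classes” and all paths are placeholders; they must match the streams in the input HDF files):

    train = {
        "class": "SiameseHDFDataset",
        "input_stream_name": "features",
        "seq_label_stream": "classes",
        "class_distribution": "class_probs.npz",  # optional n x n sampling matrix
        "files": ["train_siamese.hdf"],
    }

    network = {
        # anchor / same-label / different-label elements of each sampled triplet
        "anchor":   {"class": "linear", "activation": "tanh", "n_out": 128, "from": ["data:features_0"]},
        "positive": {"class": "linear", "activation": "tanh", "n_out": 128, "from": ["data:features_1"]},
        "negative": {"class": "linear", "activation": "tanh", "n_out": 128, "from": ["data:features_2"]},
    }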
parsers = {'feature_sequence': <class 'HDFDataset.FeatureSequenceStreamParser'>, 'segment_alignment': <class 'HDFDataset.SegmentAlignmentStreamParser'>, 'sparse': <class 'HDFDataset.SparseStreamParser'>}[source]
add_file(self, path)[source]

Registers input files and sequences.

Parameters:path (str) – path to a single .hdf file

initialize(self)[source]

Initializes the target_to_seqs and seq_to_target dicts.

init_seq_order(self, epoch=None, seq_list=None)[source]
Parameters:
  • epoch (int|None) – current epoch id
  • seq_list (list[str]|None) – In case we want to set a predefined order.
is_data_sparse(self, key)[source]
Parameters:key (str) – e.g. “features_0” or “orth_features_0” or “words_0”
Returns:whether the data is sparse
Return type:bool
get_data_dim(self, key)[source]
Parameters:key (str) – e.g. “features_0”, “features_1”, “classes_0”, etc.
Returns:number of classes, no matter if sparse or not
Return type:int
class HDFDataset.SimpleHDFWriter(filename, dim, labels=None, ndim=None, extra_type=None, swmr=False)[source]

Intended as a simple interface to dump data on-the-fly into an HDF file, which can later be read by HDFDataset.

Note that we dump to a temp file first, and only at close() do we move it over to the real destination.

Parameters:
  • filename (str) – Create file, truncate if exists
  • dim (int|None) –
  • ndim (int) – counted without batch
  • labels (list[str]|None) –
  • extra_type (dict[str,(int,int,str)]|None) – key -> (dim,ndim,dtype)
  • swmr (bool) – see http://docs.h5py.org/en/stable/swmr.html
insert_batch(self, inputs, seq_len, seq_tag, extra=None)[source]
Parameters:
  • inputs (numpy.ndarray) – shape=(n_batch,time,data) (or (n_batch,time), or (n_batch,time1,time2), …)
  • seq_len (list[int]|dict[int,list[int]|numpy.ndarray]) – sequence lengths (per axis, excluding batch axis)
  • seq_tag (list[str|bytes]) – sequence tags of length n_batch
  • extra (dict[str,numpy.ndarray]|None) – one or multiple possible target data arrays. The key can be “classes” or anything else; the dtype and dim are inferred automatically from the Numpy array. If there are multiple items, their seq lengths must currently be the same. Each array must be batch-major, with the time axis following the batch axis, and then the feature axis.
close(self)[source]

Closes the file.
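A minimal dumping sketch (hypothetical; the output path and the feature dimension 40 are placeholders):

    import numpy
    from HDFDataset import SimpleHDFWriter

    writer = SimpleHDFWriter(filename="out.hdf", dim=40)

    n_batch, max_time, dim = 2, 7, 40
    inputs = numpy.random.randn(n_batch, max_time, dim).astype("float32")
    writer.insert_batch(
        inputs=inputs,
        seq_len=[7, 5],              # length of each sequence, excluding the batch axis
        seq_tag=["seq-0", "seq-1"],  # one tag per sequence in the batch
    )
    writer.close()                   # moves the temp file to the real destination

The resulting file can then be read e.g. via HDFDataset(files=["out.hdf"]).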

class HDFDataset.HDFDatasetWriter(filename)[source]

Similar to SimpleHDFWriter, but mostly intended to copy an existing dataset; see dump_from_dataset(). The resulting HDF file can later be read by HDFDataset.

Parameters:filename (str) – for the HDF to write
close(self)[source]

Close the HDF file.

dump_from_dataset(self, dataset, epoch=1, start_seq=0, end_seq=inf, use_progress_bar=True)[source]
Parameters:
  • dataset (Dataset) – can be any dataset implemented as a child of Dataset
  • epoch (int) – for dataset
  • start_seq (int) –
  • end_seq (int|float) –
  • use_progress_bar (bool) –
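A minimal copy sketch (hypothetical; both file paths are placeholders):

    from HDFDataset import HDFDataset, HDFDatasetWriter

    source = HDFDataset(files=["source.hdf"])
    source.initialize()

    writer = HDFDatasetWriter("copy.hdf")
    writer.dump_from_dataset(source, epoch=1, use_progress_bar=False)
    writer.close()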