HDFDataset

Provides HDFDataset.

class HDFDataset.HDFDataset(files=None, use_cache_manager=False, **kwargs)[source]
Parameters:
  • files (None|list[str]) –
  • use_cache_manager (bool) – uses Util.cf() for files
files = None[source]
Type:list[str]
file_seq_start = None[source]
Type:list[numpy.ndarray]
data_dtype = None[source]
Type:dict[str,str]
data_sparse = None[source]
Type:dict[str,bool]
add_file(self, filename)[source]
Parameters:filename (str) – path of the HDF file to add
Sets up the data structures self.file_start and self.file_seq_start.
Use load_seqs() to load the actual data.

get_data(self, seq_idx, key)[source]
Parameters:
  • seq_idx (int) – sorted seq idx
  • key (str) – data-key, e.g. “data” or “classes”
Return type:numpy.ndarray
Returns:features or targets, in format 2d (time,feature) (float)

get_input_data(self, sorted_seq_idx)[source]
Return type:numpy.ndarray
Returns:features, in format 2d (time,feature) (float)
get_targets(self, target, sorted_seq_idx)[source]
Parameters:target (str) – data key
Return type:numpy.ndarray
Returns:targets, in format 1d (time) (int: idx of output-feature)
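These two are convenience accessors around get_data() for the common input/target keys; continuing the sketch above::

    x = dataset.get_input_data(0)          # same as dataset.get_data(0, "data")
    y = dataset.get_targets("classes", 0)  # same as dataset.get_data(0, "classes")
    assert x.ndim == 2 and y.ndim == 1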
get_tag(self, sorted_seq_idx)[source]
Parameters:sorted_seq_idx (int) –
Return type:str
get_all_tags(self)[source]
Returns:list of all seq tags, of the whole dataset, without partition epoch. Note that this is not possible with all datasets.
Return type:list[str]
get_total_num_seqs(self)[source]
Returns:total number of seqs, without partition epoch. Should be the same as len(self.get_all_tags()). Note that this is not possible with all datasets.
Return type:int
is_data_sparse(self, key)[source]
Parameters:key (str) – e.g. “data” or “classes”
Returns:whether the data is sparse
Return type:bool
get_data_dtype(self, key)[source]
Parameters:key (str) – e.g. “data” or “classes”
Returns:dtype as str, e.g. “int32” or “float32”
Return type:str
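The metadata accessors above can be queried without loading any sequence data; continuing the sketch::

    assert len(dataset.get_all_tags()) == dataset.get_total_num_seqs()
    for key in ["data", "classes"]:
        print(key, dataset.get_data_dtype(key), dataset.is_data_sparse(key))
    # typically e.g.: data float32 False / classes int32 True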
len_info(self)[source]
Return type:str
Returns:a string to present to the user as information about our length. Depending on the implementation, this can be more or less detailed.

class HDFDataset.StreamParser(seq_names, stream)[source]
get_data(self, seq_name)[source]
get_seq_length(self, seq_name)[source]
get_dtype(self)[source]
class HDFDataset.FeatureSequenceStreamParser(*args, **kwargs)[source]
get_data(self, seq_name)[source]
get_seq_length(self, seq_name)[source]
class HDFDataset.SparseStreamParser(*args, **kwargs)[source]
get_data(self, seq_name)[source]
get_seq_length(self, seq_name)[source]
class HDFDataset.SegmentAlignmentStreamParser(*args, **kwargs)[source]
get_data(self, seq_name)[source]
get_seq_length(self, seq_name)[source]
class HDFDataset.NextGenHDFDataset(input_stream_name, files=None, **kwargs)[source]
Parameters:
  • input_stream_name (str) –
  • files (None|list[str]) –
parsers = {'feature_sequence': <class 'HDFDataset.FeatureSequenceStreamParser'>, 'segment_alignment': <class 'HDFDataset.SegmentAlignmentStreamParser'>, 'sparse': <class 'HDFDataset.SparseStreamParser'>}[source]
add_file(self, path)[source]
initialize(self)[source]

Does the main initialization before it can be used. This needs to be called before self.load_seqs() can be used.

init_seq_order(self, epoch=None, seq_list=None)[source]
Parameters:seq_list (list[str]|None) – In case we want to set a predefined order.
get_data_dtype(self, key)[source]
Parameters:key (str) –
Return type:str
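In a RETURNN config, this dataset would typically be set up via a dict like the following sketch (stream name and file path are placeholders)::

    train = {
        "class": "NextGenHDFDataset",
        "input_stream_name": "features",
        "files": ["train.hdf"],
    }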
class HDFDataset.SiameseHDFDataset(input_stream_name, seq_label_stream='words', class_distribution=None, files=None, *args, **kwargs)[source]

The SiameseHDFDataset class allows sequence sampling for weakly-supervised training. It accepts data in the format of NextGenHDFDataset and samples sequence triplets before each epoch. A triplet is a tuple of the form (anchor seq, random seq with the same label, random seq with a different label). Here we assume that each dataset in the input .hdf has a single label.

In the config, the streams can be accessed via e.g. [“data:features_0”], [“data:features_1”], [“data:features_2”]. The split names depend on the stream names in the input data, e.g. “features”, “data”, “classes”, etc. The method _collect_single_seq(self, seq_idx) returns a DatasetSeq with an extended dictionary of targets: “data:features_0” stands for the features of the anchor sequences from the input data (in NextGenHDFDataset this would correspond to “data:features” or “data”); “data:features_1” denotes the pair of “data:features_0”, i.e. for each anchor sequence a randomly sampled sequence with the same label; “data:features_2” denotes the third element of the triplet, i.e. for each anchor sequence a randomly sampled sequence with a different label. Targets are split into separate streams in the same way, e.g. “data:classes_0”, “data:classes_1”, “data:classes_2”.

SiameseHDFDataset also supports non-uniform sampling and accepts a path to an .npz matrix. Each row of this matrix holds, for one class, the probabilities of the other classes being sampled. This probability distribution might reflect class similarities.

This dataset might be useful for metric learning, where we want to learn representations of input sequences such that sequences belonging to the same class are close together, while sequences with different labels are far apart.

Parameters:
  • input_stream_name (str) – name of a feature stream
  • seq_label_stream (str) – name of a stream with labels
  • class_distribution (str) – path to .npz file of size n x n (n is a number of classes), where each line i contains probs of other classes to be picked in triplets when sampling a pair for element from class i
  • files – list of paths to .hdf files
  • args – dict[str]
  • kwargs – dict[str]
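A hypothetical config entry, as a sketch (stream names and file paths are placeholders)::

    train = {
        "class": "SiameseHDFDataset",
        "input_stream_name": "features",
        "seq_label_stream": "words",
        "class_distribution": "class-probs.npz",  # optional, for non-uniform sampling
        "files": ["train.hdf"],
    }
    # The network then reads the triplet streams, e.g. "data:features_0"
    # (anchor), "data:features_1" (same label), "data:features_2"
    # (different label), and likewise "data:classes_0" etc.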
parsers = {'feature_sequence': <class 'HDFDataset.FeatureSequenceStreamParser'>, 'segment_alignment': <class 'HDFDataset.SegmentAlignmentStreamParser'>, 'sparse': <class 'HDFDataset.SparseStreamParser'>}[source]
add_file(self, path)[source]

Registers input files and sequences.
Parameters:path (str) – path to a single .hdf file

initialize(self)[source]

Initializes the target_to_seqs and seq_to_target dicts.

init_seq_order(self, epoch=None, seq_list=None)[source]
Parameters:
  • epoch (int|None) – current epoch id
  • seq_list (list[str]|None) – In case we want to set a predefined order.

is_data_sparse(self, key)[source]
Parameters:key (str) – e.g. “features_0” or “orth_features_0” or “words_0”
Returns:whether the data is sparse
Return type:bool
get_data_dim(self, key)[source]
Parameters:key (str) – e.g. “features_0”, “features_1”, “classes_0”, etc.
Returns:number of classes, no matter if sparse or not
Return type:int
class HDFDataset.SimpleHDFWriter(filename, dim, labels=None, ndim=None, swmr=False)[source]
Parameters:
  • filename (str) – file to create (truncated if it exists)
  • dim (int|None) – feature dimension
  • labels (list[str]|None) –
  • ndim (int|None) – number of axes, counted without batch axis
  • swmr (bool) – whether to use h5py single-writer/multiple-reader mode
insert_batch(self, inputs, seq_len, seq_tag, extra=None)[source]
Parameters:
  • inputs (numpy.ndarray) – shape=(n_batch,time,data) (or (n_batch,time), or (n_batch,time1,time2), …)
  • seq_len (list[int]|dict[int,list[int]|numpy.ndarray]) – sequence lengths (per axis, excluding batch axis)
  • seq_tag (list[str|bytes]) – sequence tags of length n_batch
  • extra (dict[str,numpy.ndarray]|None) –
close(self)[source]
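A minimal write sketch (the filename is a placeholder; shapes follow the insert_batch() parameters above)::

    import numpy

    from HDFDataset import SimpleHDFWriter

    writer = SimpleHDFWriter(filename="out.hdf", dim=5)
    inputs = numpy.random.rand(2, 7, 5).astype("float32")  # (n_batch, time, dim)
    writer.insert_batch(inputs=inputs, seq_len=[7, 4], seq_tag=["seq-0", "seq-1"])
    writer.close()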
class HDFDataset.HDFDatasetWriter(filename)[source]
Parameters:filename (str) – for the HDF to write
close(self)[source]
dump_from_dataset(self, dataset, epoch=1, start_seq=0, end_seq=inf, use_progress_bar=True)[source]
Parameters:
  • dataset (Dataset) – could be any dataset implemented as child of Dataset
  • epoch (int) – for dataset
  • start_seq (int) –
  • end_seq (int|float) –
  • use_progress_bar (bool) –
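A sketch of copying part of an existing dataset into a new HDF file (paths and the seq range are placeholders)::

    from HDFDataset import HDFDataset, HDFDatasetWriter

    dataset = HDFDataset(files=["train.hdf"])
    dataset.initialize()
    writer = HDFDatasetWriter(filename="train-subset.hdf")
    writer.dump_from_dataset(dataset, epoch=1, start_seq=0, end_seq=100)
    writer.close()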