SprintDataset

Implements the SprintDatasetBase and ExternSprintDataset classes, both Dataset subtypes. From the main RETURNN process, you probably want ExternSprintDataset rather than SprintDatasetBase.

class SprintDataset.SprintDatasetBase(target_maps=None, str_add_final_zero=False, input_stddev=1.0, orth_post_process=None, bpe=None, **kwargs)[source]

In Sprint, we use this object for multiple purposes:
  • Multiple epoch handling via SprintInterface.getSegmentList(). For this, we get the segment list from Sprint and use the Dataset shuffling method.
  • Fill in data which we get via SprintInterface.feedInput*(). Note that each such input doesn’t necessarily correspond to a single segment. This depends on which type of FeatureExtractor is used in Sprint. If we use the BufferedFeatureExtractor in utterance mode, we get one call for every segment, with segmentName as a parameter. Otherwise, we get batches of fixed size, which do not correspond to segments. In any case, we use this data as-is as a new seq. Because of that, we cannot know the number of seqs in advance, nor the total number of time frames, etc.

If you want to use this directly in RETURNN, see ExternSprintDataset.

Parameters:
  • target_maps (dict[str,str|dict]) – e.g. {“speaker”: “speaker_map.txt”}
  • str_add_final_zero (bool) – adds e.g. “orth0” with ‘\0’-ending
  • input_stddev (float) – if != 1, will divide the input “data” by that
  • orth_post_process (str|list[str]|None) – get_post_processor_function(), applied on orth
  • bpe (None|dict[str]) – if given, will be opts for BytePairEncoding
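
As a rough illustration of the input_stddev option, the sketch below divides a feature array by a fixed value when that value is not 1. The helper name scale_features is hypothetical and not part of SprintDataset; only the division itself comes from the parameter description above.

```python
import numpy as np


def scale_features(features, input_stddev=1.0):
    """Hypothetical helper: divide the input by input_stddev if it is not 1."""
    if input_stddev != 1.0:
        return features / input_stddev
    return features  # unchanged when input_stddev == 1


feats = np.array([[2.0, 4.0], [6.0, 8.0]], dtype="float32")
scaled = scale_features(feats, input_stddev=2.0)
```
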
SprintCachedSeqsMax = 200[source]
SprintCachedSeqsMin = 100[source]
useMultipleEpochs()[source]

Called via SprintInterface.getSegmentList().

setDimensions(inputDim, outputDim)[source]

Called via python_train.

initSprintEpoch(epoch)[source]

Called by SprintInterface.getSegmentList() when we start a new epoch. We must not call this via self.init_seq_order(), because Sprint will already have filled the cache before the CRNN train thread starts the epoch.

finalizeSprint()[source]

Called when SprintInterface.getSegmentList() ends.

init_seq_order(epoch=None, seq_list=None)[source]

Called by CRNN train thread when we enter a new epoch.

waitForCrnnEpoch(epoch)[source]

Called by SprintInterface.

is_cached(start, end)[source]
Parameters:
  • start (int) – like in load_seqs(), sorted seq idx
  • end (int) – like in load_seqs(), sorted seq idx
Returns:whether we have the full range (start,end) of sorted seq idx
Return type:bool

load_seqs(start, end)[source]

Load data sequences, such that self.get_data() & friends can return the data.

Parameters:
  • start (int) – start sorted seq idx, inclusive
  • end (int) – end sorted seq idx, exclusive
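
To make the range convention concrete, here is a minimal sketch of a cache check over sorted seq indices, with start inclusive and end exclusive. The class TinySeqCache is a hypothetical stand-in and not the RETURNN implementation.

```python
class TinySeqCache:
    """Hypothetical stand-in for a dataset-side cache of sequences."""

    def __init__(self):
        self.seqs = {}  # sorted seq idx -> data

    def add(self, seq_idx, data):
        self.seqs[seq_idx] = data

    def is_cached(self, start, end):
        # start is inclusive, end is exclusive, as in load_seqs().
        return all(i in self.seqs for i in range(start, end))


cache = TinySeqCache()
for idx in (0, 1, 2):
    cache.add(idx, "seq-%i" % idx)
```

With three sequences cached, is_cached(0, 3) holds, while is_cached(0, 4) does not because index 3 is missing.
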

addNewData(features, targets=None, segmentName=None)[source]

Adds a new seq. This is called via the Sprint main thread.

Parameters:
  • features (numpy.ndarray) – format (input-feature,time) (via Sprint)
  • targets (dict[str,numpy.ndarray|str]) – format (time) (idx of output-feature)
Returns:the sorted seq index
Return type:int
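
addNewData() receives features as (input-feature,time), whereas get_data() and get_input_data() return (time,feature). The sketch below shows the layout change this implies, assuming a plain transpose; how the class converts internally is not shown in this section.

```python
import numpy as np

# Sprint-style layout: one row per input feature, one column per time frame.
sprint_features = np.arange(6, dtype="float32").reshape(2, 3)  # (feature=2, time=3)

# Dataset-style layout: one row per time frame.
dataset_features = sprint_features.T  # (time=3, feature=2)
```
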

finishSprintEpoch(seen_all=True)[source]

Called by SprintInterface.getSegmentList(). At this point, Sprint asks for the next segment after we just finished an epoch, so any upcoming self.addNewData() call will contain data from a segment in the new epoch. We therefore finish the current epoch in Sprint.

get_num_timesteps()[source]
num_seqs[source]
have_seqs()[source]
Returns:whether num_seqs > 0
Return type:bool
is_less_than_num_seqs(n)[source]
Returns:whether n < num_seqs. In case num_seqs is not known in advance, it will wait until it knows that n is past the end or that we have the seq.
Return type:bool
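
The waiting behaviour can be sketched with a condition variable: a consumer blocks until either sequence n has arrived or the producer has signalled the end. GrowingSeqList and its method names other than is_less_than_num_seqs are hypothetical; this is not the SprintDatasetBase code.

```python
import threading


class GrowingSeqList:
    """Hypothetical sequence list whose final length is unknown until finished."""

    def __init__(self):
        self.cond = threading.Condition()
        self.num_added = 0
        self.finished = False

    def add_seq(self):
        with self.cond:
            self.num_added += 1
            self.cond.notify_all()

    def finish(self):
        with self.cond:
            self.finished = True
            self.cond.notify_all()

    def is_less_than_num_seqs(self, n):
        with self.cond:
            # Block until seq n exists or the producer has signalled the end.
            self.cond.wait_for(lambda: n < self.num_added or self.finished)
            return n < self.num_added


seqs = GrowingSeqList()
producer = threading.Thread(
    target=lambda: (seqs.add_seq(), seqs.add_seq(), seqs.finish()))
producer.start()
has_seq_1 = seqs.is_less_than_num_seqs(1)  # waits for the second seq
has_seq_5 = seqs.is_less_than_num_seqs(5)  # waits for the end signal
producer.join()
```
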

get_data_keys()[source]
get_target_list()[source]
set_complete_frac(frac)[source]
get_complete_frac(seq_idx)[source]
Returns:a fraction (float in [0,1], always > 0) of how far we have advanced for this seq in the dataset. This does not have to be exact. This is only for the user.
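
One simple way to produce such a fraction, assuming the total number of sequences is known, is sketched below. The function complete_frac is hypothetical; SprintDatasetBase generally cannot know the number of seqs in advance, so its own estimate may be derived differently.

```python
def complete_frac(seq_idx, num_seqs):
    """Hypothetical progress estimate in (0, 1]; assumes num_seqs is known and > 0."""
    return min(float(seq_idx + 1) / num_seqs, 1.0)
```

Counting the current sequence as done (seq_idx + 1) keeps the result strictly above 0 even for the first sequence.
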

get_seq_length(sorted_seq_idx)[source]
Return type:NumbersDict
get_data(seq_idx, key)[source]
Parameters:
  • seq_idx (int) – sorted seq idx
  • key (str) – data-key, e.g. “data” or “classes”
Returns:features or targets, format 2d (time,feature) (float)
Return type:numpy.ndarray

get_input_data(sorted_seq_idx)[source]
Returns:features, format 2d (time,feature) (float)
Return type:numpy.ndarray
get_targets(target, sorted_seq_idx)[source]
Returns:targets, format 1d (time) (int: idx of output-feature)
Return type:numpy.ndarray
get_ctc_targets(sorted_seq_idx)[source]
get_tag(sorted_seq_idx)[source]
Parameters:sorted_seq_idx (int) –
Return type:str
class SprintDataset.ExternSprintDataset(sprintTrainerExecPath, sprintConfigStr, partitionEpoch=1, **kwargs)[source]

This is a Dataset which you can use directly in RETURNN. You can use it to get any type of data from Sprint (RWTH ASR toolkit), e.g. you can use Sprint to do feature extraction and preprocessing.

This class is like SprintDatasetBase, except that we will start an external Sprint instance ourselves which will forward the data to us over a pipe. The Sprint subprocess will use SprintExternInterface to communicate with us.

Parameters:
  • sprintTrainerExecPath (str|list[str]) –
  • sprintConfigStr (str | list[str] | ()->str | list[()->str] | ()->list[str] | ()->list[()->str]) – via eval_shell_str
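
A hedged sketch of how this dataset might be selected from a RETURNN config follows. The dict keys mirror the constructor arguments above; the executable path and the Sprint option string are placeholders, not real values.

```python
# Hypothetical RETURNN config fragment; path and Sprint options are placeholders.
train = {
    "class": "ExternSprintDataset",
    "sprintTrainerExecPath": "/path/to/sprint/nn-trainer",
    "sprintConfigStr": "--config=/path/to/sprint/training.config",
    "partitionEpoch": 1,  # the default from the signature above
}
```
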
reader_thread_proc(child_pid, epoch)[source]
exit_handler()[source]
init_epoch(epoch=None, seq_list=None)[source]
init_seq_order(epoch=None, seq_list=None)[source]

Called by CRNN train thread when we enter a new epoch.

num_seqs[source]
class SprintDataset.SprintCacheDataset(data, **kwargs)[source]

Can directly read Sprint cache files (and bundle files). Supports both cached features and cached alignments. For alignments, you need to provide all options for the AllophoneLabeling class, such as allophone file, etc.

Parameters:data (dict[str,dict[str]]) – data-key -> dict with keys such as filename, see SprintCacheReader constructor
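
The data argument could look roughly like the following, with one entry per data-key and inner keys taken from the SprintCacheReader constructor arguments. The file paths are placeholders, and the AllophoneLabeling options are left empty because their names are not given in this section.

```python
# Hypothetical config fragment; file paths are placeholders.
train = {
    "class": "SprintCacheDataset",
    "data": {
        "data": {"filename": "/path/to/features.cache.bundle", "type": "feat"},
        "classes": {
            "filename": "/path/to/alignment.cache.bundle",
            "type": "align",
            # kwargs for AllophoneLabeling (allophone file etc.); option names
            # are not listed here, so the dict is intentionally left empty.
            "allophone_labeling": {},
        },
    },
}
```
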
class SprintCacheReader(data_key, filename, type=None, allophone_labeling=None)[source]
Parameters:
  • data_key (str) – e.g. “data” or “classes”
  • filename (str) – to Sprint cache archive
  • type (str|None) – “feat” or “align”
  • allophone_labeling (dict[str]) – kwargs for AllophoneLabeling
read(name)[source]
Parameters:name (str) – content-filename for sprint cache
Returns:numpy array of shape (time, [num_labels])
Return type:numpy.ndarray
init_seq_order(epoch=None, seq_list=None)[source]
Parameters:
  • epoch (int|None) –
  • seq_list (list[str]|None) – In case we want to set a predefined order.
Returns:whether the order changed (True is always safe to return)
Return type:bool

This is called when we start a new epoch, or at initialization. Call this when you reset the seq list.
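
The contract described here can be sketched as follows: a predefined seq_list takes precedence, otherwise a new order is derived for the epoch, and True is returned as the safe answer. ToySeqOrder and its epoch-seeded shuffle are assumptions for illustration, not the SprintCacheDataset implementation.

```python
import random


class ToySeqOrder:
    """Hypothetical sketch of the init_seq_order() contract."""

    def __init__(self, seq_names):
        self.all_names = list(seq_names)
        self.order = list(seq_names)

    def init_seq_order(self, epoch=None, seq_list=None):
        if seq_list is not None:
            new_order = list(seq_list)  # a predefined order takes precedence
        else:
            new_order = list(self.all_names)
            # Assumed strategy: deterministic shuffle seeded by the epoch.
            random.Random(epoch or 1).shuffle(new_order)
        self.order = new_order
        return True  # True is always safe to return


ds = ToySeqOrder(["seg-a", "seg-b", "seg-c"])
changed = ds.init_seq_order(epoch=1, seq_list=["seg-c", "seg-a"])
```
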

get_dataset_seq_for_name(name, seq_idx=-1)[source]
get_data_keys()[source]
Return type:list[str]
get_target_list()[source]
Return type:list[str]
get_tag(sorted_seq_idx)[source]
Return type:str
SprintDataset.demo()[source]