Audio Datasets

Extern Sprint Dataset

class SprintDataset.ExternSprintDataset(sprintTrainerExecPath, sprintConfigStr, partitionEpoch=None, **kwargs)

Bases: returnn.datasets.sprint.SprintDatasetBase

This is a Dataset which you can use directly in RETURNN. You can use it to get any type of data from Sprint (RWTH ASR toolkit), e.g. you can use Sprint to do feature extraction and preprocessing.

This class is like SprintDatasetBase, except that we will start an external Sprint instance ourselves which will forward the data to us over a pipe. The Sprint subprocess will use SprintExternInterface to communicate with us.

  • sprintTrainerExecPath (str|list[str]) –
  • sprintConfigStr (str|list[str]|()->str|list[()->str]|()->list[str]|()->list[()->str]) – via eval_shell_str
  • partitionEpoch (int|None) – deprecated; use partition_epoch instead
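
As a rough illustration, such a dataset might be declared in a RETURNN config as a plain dict. This is a minimal sketch: the "class" key convention, the executable path, and the Sprint options string are assumptions made for illustration and are not taken from this page.

```python
# Hypothetical paths and Sprint options; adjust them to your own setup.
train = {
    "class": "ExternSprintDataset",
    # Assumed location of the Sprint trainer executable.
    "sprintTrainerExecPath": "/path/to/sprint/nn-trainer",
    # Assumed Sprint command-line options, given as a single string.
    "sprintConfigStr": "--config=/path/to/sprint/training.config",
    # partition_epoch replaces the deprecated partitionEpoch argument.
    "partition_epoch": 1,
}
```

The deprecated partitionEpoch argument is deliberately left out in favour of partition_epoch.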

Ogg Zip Dataset

class GeneratingDataset.OggZipDataset(path, audio, targets, targets_post_process=None, use_cache_manager=False, segment_file=None, zip_audio_files_have_name_as_prefix=True, fixed_random_seed=None, fixed_random_subset=None, epoch_wise_filter=None, **kwargs)

Bases: returnn.datasets.cached2.CachedDataset2

Generic dataset which reads a Zip file containing an Ogg file for each sequence and a text document. The feature extraction settings are determined by the audio option, which is passed to ExtractAudioFeatures. It also supports Wav files, and might support other file formats readable by the 'soundfile' library (not tested). By setting audio or targets to None, the dataset can be used in text-only or audio-only mode, respectively. The content of the zip file is:

  • a .txt file with the same name as the zipfile, containing a python list of dictionaries
  • a subfolder with the same name as the zipfile, containing the audio files

The .txt file must contain a list of dicts with the following structure:

[{'text': 'some utterance text', 'duration': 2.3, 'file': 'sequence0.wav'},
 ...]

Each dict can optionally also have the entry 'seq_name': 'arbitrary_sequence_name'. If seq_name is not included, the seq_tag will be the name of the file. duration is mandatory because it is needed for sequence sorting; however, it does not have to match the real duration.
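
The layout described above can be sketched with the Python standard library alone. This is only an illustration, not part of RETURNN: write_ogg_zip, corpus.zip, and the placeholder audio bytes are hypothetical, and it assumes that "the same name as the zipfile" means the base name without the .zip extension.

```python
import ast
import os
import tempfile
import zipfile


def write_ogg_zip(zip_path, entries, audio_blobs):
    """Write the metadata list and the audio payloads into one zip file.

    `audio_blobs` maps each entry's 'file' value to the raw bytes of that file.
    """
    # Assumption: the shared name is the zip base name without ".zip".
    name = os.path.splitext(os.path.basename(zip_path))[0]
    with zipfile.ZipFile(zip_path, "w") as zf:
        # The .txt file holds the Python list of dicts as a literal.
        zf.writestr(name + ".txt", repr(entries))
        # The audio files go into a subfolder with the same name.
        for entry in entries:
            zf.writestr(name + "/" + entry["file"], audio_blobs[entry["file"]])
    return name


entries = [
    {"text": "some utterance text", "duration": 2.3, "file": "sequence0.wav"},
    {"text": "another utterance", "duration": 1.7, "file": "sequence1.wav",
     "seq_name": "arbitrary_sequence_name"},
]
# Placeholder bytes stand in for real audio data in this sketch.
audio_blobs = {"sequence0.wav": b"\x00" * 16, "sequence1.wav": b"\x00" * 16}

tmp_dir = tempfile.mkdtemp()
zip_path = os.path.join(tmp_dir, "corpus.zip")
name = write_ogg_zip(zip_path, entries, audio_blobs)

# Read the metadata back to check that it round-trips as a list of dicts.
with zipfile.ZipFile(zip_path) as zf:
    names_in_zip = sorted(zf.namelist())
    loaded = ast.literal_eval(zf.read(name + ".txt").decode("utf8"))
```

Real audio files would replace the placeholder bytes; the second entry shows the optional seq_name key.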

  • path (str|list[str]) – filename(s) of the zip file(s)
  • audio (dict[str]|None) – options for ExtractAudioFeatures. use {} for default. None means to disable.
  • targets (dict[str]|None) – options for Vocabulary.create_vocab() (e.g. BytePairEncoding)
  • targets_post_process (str|list[str]|((str)->str)|None) – get_post_processor_function(), applied on orth
  • use_cache_manager (bool) – uses
  • segment_file (str|None) – .txt or .gz text file containing sequence tags that will be used as whitelist
  • zip_audio_files_have_name_as_prefix (bool) –
  • fixed_random_seed (int|None) – seed for the shuffling, e.g. for seq_ordering='random'. Otherwise the epoch will be used as the seed.
  • fixed_random_subset (float|int|None) – Value in [0,1] to specify the fraction, or integer >=1 which specifies number of seqs. If given, will use this random subset. This will be applied initially at loading time, i.e. not dependent on the epoch. It will use an internally hardcoded fixed random seed, i.e. it’s deterministic.
  • epoch_wise_filter (dict|None) – see init_seq_order
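
To show how these options fit together, here is a hedged sketch of a dataset dict for audio-only mode. The "class" key convention and all file paths are assumptions for illustration; only the option names come from the parameter list above.

```python
# Hypothetical file names; the option values are illustrative only.
train = {
    "class": "OggZipDataset",
    "path": "/path/to/corpus.zip",
    "audio": {},      # {} selects the default ExtractAudioFeatures options
    "targets": None,  # None disables targets, i.e. audio-only mode
    "segment_file": "/path/to/segments.txt",  # whitelist of sequence tags
    "fixed_random_subset": 0.1,  # fraction in [0,1]: keep 10% of the sequences
    "seq_ordering": "random",
    "fixed_random_seed": 1,  # shuffle with this seed instead of the epoch
}
```

Passing an integer >=1 as fixed_random_subset would instead select that many sequences.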