Audio Datasets

Extern Sprint Dataset

class SprintDataset.ExternSprintDataset(sprintTrainerExecPath, sprintConfigStr, partitionEpoch=None, **kwargs)

Bases: SprintDatasetBase

This is a Dataset which you can use directly in RETURNN. You can use it to get any type of data from Sprint (the RWTH ASR toolkit), e.g. you can use Sprint to do feature extraction and preprocessing.

This class is like SprintDatasetBase, except that it starts an external Sprint instance itself, which forwards the data to RETURNN over a pipe. The Sprint subprocess uses SprintExternInterface to communicate with RETURNN.

Parameters:

sprintTrainerExecPath (str|list[str])
sprintConfigStr (str | list[str] | ()->str | list[()->str] | ()->list[str] | ()->list[()->str]) – via eval_shell_str
partitionEpoch (int|None) – deprecated; use partition_epoch instead
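As a rough illustration of the parameters above, a dataset entry in a RETURNN config might look like the following sketch. Both paths and the `--config=...` argument are hypothetical placeholders, not values taken from this documentation, and the `"class"` key is assumed to follow the usual RETURNN convention of naming the dataset class in the dict.

```python
# Minimal sketch of a dataset dict for ExternSprintDataset.
# All paths below are hypothetical placeholders; adapt them to your setup.
train = {
    "class": "ExternSprintDataset",
    # Sprint executable that RETURNN starts as a subprocess (placeholder path).
    "sprintTrainerExecPath": "/path/to/sprint/nn-trainer",
    # Command-line options passed to Sprint (placeholder config file).
    "sprintConfigStr": "--config=/path/to/sprint.train.config",
    # Preferred over the deprecated partitionEpoch argument.
    "partition_epoch": 2,
}
```

Since `sprintConfigStr` also accepts a list of strings or callables, longer option sets can be split into several entries instead of one long string.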

Ogg Zip Dataset

class GeneratingDataset.OggZipDataset(path: str | Sequence[str], *, content_name: str | Sequence[str] | None = None, resolve_symlink_for_name: bool = False, audio: Dict[str, Any] | None, targets: Vocabulary | Dict[str, Any] | None, targets_post_process=None, use_cache_manager: bool = False, segment_file: str | None = None, zip_audio_files_have_name_as_prefix: bool = True, fixed_random_subset: float | int | None = None, fixed_random_subset_seed: int = 42, epoch_wise_filter: dict | None = None, **kwargs)

Bases: CachedDataset2

Generic dataset which reads a Zip file containing Ogg files for each sequence and a text document. The feature extraction settings are determined by the audio option, which is passed to ExtractAudioFeatures. Wav files are also supported, and other file formats readable by the ‘soundfile’ library might work as well (not tested). Setting audio or targets to None lets the dataset be used in text-only or audio-only mode, respectively. The zip file contains:

a .txt file with the same name as the zipfile, containing a Python list of dictionaries

a subfolder with the same name as the zipfile, containing the audio files

The .txt file must contain a list of dicts with the following structure:

[{'text': 'some utterance text', 'duration': 2.3, 'file': 'sequence0.wav'},
 ...]

Each dict can optionally also have the entry 'seq_name': 'arbitrary_sequence_name'. If seq_name is not included, the seq_tag will be the name of the file. duration is mandatory because it is needed for sequence sorting; however, it does not have to match the real duration in any way.
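The layout described above can be produced with the Python standard library alone. The following is a minimal sketch, not part of RETURNN: the helper `build_demo_zip` is hypothetical, it writes silent 16 kHz mono Wav files as stand-ins for real recordings, and it assumes the default zip_audio_files_have_name_as_prefix=True, i.e. that the audio files sit in a subfolder named after the zip file.

```python
import io
import os
import wave
import zipfile


def build_demo_zip(zip_path, durations, sample_rate=16000):
    """Write a zip with <name>.txt and <name>/sequenceN.wav (hypothetical helper)."""
    # Internal name: zip filename without directory and extension.
    name = os.path.splitext(os.path.basename(zip_path))[0]
    entries = []
    with zipfile.ZipFile(zip_path, "w") as zf:
        for i, duration in enumerate(durations):
            file_name = "sequence%d.wav" % i
            # Create a silent 16-bit mono Wav file in memory.
            buf = io.BytesIO()
            with wave.open(buf, "wb") as wav:
                wav.setnchannels(1)
                wav.setsampwidth(2)
                wav.setframerate(sample_rate)
                wav.writeframes(b"\x00\x00" * int(duration * sample_rate))
            # Audio files go into the subfolder named like the zip file.
            zf.writestr("%s/%s" % (name, file_name), buf.getvalue())
            entries.append(
                {"text": "utterance %d" % i, "duration": duration, "file": file_name}
            )
        # The .txt file holds the Python literal of the list of dicts.
        zf.writestr(name + ".txt", repr(entries))
    return name
```

Calling `build_demo_zip("demo.zip", [0.5, 1.0])` would create `demo.txt`, `demo/sequence0.wav` and `demo/sequence1.wav` inside the archive. Writing the metadata with `repr` keeps it readable as a Python literal, which matches the "python list of dictionaries" wording above.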

Parameters:

path – filename of the zip file, or a sequence of such filenames
content_name – internal name of the dataset, which is used for filenames inside the ZIP. If not given, the internal name is determined as os.path.splitext(os.path.basename(path))[0].
resolve_symlink_for_name – The internal name of the dataset, which is used for filenames inside the ZIP, is determined as os.path.splitext(os.path.basename(path))[0] (if content_name is not given). If this is True, we will resolve symlinks for the name first.
audio – options for ExtractAudioFeatures. Use {} for the defaults. None disables audio.
targets – options for Vocabulary.create_vocab() (e.g. BytePairEncoding)
targets_post_process (str|list[str]|((str)->str)|None) – get_post_processor_function(), applied on orth
use_cache_manager – uses returnn.util.basic.cf()
segment_file – .txt or .gz text file containing sequence tags that will be used as whitelist. Note: This is somewhat deprecated, as we also support seq_list_filter_file (via the base class), which does the same but more universally.
zip_audio_files_have_name_as_prefix
fixed_random_subset – Value in [0,1] to specify the fraction, or integer >=1 which specifies the number of seqs. If given, this random subset is used. It is applied once at loading time, i.e. it does not depend on the epoch. It uses fixed_random_subset_seed as the seed, i.e. it is deterministic.
fixed_random_subset_seed – Seed for drawing the fixed random subset, default 42
epoch_wise_filter – see init_seq_order
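To show how these parameters fit together, here is a minimal sketch of a dataset dict in audio-only mode. The zip path is a hypothetical placeholder, and the `"class"` key is assumed to follow the usual RETURNN convention for selecting the dataset class. The last line reproduces the internal-name rule quoted for content_name.

```python
import os

zip_path = "/path/to/dev.zip"  # hypothetical placeholder path

dev = {
    "class": "OggZipDataset",
    "path": zip_path,
    "audio": {},       # {} selects the default ExtractAudioFeatures options
    "targets": None,   # None disables targets, i.e. audio-only mode
    # Use a fixed 10% of the sequences, chosen once at loading time.
    "fixed_random_subset": 0.1,
    "fixed_random_subset_seed": 42,
}

# Internal name used for filenames inside the zip when content_name is not given.
internal_name = os.path.splitext(os.path.basename(zip_path))[0]
```

Because the subset is drawn with a fixed seed at loading time, the same sequences are selected on every run, which is convenient for a small, stable evaluation set.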