returnn.datasets.audio
Datasets dealing with audio
- class returnn.datasets.audio.OggZipDataset(path, audio, targets, targets_post_process=None, use_cache_manager=False, segment_file=None, zip_audio_files_have_name_as_prefix=True, fixed_random_subset=None, fixed_random_subset_seed=42, epoch_wise_filter=None, **kwargs)
Generic dataset which reads a Zip file containing Ogg files for each sequence and a text document. The feature extraction settings are determined by the audio option, which is passed to ExtractAudioFeatures. Wav files are also supported, and other file formats readable by the ‘soundfile’ library might work as well (not tested). By setting audio or targets to None, the dataset can be used in text-only or audio-only mode, respectively.
The content of the zip file is:
- a .txt file with the same name as the zipfile, containing a python list of dictionaries
- a subfolder with the same name as the zipfile, containing the audio files
The content of the .txt file must be a list of dicts, i.e. have the following structure:
[{'text': 'some utterance text', 'duration': 2.3, 'file': 'sequence0.wav'}, ...]
Each dict can optionally also have the entry 'seq_name': 'arbitrary_sequence_name'. If seq_name is not included, the seq_tag will be the name of the file. The duration entry is mandatory, as this information is needed for the sequence sorting; however, it does not have to match the real duration in any way.
- Parameters:
path (str|list[str]) – filename to zip
audio (dict[str]|None) – options for ExtractAudioFeatures. Use {} for default. None means to disable.
targets (Vocabulary|dict[str]|None) – options for Vocabulary.create_vocab() (e.g. BytePairEncoding)
targets_post_process (str|list[str]|((str)->str)|None) – get_post_processor_function(), applied on orth
use_cache_manager (bool) – uses returnn.util.basic.cf()
segment_file (str|None) – .txt or .gz text file containing sequence tags that will be used as whitelist. Note: This is somewhat deprecated, as we also support seq_list_filter_file (via the base class), which does the same but more universally.
zip_audio_files_have_name_as_prefix (bool)
fixed_random_subset (float|int|None) – Value in [0,1] to specify the fraction, or integer >=1 which specifies number of seqs. If given, will use this random subset. This will be applied initially at loading time, i.e. not dependent on the epoch. It uses the fixed fixed_random_subset_seed as seed, i.e. it’s deterministic.
fixed_random_subset_seed (int) – Seed for drawing the fixed random subset, default 42
epoch_wise_filter (dict|None) – see init_seq_order
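The zip layout and metadata format described above can be sketched with the standard library alone. This is a minimal illustration, not part of RETURNN: write_ogg_zip is a hypothetical helper, the file names and durations are made up, and the audio entries are empty placeholders that would need real Ogg or Wav data. Writing the metadata via repr() assumes the .txt file is parsed as a Python literal, and the "class" key in the options dict follows the usual RETURNN dataset-dict convention, which is also an assumption here.

```python
import ast
import os
import tempfile
import zipfile


def write_ogg_zip(zip_path, entries, audio_bytes):
    """Write a zip in the layout described above (hypothetical helper, not part of RETURNN)."""
    name = os.path.splitext(os.path.basename(zip_path))[0]
    with zipfile.ZipFile(zip_path, "w") as zf:
        # <name>.txt holds the Python list of dicts, written here as its repr.
        zf.writestr(name + ".txt", repr(entries))
        # <name>/<file> holds the audio data of each sequence.
        for entry in entries:
            zf.writestr(name + "/" + entry["file"], audio_bytes[entry["file"]])
    return name


entries = [
    # With seq_name given, it is used as the sequence tag.
    {"text": "hello world", "duration": 1.2, "file": "seq0.ogg", "seq_name": "utt-0"},
    # Without seq_name, the file name serves as the sequence tag.
    {"text": "second utterance", "duration": 2.3, "file": "seq1.ogg"},
]
# Placeholder bytes only; a real corpus needs valid Ogg (or Wav) data here.
audio_bytes = {"seq0.ogg": b"", "seq1.ogg": b""}

zip_path = os.path.join(tempfile.mkdtemp(), "corpus.zip")
name = write_ogg_zip(zip_path, entries, audio_bytes)

# Read the archive back to check the layout.
with zipfile.ZipFile(zip_path) as zf:
    members = sorted(zf.namelist())
    parsed = ast.literal_eval(zf.read(name + ".txt").decode("utf8"))

# Dataset options using only parameters documented above.
dataset_opts = {
    "class": "OggZipDataset",  # assumed RETURNN dataset-dict convention
    "path": zip_path,
    "audio": {},  # {} selects the default feature extraction options
    "targets": None,  # None disables targets (audio-only mode)
}
```

The archive then contains corpus.txt next to a corpus/ subfolder, matching the default zip_audio_files_have_name_as_prefix=True.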
- init_seq_order(epoch=None, seq_list=None, seq_order=None)
If random_shuffle_epoch1, for epoch 1 with “random” ordering, we leave the given order as is. Otherwise, this is mostly the default behavior.
- Parameters:
epoch (int|None)
seq_list (list[str]|None) – List of sequence tags, to set a predefined order.
seq_order (list[int]|None) – List of corpus sequence indices, to set a predefined order.
- Returns:
whether the order changed (True is always safe to return)
- Return type:
bool
- get_data_shape(key: str)
- Returns:
get_data(*, key).shape[1:], i.e. num-frames excluded
- Return type:
list[int]
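To make the "num-frames excluded" convention concrete, the following sketch applies the same shape[1:] slicing to plain tuples. The helper name and the example shapes (250 frames of 40-dimensional features, 17 target labels) are hypothetical and only illustrate the slicing, not the values a particular dataset would report.

```python
def shape_without_frames(full_shape):
    """Drop the leading num-frames axis, mirroring shape[1:] (illustrative helper)."""
    return list(full_shape[1:])


# Dense features with shape (num_frames, feature_dim): only the feature axis remains.
dense = shape_without_frames((250, 40))
# Sparse targets with shape (num_labels,): nothing remains after dropping the first axis.
sparse = shape_without_frames((17,))
```

Under these assumptions, dense features yield a one-element list and sparse targets an empty list.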
- have_get_corpus_seq() -> bool
- Returns:
whether this dataset supports
get_corpus_seq()
- get_corpus_seq(corpus_seq_idx: int) -> DatasetSeq
- Parameters:
corpus_seq_idx
- Returns:
seq
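The two methods above are naturally used together: check have_get_corpus_seq() first, then fetch sequences by corpus index. The sketch below shows that pattern against a small stand-in object so that it runs without RETURNN installed. Both _StandInDataset and fetch_corpus_seqs are hypothetical names, and a real dataset would return DatasetSeq objects rather than strings.

```python
class _StandInDataset:
    """Tiny stand-in exposing the two methods used below (not a RETURNN class)."""

    def __init__(self, seqs):
        self._seqs = seqs

    def have_get_corpus_seq(self):
        return True

    def get_corpus_seq(self, corpus_seq_idx):
        return self._seqs[corpus_seq_idx]


def fetch_corpus_seqs(dataset, corpus_seq_indices):
    """Fetch sequences by corpus index, failing early if the dataset lacks support."""
    if not dataset.have_get_corpus_seq():
        raise NotImplementedError("dataset does not support get_corpus_seq()")
    return [dataset.get_corpus_seq(idx) for idx in corpus_seq_indices]


# Corpus indices refer to the original corpus order, independent of any shuffling.
fetched = fetch_corpus_seqs(_StandInDataset(["seq-a", "seq-b", "seq-c"]), [2, 0])
```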