returnn.datasets.audio

Datasets dealing with audio

class returnn.datasets.audio.OggZipDataset(path, audio, targets, targets_post_process=None, use_cache_manager=False, segment_file=None, zip_audio_files_have_name_as_prefix=True, fixed_random_subset=None, fixed_random_subset_seed=42, epoch_wise_filter=None, **kwargs)

Generic dataset which reads a zip file containing one Ogg file per sequence and a text document with the metadata. The feature extraction settings are determined by the audio option, which is passed to ExtractAudioFeatures. Wav files are also supported, and other file formats readable by the ‘soundfile’ library might work as well (not tested). Setting audio to None gives a text-only dataset, and setting targets to None gives an audio-only dataset. The zip file must contain:

  • a .txt file with the same name as the zip file, containing a Python list of dictionaries

  • a subfolder with the same name as the zip file, containing the audio files

The .txt file must contain a list of dicts with the following structure:

[{'text': 'some utterance text', 'duration': 2.3, 'file': 'sequence0.wav'},
 ...]

Each dict can optionally also have the entry 'seq_name': 'arbitrary_sequence_name'. If seq_name is not included, the seq_tag will be the name of the file. duration is mandatory, as this information is needed for the sequence sorting. However, it does not have to match the real duration in any way.
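As a rough illustration of the layout described above, the following sketch writes such a zip file using only the Python standard library. The corpus name my_corpus, the file names, the helper write_corpus_zip, and the empty placeholder audio bytes are all made up for this example; a real corpus would contain actual Ogg or Wav data. The sketch also assumes that "same name as the zipfile" means the base name without the .zip extension.

```python
import ast
import os
import tempfile
import zipfile


def write_corpus_zip(zip_path, entries, audio_bytes_by_file):
    """Write <name>.txt (a Python list of dicts) and <name>/<audio file> into a zip."""
    # Assumption: <name> is the zip file's base name without the extension.
    name = os.path.splitext(os.path.basename(zip_path))[0]
    with zipfile.ZipFile(zip_path, "w") as zf:
        # The metadata is stored as the repr of a plain Python list of dicts.
        zf.writestr(name + ".txt", repr(entries))
        for file_name, data in audio_bytes_by_file.items():
            zf.writestr(name + "/" + file_name, data)


entries = [
    {"text": "some utterance text", "duration": 2.3, "file": "sequence0.wav"},
    # 'seq_name' is optional; without it the file name serves as the seq tag.
    {"text": "another utterance", "duration": 1.1, "file": "sequence1.wav",
     "seq_name": "my-seq-1"},
]
# Placeholder bytes only, not valid audio data.
audio = {entry["file"]: b"" for entry in entries}

zip_path = os.path.join(tempfile.mkdtemp(), "my_corpus.zip")
write_corpus_zip(zip_path, entries, audio)

# Read the archive back to check the layout and the metadata.
with zipfile.ZipFile(zip_path) as zf:
    names = sorted(zf.namelist())
    loaded = ast.literal_eval(zf.read("my_corpus.txt").decode("utf8"))
```

Reading the metadata back with ast.literal_eval is just a convenient way to verify the file here; it is not meant to describe how the dataset itself parses it.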

Parameters:
  • path (str|list[str]) – filename(s) of the zip file(s)

  • audio (dict[str]|None) – options for ExtractAudioFeatures. Use {} for the defaults. None disables audio.

  • targets (Vocabulary|dict[str]|None) – options for Vocabulary.create_vocab() (e.g. BytePairEncoding)

  • targets_post_process (str|list[str]|((str)->str)|None) – get_post_processor_function(), applied on orth

  • use_cache_manager (bool) – uses returnn.util.basic.cf()

  • segment_file (str|None) – .txt or .gz text file containing sequence tags that will be used as whitelist. Note: This is somewhat deprecated, as we also support seq_list_filter_file (via the base class), which does the same but more universally.

  • zip_audio_files_have_name_as_prefix (bool)

  • fixed_random_subset (float|int|None) – Value in [0,1] to specify the fraction, or integer >=1 which specifies number of seqs. If given, will use this random subset. This will be applied initially at loading time, i.e. not dependent on the epoch. It uses the fixed fixed_random_subset_seed as seed, i.e. it’s deterministic.

  • fixed_random_subset_seed (int) – Seed for drawing the fixed random subset, default 42

  • epoch_wise_filter (dict|None) – see init_seq_order
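
A minimal sketch of how the parameters above might appear in a RETURNN config, assuming the usual convention that a dataset is described by a dict whose "class" entry names the dataset class. The path is a placeholder, and only options documented above are used.

```python
# Audio-only dataset with default feature extraction and a fixed 10% subset.
# "/path/to/my_corpus.zip" is a placeholder path for this example.
audio_only_dataset = {
    "class": "OggZipDataset",
    "path": "/path/to/my_corpus.zip",
    "audio": {},  # {} selects the ExtractAudioFeatures defaults
    "targets": None,  # None disables the targets (audio-only mode)
    "fixed_random_subset": 0.1,  # fraction in [0,1]; an int >= 1 is a seq count
    "fixed_random_subset_seed": 42,  # same seed, same subset on every run
}
```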

finish_epoch(*, free_resources: bool = False)

finish epoch

init_seq_order(epoch=None, seq_list=None, seq_order=None)

If random_shuffle_epoch1, for epoch 1 with “random” ordering, we leave the given order as is. Otherwise, this is mostly the default behavior.

Parameters:
  • epoch (int|None)

  • seq_list (list[str]|None) – List of sequence tags, to set a predefined order.

  • seq_order (list[int]|None) – List of corpus sequence indices, to set a predefined order.

Return type:

bool

Returns:

whether the order changed (True is always safe to return)
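
The two ways of setting a predefined order differ only in how sequences are identified: seq_list uses sequence tags, seq_order uses corpus sequence indices. The sketch below converts one form into the other. tags_to_seq_order is a hypothetical helper for illustration and not part of RETURNN, and the tags are made up.

```python
def tags_to_seq_order(corpus_tags, seq_list):
    """Map a list of sequence tags to the corresponding corpus sequence indices."""
    index_by_tag = {tag: i for i, tag in enumerate(corpus_tags)}
    return [index_by_tag[tag] for tag in seq_list]


# Tags in corpus order (made-up values).
corpus_tags = ["sequence0.wav", "sequence1.wav", "sequence2.wav"]
seq_order = tags_to_seq_order(corpus_tags, ["sequence2.wav", "sequence0.wav"])
```

With a loaded dataset, corpus_tags would presumably come from get_all_tags(), assuming it returns the tags in corpus order.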

supports_seq_order_sorting() → bool

Returns whether this dataset supports sorting of the sequence order.

get_current_seq_order()
Return type:

list[int]

have_corpus_seq_idx()
Return type:

bool

get_corpus_seq_idx(seq_idx: int) → int
Parameters:

seq_idx

get_tag(seq_idx)
Parameters:

seq_idx (int)

Return type:

str

get_all_tags()
Return type:

list[str]

get_total_num_seqs(*, fast: bool = False) → int
Return type:

int

get_data_dtype(key: str) → str
Returns:

dtype of data entry with key

get_data_keys() → List[str]
Returns:

available data keys

get_data_shape(key: str)
Returns:

get_data(*, key).shape[1:], i.e. num-frames excluded

Return type:

list[int]

is_data_sparse(key: str) → bool
Returns:

whether data entry with key is sparse

have_get_corpus_seq() → bool
Returns:

whether this dataset supports get_corpus_seq()

get_corpus_seq(corpus_seq_idx: int) → DatasetSeq
Parameters:

corpus_seq_idx

Returns:

seq