returnn.datasets.util.feature_extraction

ExtractAudioFeatures class and related helpers

class returnn.datasets.util.feature_extraction.ExtractAudioFeatures(window_len=0.025, step_len=0.01, num_feature_filters=None, with_delta=False, norm_mean=None, norm_std_dev=None, features='mfcc', feature_options=None, random_permute=None, random_state=None, raw_ogg_opts=None, pre_process=None, post_process=None, sample_rate=None, num_channels=None, peak_normalization=True, preemphasis=None, join_frames=None)

Currently uses librosa to extract MFCC/log-mel features. (Alternatives: python_speech_features, talkbox.features.mfcc)

Parameters:
  • window_len (float) – in seconds

  • step_len (float) – in seconds

  • num_feature_filters (int)

  • with_delta (bool|int)

  • norm_mean (numpy.ndarray|str|int|float|None) – if str, interpreted either as a filename or as the special value “per_seq”

  • norm_std_dev (numpy.ndarray|str|int|float|None) – if str, interpreted either as a filename or as the special value “per_seq”

  • features (str|function) – “mfcc”, “log_mel_filterbank”, “log_log_mel_filterbank”, “raw”, “raw_ogg”

  • feature_options (dict[str]|None) – provide additional parameters for the feature function

  • random_permute (CollectionReadCheckCovered|dict[str]|bool|None)

  • random_state (numpy.random.RandomState|None)

  • raw_ogg_opts (dict[str]|None)

  • pre_process (function|None)

  • post_process (function|None)

  • sample_rate (int|None)

  • num_channels (int|None) – number of channels in audio

  • peak_normalization (bool) – set to False to disable the peak normalization for audio files

  • preemphasis (float|None) – set a preemphasis filter coefficient

  • join_frames (int|None) – concatenate multiple frames into a superframe

Returns:

float32 data of shape (audio_len // int(step_len * sample_rate), num_channels (optional), (with_delta + 1) * num_feature_filters)

Return type:

numpy.ndarray
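As a rough illustration of the shape formula above, the following sketch computes the expected output shape in plain Python. The helper `expected_feature_shape` is hypothetical and not part of RETURNN, and the default of 40 filters is an arbitrary choice for the example. The actual number of frames may differ slightly depending on how the underlying feature function pads the signal.

```python
def expected_feature_shape(audio_len, sample_rate, step_len=0.01,
                           num_feature_filters=40, with_delta=0,
                           num_channels=None):
    """Hypothetical helper: evaluate the documented shape formula.

    Not part of RETURNN; it only mirrors the expression in the docs.
    """
    step = int(step_len * sample_rate)  # hop size in samples
    num_frames = audio_len // step
    # with_delta may be a bool or an int, so convert it explicitly.
    feature_dim = (int(with_delta) + 1) * num_feature_filters
    if num_channels is None:
        return (num_frames, feature_dim)
    return (num_frames, num_channels, feature_dim)


# One second of 16 kHz audio with a 10 ms step gives 100 frames.
shape = expected_feature_shape(audio_len=16000, sample_rate=16000)
print(shape)
```

With `with_delta=2` the last dimension triples, since the base features and two orders of deltas are stacked along the feature axis according to the formula.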

get_audio_features_from_raw_bytes(raw_bytes, seq_name=None)
Parameters:
  • raw_bytes (io.BytesIO)

  • seq_name (str|None)

Returns:

shape (time,feature_dim)

Return type:

numpy.ndarray
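A minimal sketch of how an `io.BytesIO` input for this method might be prepared, using only the standard library to write a short sine tone as 16-bit PCM WAV. The helper names `make_wav_bytes` and `extract_from_bytes` are hypothetical. The sketch also assumes that the audio decoder used by RETURNN accepts WAV data, and the actual extraction call requires RETURNN with its audio dependencies installed, so it is wrapped in a function and not executed here.

```python
import io
import math
import struct
import wave


def make_wav_bytes(duration=0.5, sample_rate=16000, freq=440.0):
    """Return an io.BytesIO holding a mono 16-bit PCM WAV sine tone."""
    num_samples = int(duration * sample_rate)
    samples = [
        int(0.5 * 32767 * math.sin(2 * math.pi * freq * t / sample_rate))
        for t in range(num_samples)
    ]
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack("<%dh" % num_samples, *samples))
    buf.seek(0)  # rewind so a reader starts at the header
    return buf


def extract_from_bytes(raw_bytes):
    """Assumption: needs RETURNN and its audio dependencies at call time."""
    from returnn.datasets.util.feature_extraction import ExtractAudioFeatures

    extractor = ExtractAudioFeatures(features="mfcc", num_feature_filters=40)
    return extractor.get_audio_features_from_raw_bytes(raw_bytes, seq_name="demo")


wav_buf = make_wav_bytes()
```

Calling `extract_from_bytes(wav_buf)` in an environment with RETURNN available should then return an array of shape (time, feature_dim), as described above.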

get_audio_features(audio, sample_rate, seq_name=None)
Parameters:
  • audio (numpy.ndarray) – raw audio samples, shape (audio_len,)

  • sample_rate (int) – e.g. 22050

  • seq_name (str|None)

Returns:

array (time,dim), dim == self.get_feature_dimension()

Return type:

numpy.ndarray
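The constructor options preemphasis, “per_seq” normalization, and join_frames name common signal-processing steps. The NumPy sketch below shows what such steps typically look like; it is an illustration under stated assumptions, not RETURNN's actual implementation. The function names, the `eps` guard, and the zero-padding of an incomplete final superframe are all choices made for this example.

```python
import numpy as np


def preemphasis(audio, coeff=0.97):
    """Standard first-order filter: y[t] = x[t] - coeff * x[t-1]."""
    audio = np.asarray(audio, dtype="float32")
    # The first sample has no predecessor and is kept unchanged.
    return np.concatenate([audio[:1], audio[1:] - coeff * audio[:-1]])


def normalize_per_seq(features, eps=1e-8):
    """Normalize each feature dimension over the time axis of one sequence."""
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    return (features - mean) / (std + eps)  # eps avoids division by zero


def join_frames(features, n):
    """Concatenate every n consecutive frames into one superframe.

    Assumption: an incomplete final group is zero-padded.
    """
    time, dim = features.shape
    pad = (-time) % n
    if pad:
        padding = np.zeros((pad, dim), dtype=features.dtype)
        features = np.concatenate([features, padding])
    return features.reshape(-1, n * dim)


feats = np.arange(10, dtype="float32").reshape(5, 2)  # (time=5, dim=2)
joined = join_frames(feats, 2)  # (time=3, dim=4)
```

Joining frames reduces the time axis by roughly the factor `n` while multiplying the feature dimension by `n`.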

get_feature_dimension()
Return type:

int