returnn.datasets.util.feature_extraction
ExtractAudioFeatures class and related helpers
- class returnn.datasets.util.feature_extraction.ExtractAudioFeatures(window_len=0.025, step_len=0.01, num_feature_filters=None, with_delta=False, norm_mean=None, norm_std_dev=None, features='mfcc', feature_options=None, random_permute=None, random_state=None, raw_ogg_opts=None, pre_process=None, post_process=None, sample_rate=None, num_channels=None, peak_normalization=True, preemphasis=None, join_frames=None)
Currently uses librosa to extract MFCC/log-mel features. (Possible alternatives: python_speech_features, talkbox.features.mfcc)
- Parameters:
window_len (float) – in seconds
step_len (float) – in seconds
num_feature_filters (int) –
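with_delta (bool|int) – number of delta orders appended to the base features; the feature dimension becomes (with_delta + 1) * num_feature_filters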
norm_mean (numpy.ndarray|str|int|float|None) – if str, will interpret as filename, or “per_seq”
norm_std_dev (numpy.ndarray|str|int|float|None) – if str, will interpret as filename, or “per_seq”
features (str|function) – “mfcc”, “log_mel_filterbank”, “log_log_mel_filterbank”, “raw”, “raw_ogg”
feature_options (dict[str]|None) – provide additional parameters for the feature function
random_permute (CollectionReadCheckCovered|dict[str]|bool|None) –
random_state (numpy.random.RandomState|None) –
raw_ogg_opts (dict[str]|None) –
pre_process (function|None) –
post_process (function|None) –
sample_rate (int|None) –
num_channels (int|None) – number of channels in audio
peak_normalization (bool) – set to False to disable the peak normalization for audio files
preemphasis (float|None) – set a preemphasis filter coefficient
join_frames (int|None) – concatenate multiple frames together to a superframe
- Returns:
float32 data of shape (audio_len // int(step_len * sample_rate), num_channels (optional), (with_delta + 1) * num_feature_filters)
- Return type:
numpy.ndarray
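The shape formula above can be checked by hand before running any extraction. The following is a minimal sketch in plain Python and does not call RETURNN. The helper name expected_feature_shape is hypothetical, audio_len is assumed to be counted in samples, and the value 40 for num_feature_filters is only an illustrative choice (the documented default is None).

```python
def expected_feature_shape(audio_len, sample_rate, num_feature_filters,
                           step_len=0.01, with_delta=0, num_channels=None):
    """Hypothetical helper: evaluate the documented output shape formula.

    audio_len is assumed to be the number of audio samples.
    """
    # Frame shift in samples, e.g. 0.01 s at 16 kHz gives 160 samples.
    step_samples = int(step_len * sample_rate)
    num_frames = audio_len // step_samples
    # Each delta order adds another block of num_feature_filters values.
    feature_dim = (int(with_delta) + 1) * num_feature_filters
    if num_channels is None:
        return (num_frames, feature_dim)
    # The channel axis is only present when num_channels is given.
    return (num_frames, num_channels, feature_dim)


# One second of 16 kHz audio, 40 filters (illustrative), first-order deltas.
shape_mono = expected_feature_shape(16000, 16000, 40, with_delta=1)
shape_stereo = expected_feature_shape(16000, 16000, 40, with_delta=1,
                                      num_channels=2)
print(shape_mono)    # (100, 80)
print(shape_stereo)  # (100, 2, 80)
```

With the default step_len of 0.01 s, one second of audio yields 100 frames regardless of window_len, since the formula depends only on the frame shift.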
- get_audio_features_from_raw_bytes(raw_bytes, seq_name=None)
- Parameters:
raw_bytes (io.BytesIO) –
seq_name (str|None) –
- Returns:
shape (time,feature_dim)
- Return type:
numpy.ndarray
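Since raw_bytes is documented as an io.BytesIO, audio that is already in memory has to be wrapped before it is passed in. The sketch below builds a short WAV file with the Python standard library and wraps it accordingly. The helper make_wav_bytes and the sequence name "demo-seq" are hypothetical, and it is an assumption that the audio backend in use can decode WAV content. The call into ExtractAudioFeatures is left commented out so the snippet runs without RETURNN installed.

```python
import io
import math
import struct
import wave


def make_wav_bytes(duration=0.5, sample_rate=16000, freq=440.0):
    """Hypothetical helper: mono 16-bit PCM sine tone as an in-memory WAV."""
    num_samples = int(duration * sample_rate)
    samples = [
        int(0.5 * 32767 * math.sin(2 * math.pi * freq * i / sample_rate))
        for i in range(num_samples)
    ]
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack("<%dh" % num_samples, *samples))
    buf.seek(0)  # rewind so a reader starts at the WAV header
    return buf


raw_bytes = make_wav_bytes()

# Sanity check: the buffer is a readable WAV with the expected length.
with wave.open(raw_bytes, "rb") as check:
    num_frames = check.getnframes()
raw_bytes.seek(0)  # rewind again before handing the buffer on

# Assumed usage with RETURNN installed (not executed here):
# extractor = ExtractAudioFeatures(
#     features="mfcc", num_feature_filters=40, sample_rate=16000)
# feats = extractor.get_audio_features_from_raw_bytes(
#     raw_bytes, seq_name="demo-seq")
# feats would then have shape (time, feature_dim) as documented above.
```

Rewinding the buffer matters here: a BytesIO left at its end position would look like an empty file to whatever decoder reads it next.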