returnn.tf.layers.signal_processing
Defines multiple signal processing-related layers.
- class returnn.tf.layers.signal_processing.AlternatingRealToComplexLayer(**kwargs)[source]#
This layer converts a real-valued input tensor into a complex-valued output tensor. Features at even and odd indices are taken as the real and imaginary parts of one complex number, respectively.
- Parameters:
in_dim (Dim|None) –
out_shape (set[Dim|returnn.tf.util.data._MarkedDim]|tuple|list|None) –
dropout (float) – 0.0 means to apply no dropout. dropout will only be applied during training
dropout_noise_shape (dict[Dim|str|list[Dim|str]|tuple[Dim|str],int|str|None]|None) – see Data.get_bc_shape()
dropout_on_forward (bool) – apply dropout during inference
mask (str|None) – “dropout” or “unity” or None. this is obsolete and only here for historical reasons
- classmethod get_out_data_from_opts(name, sources, n_out=None, **kwargs)[source]#
- Parameters:
name (str) –
sources (list[LayerBase]) –
n_out (int|None|returnn.util.basic.NotSpecified) –
- Return type:
Data
- output_before_activation: Optional[OutputWithActivation][source]#
- search_choices: Optional[SearchChoices][source]#
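The even/odd convention described for AlternatingRealToComplexLayer can be illustrated with a minimal NumPy sketch. This is not the layer's TensorFlow implementation; the function name `alternating_real_to_complex` and the assumption that the feature axis is the last axis are hypothetical choices made for illustration.

```python
import numpy as np


def alternating_real_to_complex(x: np.ndarray) -> np.ndarray:
    """Treat even feature indices as real parts and odd ones as imaginary parts.

    Assumption: the feature axis is the last axis and has even size.
    """
    if x.shape[-1] % 2 != 0:
        raise ValueError("feature dimension must be even")
    return x[..., 0::2] + 1j * x[..., 1::2]


features = np.array([[1.0, 2.0, 3.0, 4.0]])  # shape (time=1, feature=4)
spectrum = alternating_real_to_complex(features)
print(spectrum.shape)  # the feature dimension is halved
```

Under this convention the output feature dimension is half the input feature dimension.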
- class returnn.tf.layers.signal_processing.BatchMedianPoolingLayer(pool_size=1, **kwargs)[source]#
This layer pools batch entries together by taking their median value. Thus the batch size is divided by pool_size. The stride is hard-coded to be equal to the pool size.
- Parameters:
pool_size (int) – size of the pool to take median of (is also used as stride size)
- classmethod get_out_data_from_opts(name, sources, pool_size, n_out=None, **kwargs)[source]#
- Parameters:
name (str) –
sources (list[LayerBase]) –
pool_size (int) –
n_out (int|None|returnn.util.basic.NotSpecified) –
- Return type:
Data
- output_before_activation: Optional[OutputWithActivation][source]#
- search_choices: Optional[SearchChoices][source]#
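As a rough illustration of BatchMedianPoolingLayer, the following NumPy sketch groups consecutive batch entries and takes their median. It assumes that the batch axis is axis 0, that consecutive entries form one pool, and that the batch size is divisible by pool_size; the function name `batch_median_pool` is hypothetical and the real layer may differ in these details.

```python
import numpy as np


def batch_median_pool(x: np.ndarray, pool_size: int) -> np.ndarray:
    """Median over groups of `pool_size` consecutive batch entries (stride == pool_size)."""
    batch = x.shape[0]
    if batch % pool_size != 0:
        raise ValueError("batch size must be divisible by pool_size in this sketch")
    # Reshape (batch, ...) to (batch // pool_size, pool_size, ...) and reduce the pool axis.
    grouped = x.reshape((batch // pool_size, pool_size) + x.shape[1:])
    return np.median(grouped, axis=1)


x = np.array([[1.0], [5.0], [3.0], [10.0], [20.0], [60.0]])  # (batch=6, feature=1)
pooled = batch_median_pool(x, pool_size=3)
print(pooled.shape)  # batch size divided by pool_size
```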
- class returnn.tf.layers.signal_processing.ComplexLinearProjectionLayer(nr_of_filters, clp_weights_init='glorot_uniform', **kwargs)[source]#
Complex linear projection layer. For the original idea, see: Variani, Ehsan, et al. “Complex linear projection (CLP): A discriminative approach to joint feature extraction and acoustic modeling.” (2016).
- Parameters:
nr_of_filters (int) –
clp_weights_init (str|dict[str]|float|numpy.ndarray) –
- classmethod get_out_data_from_opts(nr_of_filters, **kwargs)[source]#
- Parameters:
nr_of_filters (int) –
- Return type:
Data
- output_before_activation: Optional[OutputWithActivation][source]#
- search_choices: Optional[SearchChoices][source]#
- class returnn.tf.layers.signal_processing.ComplexToAlternatingRealLayer(**kwargs)[source]#
This layer converts a complex valued input tensor into a real valued output tensor. For this the even and odd parts of the output are considered the real and imaginary part of one complex number, respectively.
- Parameters:
in_dim (Dim|None) –
out_shape (set[Dim|returnn.tf.util.data._MarkedDim]|tuple|list|None) –
dropout (float) – 0.0 means to apply no dropout. dropout will only be applied during training
dropout_noise_shape (dict[Dim|str|list[Dim|str]|tuple[Dim|str],int|str|None]|None) – see Data.get_bc_shape()
dropout_on_forward (bool) – apply dropout during inference
mask (str|None) – “dropout” or “unity” or None. this is obsolete and only here for historical reasons
- output_before_activation: Optional[OutputWithActivation][source]#
- search_choices: Optional[SearchChoices][source]#
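The reverse mapping described for ComplexToAlternatingRealLayer can be sketched in NumPy as follows. The function name `complex_to_alternating_real` and the choice of the last axis as feature axis are assumptions for illustration only, not the layer's actual code.

```python
import numpy as np


def complex_to_alternating_real(z: np.ndarray) -> np.ndarray:
    """Write real parts to even output indices and imaginary parts to odd ones.

    Assumption: the feature axis is the last axis.
    """
    out = np.empty(z.shape[:-1] + (2 * z.shape[-1],), dtype=z.real.dtype)
    out[..., 0::2] = z.real
    out[..., 1::2] = z.imag
    return out


bins = np.array([[1 + 2j, 3 + 4j]])  # shape (time=1, feature=2)
flat = complex_to_alternating_real(bins)
print(flat.shape)  # the feature dimension is doubled
```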
- class returnn.tf.layers.signal_processing.MaskBasedGevBeamformingLayer(nr_of_channels=1, postfilter_id=0, qralgorithm_steps=None, output_nan_filter=False, **kwargs)[source]#
This layer applies GEV beamforming to a multichannel signal. The different channels are assumed to be concatenated to the input feature vector. The first source to the layer must contain the complex spectrograms of the single channels and the second source must contain the noise and speech masks.
- Parameters:
nr_of_channels (int) – number of input channels to beamforming (needed to split the feature vector)
postfilter_id (int) – ID specifying which post filter to apply in GEV beamforming. For more information see tfSi6Proc.audioProcessing.enhancement.beamforming.TfMaskBasedGevBeamformer
qralgorithm_steps (int|None) – nr of steps of the qr algorithm to compute eigen vector for beamforming
output_nan_filter (bool) – if set to true nan values in the beamforming output are replaced by zero
- classmethod get_out_data_from_opts(out_type=None, n_out=None, **kwargs)[source]#
- Parameters:
out_type (dict[str]|None) –
n_out (int|None|returnn.util.basic.NotSpecified) –
- Return type:
Data
- output_before_activation: Optional[OutputWithActivation][source]#
- search_choices: Optional[SearchChoices][source]#
- class returnn.tf.layers.signal_processing.MaskBasedMvdrBeamformingWithDiagLoadingLayer(nr_of_channels=1, diag_loading_coeff=0, qralgorithm_steps=None, output_nan_filter=False, **kwargs)[source]#
This layer applies MVDR beamforming with diagonal loading to a multichannel signal. The different channels are assumed to be concatenated to the input feature vector. The first source to the layer must contain the complex spectrograms of the single channels and the second source must contain the noise and speech masks.
- Parameters:
nr_of_channels (int) – number of input channels to beamforming (needed to split the feature vector)
diag_loading_coeff (int) – weighting coefficient for diagonal loading.
qralgorithm_steps (int|None) – nr of steps of the qr algorithm to compute eigen vector for beamforming
output_nan_filter (bool) – if set to true nan values in the beamforming output are replaced by zero
- classmethod get_out_data_from_opts(out_type=None, n_out=None, **kwargs)[source]#
- Parameters:
out_type (dict[str]|None) –
n_out (int|None|returnn.util.basic.NotSpecified) –
- Return type:
Data
- output_before_activation: Optional[OutputWithActivation][source]#
- search_choices: Optional[SearchChoices][source]#
- class returnn.tf.layers.signal_processing.MelFilterbankLayer(sampling_rate=16000, fft_size=1024, nr_of_filters=80, f_min=None, f_max=None, **kwargs)[source]#
This layer applies the Mel filterbank to the input.
- Parameters:
sampling_rate (int) – sampling rate of the signal which the input originates from
fft_size (int) – FFT size with which the time signal was transformed into the input
nr_of_filters (int) – number of output filter bins
f_min (float) – minimum frequency for mel filters
f_max (float) – maximum frequency for mel filters
- classmethod get_out_data_from_opts(name, sources, n_out=None, **kwargs)[source]#
- Parameters:
name (str) –
sources (list[LayerBase]) –
n_out (int|None|returnn.util.basic.NotSpecified) –
- Return type:
Data
- output_before_activation: Optional[OutputWithActivation][source]#
- search_choices: Optional[SearchChoices][source]#
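To make the MelFilterbankLayer parameters more concrete, here is a minimal NumPy sketch of a triangular Mel filterbank. It is not taken from the RETURNN source: the HTK-style Mel formula, the triangular filter shape, the defaults of 0 Hz and the Nyquist frequency for f_min and f_max when they are None, and the names `hz_to_mel`, `mel_to_hz` and `mel_filterbank` are all assumptions.

```python
import numpy as np


def hz_to_mel(f):
    """HTK-style Mel scale (assumed here, the layer may use another variant)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(sampling_rate=16000, fft_size=1024, nr_of_filters=80,
                   f_min=None, f_max=None):
    """Return a matrix of shape (fft_size // 2 + 1, nr_of_filters)."""
    f_min = 0.0 if f_min is None else float(f_min)  # assumed default
    f_max = sampling_rate / 2.0 if f_max is None else float(f_max)  # assumed default
    n_bins = fft_size // 2 + 1
    bin_freqs = np.linspace(0.0, sampling_rate / 2.0, n_bins)
    # nr_of_filters + 2 edge frequencies, equally spaced on the Mel scale.
    edges = mel_to_hz(np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), nr_of_filters + 2))
    fb = np.zeros((n_bins, nr_of_filters))
    for i in range(nr_of_filters):
        left, center, right = edges[i], edges[i + 1], edges[i + 2]
        rising = (bin_freqs - left) / (center - left)
        falling = (right - bin_freqs) / (right - center)
        # Triangle: rises from left to center, falls from center to right, zero elsewhere.
        fb[:, i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


filterbank = mel_filterbank()
power_spectrum = np.ones((5, 513))  # (frames, fft_size // 2 + 1)
mel_features = power_spectrum @ filterbank
print(mel_features.shape)  # (frames, nr_of_filters)
```

The sketch shows why sampling_rate and fft_size are needed: together they determine the frequency of every input bin, which in turn fixes the filter weights.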
- class returnn.tf.layers.signal_processing.MultiChannelMultiResolutionStftLayer(frame_shift, frame_sizes, fft_sizes, window='hanning', use_rfft=True, nr_of_channels=1, pad_last_frame=False, **kwargs)[source]#
The layer applies an STFT to every channel separately and concatenates the frequency domain vectors for every frame. The STFT is applied with multiple different frame and FFT sizes, and the resulting multi-channel STFTs are concatenated, resulting in a tensor with the content [res_0-ch_0, …, res_0-ch_N, res_1-ch_0, …, res_M-ch_N].
The subsampling from T input samples to T’ output frames is computed as follows: T’ = (T - frame_size) / frame_shift + 1
frame_shift is the same for all resolutions, and T’ is computed according to a reference frame_size, which is taken to be frame_sizes[0]. For all other frame sizes, the input is zero-padded or the output is cut to obtain the same T’ as for the reference frame_size.
- Parameters:
frame_shift (int) – frame shift for stft in samples
frame_sizes (list[int]) – frame sizes for stft in samples, one per resolution
fft_sizes (list[int]) – fft sizes in samples, one per resolution
window (str) – id of the windowing function used. Possible options are: - hanning
use_rfft (bool) – if set to true a real input signal is expected and only the significant half of the FFT bins are returned
nr_of_channels (int) – number of input channels
pad_last_frame (bool) – padding of last frame with zeros or discarding of last frame
- classmethod get_out_data_from_opts(fft_sizes, use_rfft=True, nr_of_channels=1, **kwargs)[source]#
- Parameters:
fft_sizes (list[int]) –
use_rfft (bool) –
nr_of_channels (int) –
- Return type:
Data
- output_before_activation: Optional[OutputWithActivation][source]#
- search_choices: Optional[SearchChoices][source]#
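The subsampling formula T’ = (T - frame_size) / frame_shift + 1 given for MultiChannelMultiResolutionStftLayer can be evaluated with a few lines of Python. The rounding behaviour (floor without padding, ceil when pad_last_frame is set) and the output dimension of fft_size // 2 + 1 bins per channel for use_rfft are assumptions of this sketch, and the helper names `num_frames` and `output_feature_dim` are hypothetical.

```python
import math


def num_frames(num_samples, frame_size, frame_shift, pad_last_frame=False):
    """T' = (T - frame_size) / frame_shift + 1, using the reference frame size.

    Assumption: an incomplete last frame is dropped (floor) unless
    pad_last_frame is set, in which case it is kept (ceil).
    """
    steps = (num_samples - frame_size) / frame_shift
    steps = math.ceil(steps) if pad_last_frame else math.floor(steps)
    return max(steps + 1, 0)


def output_feature_dim(fft_sizes, use_rfft=True, nr_of_channels=1):
    """Number of concatenated frequency bins over all resolutions and channels.

    Assumption: a real FFT keeps fft_size // 2 + 1 bins, a full FFT keeps fft_size.
    """
    bins = [(n // 2 + 1) if use_rfft else n for n in fft_sizes]
    return nr_of_channels * sum(bins)


frames = num_frames(16000, frame_size=400, frame_shift=160)
frames_padded = num_frames(16000, frame_size=400, frame_shift=160, pad_last_frame=True)
dim = output_feature_dim([512, 1024], use_rfft=True, nr_of_channels=2)
print(frames, frames_padded, dim)
```

For one second of 16 kHz audio with a 25 ms frame and a 10 ms shift, the formula gives 98 frames without padding and 99 with padding under these rounding assumptions.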
- class returnn.tf.layers.signal_processing.MultiChannelStftLayer(frame_shift, frame_size, fft_size, window='hanning', use_rfft=True, nr_of_channels=1, pad_last_frame=False, **kwargs)[source]#
The layer applies an STFT to every channel separately and concatenates the frequency domain vectors for every frame.
- Parameters:
frame_shift (int) – frame shift for stft in samples
frame_size (int) – frame size for stft in samples
fft_size (int) – fft size in samples
window (str) – id of the windowing function used. Possible options are: - hanning
use_rfft (bool) – if set to true a real input signal is expected and only the significant half of the FFT bins are returned
nr_of_channels (int) – number of input channels
pad_last_frame (bool) – padding of last frame with zeros or discarding of last frame
- classmethod get_out_data_from_opts(fft_size, use_rfft=True, nr_of_channels=1, **kwargs)[source]#
- Parameters:
fft_size (int) –
use_rfft (bool) –
nr_of_channels (int) –
- Return type:
Data
- output_before_activation: Optional[OutputWithActivation][source]#
- search_choices: Optional[SearchChoices][source]#
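The behaviour described for MultiChannelStftLayer, a separate STFT per channel followed by concatenation of the frequency vectors in each frame, can be sketched in NumPy. The input layout (time, channels), the use of the symmetric `np.hanning` window, dropping an incomplete last frame, and the function name `multichannel_stft` are assumptions of this sketch rather than details taken from the layer.

```python
import numpy as np


def multichannel_stft(signal, frame_shift, frame_size, fft_size, use_rfft=True):
    """signal: real array of shape (time, channels).

    Returns a complex array of shape (frames, channels * bins), where the
    bins of channel 0 come first, then channel 1, and so on.
    """
    num_samples, nr_of_channels = signal.shape
    n_frames = (num_samples - frame_size) // frame_shift + 1
    window = np.hanning(frame_size)
    fft = np.fft.rfft if use_rfft else np.fft.fft
    per_channel = []
    for ch in range(nr_of_channels):
        # Cut the channel into overlapping windowed frames: (frames, frame_size).
        frames = np.stack([
            signal[t * frame_shift: t * frame_shift + frame_size, ch] * window
            for t in range(n_frames)
        ])
        # n=fft_size zero-pads each frame when fft_size > frame_size.
        per_channel.append(fft(frames, n=fft_size, axis=-1))
    return np.concatenate(per_channel, axis=-1)


rng = np.random.default_rng(0)
audio = rng.standard_normal((1600, 2))  # 1600 samples, 2 channels
spec = multichannel_stft(audio, frame_shift=160, frame_size=400, fft_size=512)
print(spec.shape)  # (8, 514): 8 frames, 2 channels * 257 bins
```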
- class returnn.tf.layers.signal_processing.StftLayer(frame_shift, frame_size, fft_size=None, in_spatial_dims=None, out_spatial_dims=None, out_dim=None, use_time_mask=False, **kwargs)[source]#
A generic STFT layer.
- Parameters:
- classmethod get_out_data_from_opts(name, sources, network, frame_shift, frame_size, fft_size=None, in_spatial_dims=None, out_spatial_dims=None, out_dim=None, **kwargs)[source]#
- Parameters:
name (str) –
sources (list[LayerBase]) –
network (returnn.tf.network.TFNetwork) –
frame_shift (int) – frame shift for STFT
frame_size (int) – frame size for STFT
fft_size (Optional[int]) – size of the FFT to apply
in_spatial_dims (list[Dim|str]|None) –
out_spatial_dims (list[Dim]|None) –
out_dim (Dim|None) –
- Return type:
Data
- output_before_activation: Optional[OutputWithActivation][source]#
- search_choices: Optional[SearchChoices][source]#
- class returnn.tf.layers.signal_processing.IstftLayer(frame_shift, frame_size, fft_size=None, in_spatial_dims=None, out_spatial_dims=None, out_dim=None, use_time_mask=False, **kwargs)[source]#
A generic iSTFT layer.
- Parameters:
- classmethod get_out_data_from_opts(name, sources, network, frame_shift, frame_size, in_spatial_dims=None, out_spatial_dims=None, out_dim=None, **kwargs)[source]#
- Parameters:
name (str) –
sources (list[LayerBase]) –
network (returnn.tf.network.TFNetwork) –
frame_shift (int) – frame shift for STFT
frame_size (int) – frame size for STFT
in_spatial_dims (list[Dim|str]|None) –
out_spatial_dims (list[Dim]|None) –
out_dim (Dim|None) –
- Return type:
Data
- output_before_activation: Optional[OutputWithActivation][source]#
- search_choices: Optional[SearchChoices][source]#
- class returnn.tf.layers.signal_processing.NoiseEstimationByFirstTFramesLayer(nr_of_frames, **kwargs)[source]#
Estimates noise from the first t time frames.
- Parameters:
nr_of_frames (int) – the first nr_of_frames frames are used for averaging; all frames are used if nr_of_frames is -1
- output_before_activation: Optional[OutputWithActivation][source]#
- search_choices: Optional[SearchChoices][source]#
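The nr_of_frames semantics of NoiseEstimationByFirstTFramesLayer can be illustrated with a short NumPy sketch. It assumes an input of shape (time, feature) and a plain arithmetic mean over the time axis; the actual layer may operate on other quantities (for example magnitudes) or keep a different output shape, and the function name is hypothetical.

```python
import numpy as np


def estimate_noise_from_first_frames(x: np.ndarray, nr_of_frames: int) -> np.ndarray:
    """Average the first `nr_of_frames` time frames; -1 means all frames.

    Assumption: x has shape (time, feature) and the mean is taken over time.
    """
    frames = x if nr_of_frames == -1 else x[:nr_of_frames]
    return frames.mean(axis=0)


x = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # (time=3, feature=2)
noise_first_two = estimate_noise_from_first_frames(x, nr_of_frames=2)
noise_all = estimate_noise_from_first_frames(x, nr_of_frames=-1)
print(noise_first_two, noise_all)
```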
- class returnn.tf.layers.signal_processing.ParametricWienerFilterLayer(l_overwrite=None, p_overwrite=None, q_overwrite=None, filter_input=None, parameters=None, noise_estimation=None, average_parameters=False, **kwargs)[source]#
Parametric Wiener Filter For related paper, see: Menne, Tobias, Ralf Schlueter, and Hermann Ney. “Investigation into joint optimization of single channel speech enhancement and acoustic modeling for robust ASR.” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019.
- Parameters:
l_overwrite (float|None) – if given overwrites the l value of the parametric wiener filter with given constant
p_overwrite (float|None) – if given overwrites the p value of the parametric wiener filter with given constant
q_overwrite (float|None) – if given overwrites the q value of the parametric wiener filter with given constant
filter_input (LayerBase|None) – name of layer containing input for wiener filter
parameters (LayerBase|None) – name of layer containing parameters for wiener filter
noise_estimation (LayerBase|None) – name of layer containing noise estimate for wiener filter
average_parameters (bool) – if set to true the parameters l, p and q are averaged over the time axis
- classmethod transform_config_dict(d, network, get_layer)[source]#
- Parameters:
d (dict[str]) –
network (TFNetwork.TFNetwork) –
get_layer (((str) -> LayerBase)) –
- output_before_activation: Optional[OutputWithActivation][source]#
- search_choices: Optional[SearchChoices][source]#
- class returnn.tf.layers.signal_processing.SignalMaskingLayer(signal, mask, **kwargs)[source]#
Mask a given signal using a given mask.
- Parameters:
- classmethod transform_config_dict(d, network, get_layer)[source]#
- Parameters:
d (dict[str]) –
network (TFNetwork.TFNetwork) –
get_layer (((str) -> LayerBase)) –
- output_before_activation: Optional[OutputWithActivation][source]#
- search_choices: Optional[SearchChoices][source]#
- class returnn.tf.layers.signal_processing.SplitConcatMultiChannel(nr_of_channels=1, **kwargs)[source]#
This layer assumes the feature vector to be a concatenation of features of multiple channels (of the same size). It splits the feature dimension into equally sized per-channel features and stacks them in the batch dimension. Thus the batch size is multiplied by the number of channels and the feature size is divided by the number of channels. The channels of one signal will have consecutive batch indices, meaning the signal of the original batch index n is split and can now be found in batch indices (n * nr_of_channels) to ((n+1) * nr_of_channels - 1).
- Parameters:
nr_of_channels (int) – the number of concatenated channels in the feature dimension
- classmethod get_out_data_from_opts(name, sources, nr_of_channels, n_out=None, **kwargs)[source]#
- Parameters:
name (str) –
sources (list[LayerBase]) –
nr_of_channels (int) –
n_out (int|None|returnn.util.basic.NotSpecified) –
- Return type:
Data
- output_before_activation: Optional[OutputWithActivation][source]#
- search_choices: Optional[SearchChoices][source]#
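The batch index layout described for SplitConcatMultiChannel can be reproduced with a reshape and a transpose in NumPy. This is a sketch under the assumption of a (batch, time, feature) input layout; the function name `split_concat_multichannel` is hypothetical and the code is not the layer's TensorFlow implementation.

```python
import numpy as np


def split_concat_multichannel(x: np.ndarray, nr_of_channels: int) -> np.ndarray:
    """(batch, time, channels * f) -> (batch * channels, time, f).

    Channel c of original batch entry n ends up at batch index n * nr_of_channels + c.
    """
    batch, time, feat = x.shape
    if feat % nr_of_channels != 0:
        raise ValueError("feature size must be divisible by nr_of_channels")
    per_channel = feat // nr_of_channels
    x = x.reshape(batch, time, nr_of_channels, per_channel)
    x = x.transpose(0, 2, 1, 3)  # (batch, channels, time, f)
    return x.reshape(batch * nr_of_channels, time, per_channel)


x = np.arange(2 * 3 * 4).reshape(2, 3, 4)  # batch=2, time=3, 2 channels of 2 features
y = split_concat_multichannel(x, nr_of_channels=2)
print(y.shape)  # (4, 3, 2)
```

Here the two channels of original batch entry 1 appear at batch indices 2 and 3 of the output.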
- class returnn.tf.layers.signal_processing.TileFeaturesLayer(repetitions=1, **kwargs)[source]#
This layer tiles the features with the given number of repetitions.
- Parameters:
repetitions (int) – number of tiling repetitions in feature domain
- output_before_activation: Optional[OutputWithActivation][source]#
- search_choices: Optional[SearchChoices][source]#
- saveable_param_replace: Dict[tf.Variable, Union['tensorflow.python.training.saver.BaseSaverBuilder.SaveableObject', None]][source]#
- classmethod get_out_data_from_opts(name, sources, repetitions, n_out=None, **kwargs)[source]#
- Parameters:
name (str) –
sources (list[LayerBase]) –
repetitions (int) –
n_out (int|None|returnn.util.basic.NotSpecified) –
- Return type:
Data
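Tiling in the feature domain, as done by TileFeaturesLayer, can be sketched with `np.tile`. The sketch assumes that the whole feature vector is repeated end to end (as opposed to repeating each element in place) and that the feature axis is the last axis; the function name `tile_features` is hypothetical.

```python
import numpy as np


def tile_features(x: np.ndarray, repetitions: int) -> np.ndarray:
    """Repeat the feature vector `repetitions` times along the last axis."""
    reps = (1,) * (x.ndim - 1) + (repetitions,)
    return np.tile(x, reps)


x = np.array([[1, 2], [3, 4]])  # (time=2, feature=2)
tiled = tile_features(x, repetitions=3)
print(tiled.shape)  # the feature dimension is multiplied by repetitions
```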