returnn.tf.layers.signal_processing#

Defines multiple signal processing-related layers.

class returnn.tf.layers.signal_processing.AlternatingRealToComplexLayer(**kwargs)[source]#

This layer converts a real-valued input tensor into a complex-valued output tensor. For this, the even and odd features are considered the real and imaginary parts of one complex number, respectively.

Parameters:
  • in_dim (Dim|None) –

  • out_shape (set[Dim|returnn.tf.util.data._MarkedDim]|tuple|list|None) –

  • dropout (float) – 0.0 means to apply no dropout. Dropout will only be applied during training

  • dropout_axis (Dim|str|list[Dim|str]|None) –

  • dropout_noise_shape (dict[Dim|str|list[Dim|str]|tuple[Dim|str],int|str|None]|None) – see Data.get_bc_shape()

  • dropout_on_forward (bool) – apply dropout during inference

  • mask (str|None) – “dropout” or “unity” or None. This is obsolete and only here for historical reasons

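Example (a minimal, hypothetical net-dict sketch; layer names and dimensions are illustrative, not part of this module):

    # "enc" is assumed to output an even feature dimension with features
    # alternating real/imag; the result has half as many complex features.
    network = {
        "enc": {"class": "linear", "activation": None, "n_out": 512, "from": "data"},
        "as_complex": {"class": "alternating_real_to_complex", "from": "enc"},  # 256 complex values
    }
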
layer_class: Optional[str] = 'alternating_real_to_complex'[source]#
classmethod get_out_data_from_opts(name, sources, n_out=None, **kwargs)[source]#
Parameters:
  • name (str) –

  • sources (list[LayerBase]) –

  • n_out (int|None) –

Return type:

Data

kwargs: Optional[Dict[str]][source]#
output_before_activation: Optional[OutputWithActivation][source]#
output_loss: Optional[tf.Tensor][source]#
rec_vars_outputs: Dict[str, tf.Tensor][source]#
search_choices: Optional[SearchChoices][source]#
params: Dict[str, tf.Variable][source]#
saveable_param_replace: Dict[tf.Variable, Union['tensorflow.python.training.saver.BaseSaverBuilder.SaveableObject', None]][source]#
stats: Dict[str, tf.Tensor][source]#
class returnn.tf.layers.signal_processing.BatchMedianPoolingLayer(pool_size=1, **kwargs)[source]#

This layer is used to pool together batch entries by taking their median value. Thus, the batch size is divided by pool_size. The stride is hard-coded to be equal to the pool size.

Parameters:

pool_size (int) – size of the pool to take median of (is also used as stride size)

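Example (a hedged sketch; the source name and pool size are illustrative):

    # Pools every 3 consecutive batch entries into one by taking their
    # element-wise median, so the batch size shrinks by a factor of 3.
    network = {
        "median_pool": {"class": "batch_median_pooling", "from": "data", "pool_size": 3},
    }
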
layer_class: Optional[str] = 'batch_median_pooling'[source]#
classmethod get_out_data_from_opts(name, sources, pool_size, n_out=None, **kwargs)[source]#
Parameters:
  • name (str) –

  • sources (list[LayerBase]) –

  • pool_size (int) –

  • n_out (int|None) –

Return type:

Data

kwargs: Optional[Dict[str]][source]#
output_before_activation: Optional[OutputWithActivation][source]#
output_loss: Optional[tf.Tensor][source]#
rec_vars_outputs: Dict[str, tf.Tensor][source]#
search_choices: Optional[SearchChoices][source]#
params: Dict[str, tf.Variable][source]#
saveable_param_replace: Dict[tf.Variable, Union['tensorflow.python.training.saver.BaseSaverBuilder.SaveableObject', None]][source]#
stats: Dict[str, tf.Tensor][source]#
class returnn.tf.layers.signal_processing.ComplexLinearProjectionLayer(nr_of_filters, clp_weights_init='glorot_uniform', **kwargs)[source]#

Complex linear projection layer. For the original idea, see: Variani, Ehsan, et al. “Complex linear projection (CLP): A discriminative approach to joint feature extraction and acoustic modeling.” (2016).

Parameters:
  • nr_of_filters (int) –

  • clp_weights_init (str|dict[str]|float|numpy.ndarray) –

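Example (a hedged sketch; “complex_spec” stands for any upstream layer providing a complex-valued spectrum and is illustrative):

    network = {
        "clp": {"class": "complex_linear_projection", "from": "complex_spec",
                "nr_of_filters": 80},  # learned complex projection to 80 filters
    }
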
layer_class: Optional[str] = 'complex_linear_projection'[source]#
classmethod get_out_data_from_opts(nr_of_filters, **kwargs)[source]#
Parameters:

nr_of_filters (int) –

Return type:

Data

kwargs: Optional[Dict[str]][source]#
output_before_activation: Optional[OutputWithActivation][source]#
output_loss: Optional[tf.Tensor][source]#
rec_vars_outputs: Dict[str, tf.Tensor][source]#
search_choices: Optional[SearchChoices][source]#
params: Dict[str, tf.Variable][source]#
saveable_param_replace: Dict[tf.Variable, Union['tensorflow.python.training.saver.BaseSaverBuilder.SaveableObject', None]][source]#
stats: Dict[str, tf.Tensor][source]#
class returnn.tf.layers.signal_processing.ComplexToAlternatingRealLayer(**kwargs)[source]#

This layer converts a complex-valued input tensor into a real-valued output tensor. For this, the even and odd parts of the output are considered the real and imaginary parts of one complex number, respectively.

Parameters:
  • in_dim (Dim|None) –

  • out_shape (set[Dim|returnn.tf.util.data._MarkedDim]|tuple|list|None) –

  • dropout (float) – 0.0 means to apply no dropout. Dropout will only be applied during training

  • dropout_axis (Dim|str|list[Dim|str]|None) –

  • dropout_noise_shape (dict[Dim|str|list[Dim|str]|tuple[Dim|str],int|str|None]|None) – see Data.get_bc_shape()

  • dropout_on_forward (bool) – apply dropout during inference

  • mask (str|None) – “dropout” or “unity” or None. This is obsolete and only here for historical reasons

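Example (a hedged round-trip sketch with the layer described above; names are illustrative):

    network = {
        "as_complex": {"class": "alternating_real_to_complex", "from": "data"},
        # back to a real-valued tensor with alternating real/imag features
        "as_real": {"class": "complex_to_alternating_real", "from": "as_complex"},
    }
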
layer_class: Optional[str] = 'complex_to_alternating_real'[source]#
kwargs: Optional[Dict[str]][source]#
output_before_activation: Optional[OutputWithActivation][source]#
output_loss: Optional[tf.Tensor][source]#
rec_vars_outputs: Dict[str, tf.Tensor][source]#
search_choices: Optional[SearchChoices][source]#
params: Dict[str, tf.Variable][source]#
saveable_param_replace: Dict[tf.Variable, Union['tensorflow.python.training.saver.BaseSaverBuilder.SaveableObject', None]][source]#
stats: Dict[str, tf.Tensor][source]#
class returnn.tf.layers.signal_processing.MaskBasedGevBeamformingLayer(nr_of_channels=1, postfilter_id=0, qralgorithm_steps=None, output_nan_filter=False, **kwargs)[source]#

This layer applies GEV beamforming to a multichannel signal. The different channels are assumed to be concatenated to the input feature vector. The first source to the layer must contain the complex spectrograms of the single channels, and the second source must contain the noise and speech masks.

Parameters:
  • nr_of_channels (int) – number of input channels to beamforming (needed to split the feature vector)

  • postfilter_id (int) – ID specifying which post-filter to apply in GEV beamforming. For more information see tfSi6Proc.audioProcessing.enhancement.beamforming.TfMaskBasedGevBeamformer

  • qralgorithm_steps (int|None) – number of steps of the QR algorithm used to compute the eigenvector for beamforming

  • output_nan_filter (bool) – if set to true, NaN values in the beamforming output are replaced by zero

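Example (a hedged sketch; “stft_multichannel” and “masks” are illustrative placeholders for the complex spectrograms and the noise/speech masks):

    network = {
        "beamform": {"class": "mask_based_gevbeamforming",
                     "from": ["stft_multichannel", "masks"],
                     "nr_of_channels": 8, "output_nan_filter": True},
    }
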
layer_class: Optional[str] = 'mask_based_gevbeamforming'[source]#
classmethod get_out_data_from_opts(out_type=None, n_out=None, **kwargs)[source]#
Parameters:
  • out_type (dict[str]|None) –

  • n_out (int|None) –

Return type:

Data

kwargs: Optional[Dict[str]][source]#
output_before_activation: Optional[OutputWithActivation][source]#
output_loss: Optional[tf.Tensor][source]#
rec_vars_outputs: Dict[str, tf.Tensor][source]#
search_choices: Optional[SearchChoices][source]#
params: Dict[str, tf.Variable][source]#
saveable_param_replace: Dict[tf.Variable, Union['tensorflow.python.training.saver.BaseSaverBuilder.SaveableObject', None]][source]#
stats: Dict[str, tf.Tensor][source]#
class returnn.tf.layers.signal_processing.MaskBasedMvdrBeamformingWithDiagLoadingLayer(nr_of_channels=1, diag_loading_coeff=0, qralgorithm_steps=None, output_nan_filter=False, **kwargs)[source]#

This layer applies MVDR beamforming with diagonal loading to a multichannel signal. The different channels are assumed to be concatenated to the input feature vector. The first source to the layer must contain the complex spectrograms of the single channels, and the second source must contain the noise and speech masks.

Parameters:
  • nr_of_channels (int) – number of input channels to beamforming (needed to split the feature vector)

  • diag_loading_coeff (int) – weighting coefficient for diagonal loading

  • qralgorithm_steps (int|None) – number of steps of the QR algorithm used to compute the eigenvector for beamforming

  • output_nan_filter (bool) – if set to true, NaN values in the beamforming output are replaced by zero

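Example (a hedged sketch, analogous to the GEV layer above; source names are illustrative):

    network = {
        "beamform": {"class": "mask_based_mvdrbeamforming",
                     "from": ["stft_multichannel", "masks"],
                     "nr_of_channels": 8, "output_nan_filter": True},
    }
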
layer_class: Optional[str] = 'mask_based_mvdrbeamforming'[source]#
classmethod get_out_data_from_opts(out_type=None, n_out=None, **kwargs)[source]#
Parameters:
  • out_type (dict[str]|None) –

  • n_out (int|None) –

Return type:

Data

kwargs: Optional[Dict[str]][source]#
output_before_activation: Optional[OutputWithActivation][source]#
output_loss: Optional[tf.Tensor][source]#
rec_vars_outputs: Dict[str, tf.Tensor][source]#
search_choices: Optional[SearchChoices][source]#
params: Dict[str, tf.Variable][source]#
saveable_param_replace: Dict[tf.Variable, Union['tensorflow.python.training.saver.BaseSaverBuilder.SaveableObject', None]][source]#
stats: Dict[str, tf.Tensor][source]#
class returnn.tf.layers.signal_processing.MelFilterbankLayer(sampling_rate=16000, fft_size=1024, nr_of_filters=80, f_min=None, f_max=None, **kwargs)[source]#

This layer applies the Mel filterbank to the input.

Parameters:
  • sampling_rate (int) – sampling rate of the signal which the input originates from

  • fft_size (int) – FFT size with which the time signal was transformed into the input

  • nr_of_filters (int) – number of output filter bins

  • f_min (float) – minimum frequency for mel filters

  • f_max (float) – maximum frequency for mel filters

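Example (a hedged sketch; “power_spec” is an illustrative upstream layer providing the spectrum the filterbank is applied to):

    network = {
        "mel": {"class": "mel_filterbank", "from": "power_spec",
                "sampling_rate": 16000, "fft_size": 1024, "nr_of_filters": 80},
    }
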
layer_class: Optional[str] = 'mel_filterbank'[source]#
classmethod get_out_data_from_opts(name, sources, n_out=None, **kwargs)[source]#
Parameters:
  • name (str) –

  • sources (list[LayerBase]) –

  • n_out (int|None) –

Return type:

Data

kwargs: Optional[Dict[str]][source]#
output_before_activation: Optional[OutputWithActivation][source]#
output_loss: Optional[tf.Tensor][source]#
rec_vars_outputs: Dict[str, tf.Tensor][source]#
search_choices: Optional[SearchChoices][source]#
params: Dict[str, tf.Variable][source]#
saveable_param_replace: Dict[tf.Variable, Union['tensorflow.python.training.saver.BaseSaverBuilder.SaveableObject', None]][source]#
stats: Dict[str, tf.Tensor][source]#
class returnn.tf.layers.signal_processing.MultiChannelMultiResolutionStftLayer(frame_shift, frame_sizes, fft_sizes, window='hanning', use_rfft=True, nr_of_channels=1, pad_last_frame=False, **kwargs)[source]#

This layer applies an STFT to every channel separately and concatenates the frequency-domain vectors for every frame. The STFT is applied with multiple different frame and FFT sizes, and the resulting multi-channel STFTs are concatenated, yielding a tensor with the content [res_0-ch_0, …, res_0-ch_N, res_1-ch_0, …, res_M-ch_N]. The subsampling from T input samples to T’ output frames is computed as T’ = (T - frame_size) / frame_shift + 1; for example, T = 1000 samples with frame_size = 200 and frame_shift = 100 yield T’ = (1000 - 200) / 100 + 1 = 9 frames. frame_shift is the same for all resolutions, and T’ is computed according to a reference frame_size, which is taken to be frame_sizes[0]. For all other frame sizes, the input is zero-padded or the output is cut to obtain the same T’ as for the reference frame_size.

Parameters:
  • frame_shift (int) – frame shift for the STFT in samples

  • frame_sizes (list[int]) – frame sizes for the STFT in samples, one per resolution

  • fft_sizes (list[int]) – FFT sizes in samples, one per resolution

  • window (str) – name of the window function used; the only supported option is “hanning”

  • use_rfft (bool) – if set to true, a real input signal is expected and only the significant half of the FFT bins is returned

  • nr_of_channels (int) – number of input channels

  • pad_last_frame (bool) – whether to pad the last frame with zeros (instead of discarding it)

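Example (a hedged sketch; all values are illustrative): two resolutions over a 2-channel waveform input, with frame_sizes[0] = 400 acting as the reference frame size:

    network = {
        "stft": {"class": "multichannel_multiresolution_stft_layer", "from": "data",
                 "frame_shift": 160, "frame_sizes": [400, 800],
                 "fft_sizes": [512, 1024], "nr_of_channels": 2},
    }
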
layer_class: Optional[str] = 'multichannel_multiresolution_stft_layer'[source]#
recurrent = True[source]#
classmethod get_out_data_from_opts(fft_sizes, use_rfft=True, nr_of_channels=1, **kwargs)[source]#
Parameters:
  • fft_sizes (list[int]) –

  • use_rfft (bool) –

  • nr_of_channels (int) –

Return type:

Data

kwargs: Optional[Dict[str]][source]#
output_before_activation: Optional[OutputWithActivation][source]#
output_loss: Optional[tf.Tensor][source]#
rec_vars_outputs: Dict[str, tf.Tensor][source]#
search_choices: Optional[SearchChoices][source]#
params: Dict[str, tf.Variable][source]#
saveable_param_replace: Dict[tf.Variable, Union['tensorflow.python.training.saver.BaseSaverBuilder.SaveableObject', None]][source]#
stats: Dict[str, tf.Tensor][source]#
class returnn.tf.layers.signal_processing.MultiChannelStftLayer(frame_shift, frame_size, fft_size, window='hanning', use_rfft=True, nr_of_channels=1, pad_last_frame=False, **kwargs)[source]#

This layer applies an STFT to every channel separately and concatenates the frequency-domain vectors for every frame.

Parameters:
  • frame_shift (int) – frame shift for the STFT in samples

  • frame_size (int) – frame size for the STFT in samples

  • fft_size (int) – FFT size in samples

  • window (str) – name of the window function used; the only supported option is “hanning”

  • use_rfft (bool) – if set to true, a real input signal is expected and only the significant half of the FFT bins is returned

  • nr_of_channels (int) – number of input channels

  • pad_last_frame (bool) – whether to pad the last frame with zeros (instead of discarding it)

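Example (a hedged sketch; values are illustrative, e.g. 25 ms frames with a 10 ms shift at 16 kHz):

    network = {
        "stft": {"class": "multichannel_stft_layer", "from": "data",
                 "frame_shift": 160, "frame_size": 400, "fft_size": 512,
                 "nr_of_channels": 2},
    }
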
recurrent = True[source]#
layer_class: Optional[str] = 'multichannel_stft_layer'[source]#
classmethod get_out_data_from_opts(fft_size, use_rfft=True, nr_of_channels=1, **kwargs)[source]#
Parameters:
  • fft_size (int) –

  • use_rfft (bool) –

  • nr_of_channels (int) –

Return type:

Data

kwargs: Optional[Dict[str]][source]#
output_before_activation: Optional[OutputWithActivation][source]#
output_loss: Optional[tf.Tensor][source]#
rec_vars_outputs: Dict[str, tf.Tensor][source]#
search_choices: Optional[SearchChoices][source]#
params: Dict[str, tf.Variable][source]#
saveable_param_replace: Dict[tf.Variable, Union['tensorflow.python.training.saver.BaseSaverBuilder.SaveableObject', None]][source]#
stats: Dict[str, tf.Tensor][source]#
input_data: Optional[Data][source]#
class returnn.tf.layers.signal_processing.StftLayer(frame_shift, frame_size, fft_size=None, in_spatial_dims=None, out_spatial_dims=None, out_dim=None, use_time_mask=False, **kwargs)[source]#

A generic STFT layer.

Parameters:
  • frame_shift (int) – frame shift for STFT

  • frame_size (int) – frame size for STFT

  • fft_size (Optional[int]) – size of the FFT to apply

  • in_spatial_dims (list[Dim|str]|None) –

  • out_spatial_dims (list[Dim]|None) –

  • out_dim (Dim|None) –

  • use_time_mask (bool) –

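Example (a hedged sketch; values are illustrative):

    network = {
        # complex STFT of a single-channel waveform input
        "stft": {"class": "stft", "from": "data",
                 "frame_shift": 160, "frame_size": 400, "fft_size": 512},
    }
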
layer_class: Optional[str] = 'stft'[source]#
recurrent = True[source]#
classmethod get_out_data_from_opts(name, sources, network, frame_shift, frame_size, fft_size=None, in_spatial_dims=None, out_spatial_dims=None, out_dim=None, **kwargs)[source]#
Parameters:
  • name (str) –

  • sources (list[LayerBase]) –

  • network (returnn.tf.network.TFNetwork) –

  • frame_shift (int) – frame shift for STFT

  • frame_size (int) – frame size for STFT

  • fft_size (Optional[int]) – size of the FFT to apply

  • in_spatial_dims (list[Dim|str]|None) –

  • out_spatial_dims (list[Dim]|None) –

  • out_dim (Dim|None) –

Return type:

Data

kwargs: Optional[Dict[str]][source]#
output_before_activation: Optional[OutputWithActivation][source]#
output_loss: Optional[tf.Tensor][source]#
rec_vars_outputs: Dict[str, tf.Tensor][source]#
search_choices: Optional[SearchChoices][source]#
params: Dict[str, tf.Variable][source]#
saveable_param_replace: Dict[tf.Variable, Union['tensorflow.python.training.saver.BaseSaverBuilder.SaveableObject', None]][source]#
stats: Dict[str, tf.Tensor][source]#
class returnn.tf.layers.signal_processing.IstftLayer(frame_shift, frame_size, fft_size=None, in_spatial_dims=None, out_spatial_dims=None, out_dim=None, use_time_mask=False, **kwargs)[source]#

A generic iSTFT layer.

Parameters:
  • frame_shift (int) – frame shift for STFT

  • frame_size (int) – frame size for STFT

  • fft_size (Optional[int]) – size of the FFT to apply

  • in_spatial_dims (list[Dim|str]|None) –

  • out_spatial_dims (list[Dim]|None) –

  • out_dim (Dim|None) –

  • use_time_mask (bool) –

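Example (a hedged sketch pairing this layer with the stft layer above; in practice a mask or filter would typically be applied in between):

    network = {
        "stft": {"class": "stft", "from": "data",
                 "frame_shift": 160, "frame_size": 400, "fft_size": 512},
        # back to the time domain with matching STFT settings
        "istft": {"class": "istft", "from": "stft",
                  "frame_shift": 160, "frame_size": 400, "fft_size": 512},
    }
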
layer_class: Optional[str] = 'istft'[source]#
recurrent = True[source]#
classmethod get_out_data_from_opts(name, sources, network, frame_shift, frame_size, in_spatial_dims=None, out_spatial_dims=None, out_dim=None, **kwargs)[source]#
Parameters:
  • name (str) –

  • sources (list[LayerBase]) –

  • network (returnn.tf.network.TFNetwork) –

  • frame_shift (int) – frame shift for STFT

  • frame_size (int) – frame size for STFT

  • in_spatial_dims (list[Dim|str]|None) –

  • out_spatial_dims (list[Dim]|None) –

  • out_dim (Dim|None) –

Return type:

Data

kwargs: Optional[Dict[str]][source]#
output_before_activation: Optional[OutputWithActivation][source]#
output_loss: Optional[tf.Tensor][source]#
rec_vars_outputs: Dict[str, tf.Tensor][source]#
search_choices: Optional[SearchChoices][source]#
params: Dict[str, tf.Variable][source]#
saveable_param_replace: Dict[tf.Variable, Union['tensorflow.python.training.saver.BaseSaverBuilder.SaveableObject', None]][source]#
stats: Dict[str, tf.Tensor][source]#
class returnn.tf.layers.signal_processing.NoiseEstimationByFirstTFramesLayer(nr_of_frames, **kwargs)[source]#

Estimates noise from the first t time frames.

Parameters:

nr_of_frames (int) – the first nr_of_frames frames are used for averaging; all frames are used if nr_of_frames is -1

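Example (a hedged sketch; “power_spec” is an illustrative upstream spectrum layer):

    network = {
        # average the first 10 frames as a stationary noise estimate
        "noise_est": {"class": "first_t_frames_noise_estimator",
                      "from": "power_spec", "nr_of_frames": 10},
    }
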
layer_class: Optional[str] = 'first_t_frames_noise_estimator'[source]#
recurrent = True[source]#
kwargs: Optional[Dict[str]][source]#
output_before_activation: Optional[OutputWithActivation][source]#
output_loss: Optional[tf.Tensor][source]#
rec_vars_outputs: Dict[str, tf.Tensor][source]#
search_choices: Optional[SearchChoices][source]#
params: Dict[str, tf.Variable][source]#
saveable_param_replace: Dict[tf.Variable, Union['tensorflow.python.training.saver.BaseSaverBuilder.SaveableObject', None]][source]#
stats: Dict[str, tf.Tensor][source]#
class returnn.tf.layers.signal_processing.ParametricWienerFilterLayer(l_overwrite=None, p_overwrite=None, q_overwrite=None, filter_input=None, parameters=None, noise_estimation=None, average_parameters=False, **kwargs)[source]#

Parametric Wiener filter. For the related paper, see: Menne, Tobias, Ralf Schlueter, and Hermann Ney. “Investigation into joint optimization of single channel speech enhancement and acoustic modeling for robust ASR.” IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019.

Parameters:
  • l_overwrite (float|None) – if given, overwrites the l value of the parametric Wiener filter with the given constant

  • p_overwrite (float|None) – if given, overwrites the p value of the parametric Wiener filter with the given constant

  • q_overwrite (float|None) – if given, overwrites the q value of the parametric Wiener filter with the given constant

  • filter_input (LayerBase|None) – name of the layer containing the input for the Wiener filter

  • parameters (LayerBase|None) – name of the layer containing the parameters for the Wiener filter

  • noise_estimation (LayerBase|None) – name of the layer containing the noise estimate for the Wiener filter

  • average_parameters (bool) – if set to true, the parameters l, p and q are averaged over the time axis

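Example (a hedged sketch; “stft”, “param_est” and “noise_est” are illustrative layer names for the filter input, the parameter estimates and the noise estimate):

    network = {
        "wiener": {"class": "parametric_wiener_filter",
                   "filter_input": "stft", "parameters": "param_est",
                   "noise_estimation": "noise_est", "average_parameters": True},
    }
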
layer_class: Optional[str] = 'parametric_wiener_filter'[source]#
classmethod get_out_data_from_opts(**kwargs)[source]#
Return type:

Data

classmethod transform_config_dict(d, network, get_layer)[source]#
Parameters:
  • d (dict[str]) –

  • network (TFNetwork.TFNetwork) –

  • get_layer (((str) -> LayerBase)) –

kwargs: Optional[Dict[str]][source]#
output_before_activation: Optional[OutputWithActivation][source]#
output_loss: Optional[tf.Tensor][source]#
rec_vars_outputs: Dict[str, tf.Tensor][source]#
search_choices: Optional[SearchChoices][source]#
params: Dict[str, tf.Variable][source]#
saveable_param_replace: Dict[tf.Variable, Union['tensorflow.python.training.saver.BaseSaverBuilder.SaveableObject', None]][source]#
stats: Dict[str, tf.Tensor][source]#
class returnn.tf.layers.signal_processing.SignalMaskingLayer(signal, mask, **kwargs)[source]#

Mask a given signal using a given mask.

Parameters:
  • signal (LayerBase) – name of the layer containing the signal to be masked

  • mask (LayerBase) – name of layer containing the mask

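Example (a hedged sketch; “stft” and “mask_est” are illustrative layer names):

    network = {
        "masked": {"class": "signal_masking", "signal": "stft", "mask": "mask_est"},
    }
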
layer_class: Optional[str] = 'signal_masking'[source]#
classmethod transform_config_dict(d, network, get_layer)[source]#
Parameters:
  • d (dict[str]) –

  • network (TFNetwork.TFNetwork) –

  • get_layer (((str) -> LayerBase)) –

kwargs: Optional[Dict[str]][source]#
output_before_activation: Optional[OutputWithActivation][source]#
output_loss: Optional[tf.Tensor][source]#
rec_vars_outputs: Dict[str, tf.Tensor][source]#
search_choices: Optional[SearchChoices][source]#
params: Dict[str, tf.Variable][source]#
saveable_param_replace: Dict[tf.Variable, Union['tensorflow.python.training.saver.BaseSaverBuilder.SaveableObject', None]][source]#
stats: Dict[str, tf.Tensor][source]#
class returnn.tf.layers.signal_processing.SplitConcatMultiChannel(nr_of_channels=1, **kwargs)[source]#

This layer assumes the feature vector to be a concatenation of features of multiple channels (of the same size). It splits the feature dimension into equally sized per-channel features and stacks them in the batch dimension. Thus, the batch size is multiplied by the number of channels and the feature size is divided by the number of channels. The channels of one signal will have consecutive batch indices, meaning the signal of the original batch index n is split and can now be found in batch indices (n * nr_of_channels) to ((n + 1) * nr_of_channels - 1).

Parameters:

nr_of_channels (int) – the number of concatenated channels in the feature dimension

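Example (a hedged sketch; values are illustrative): with a batch of 4 signals and 8 concatenated channels, the output has batch size 32, and the channels of the signal at original batch index 1 end up at batch indices 8 through 15:

    network = {
        "split": {"class": "split_concatenated_multichannel", "from": "data",
                  "nr_of_channels": 8},
    }
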
layer_class: Optional[str] = 'split_concatenated_multichannel'[source]#
classmethod get_out_data_from_opts(name, sources, nr_of_channels, n_out=None, **kwargs)[source]#
Parameters:
  • name (str) –

  • sources (list[LayerBase]) –

  • nr_of_channels (int) –

  • n_out (int|None) –

Return type:

Data

kwargs: Optional[Dict[str]][source]#
output_before_activation: Optional[OutputWithActivation][source]#
output_loss: Optional[tf.Tensor][source]#
rec_vars_outputs: Dict[str, tf.Tensor][source]#
search_choices: Optional[SearchChoices][source]#
params: Dict[str, tf.Variable][source]#
saveable_param_replace: Dict[tf.Variable, Union['tensorflow.python.training.saver.BaseSaverBuilder.SaveableObject', None]][source]#
stats: Dict[str, tf.Tensor][source]#
class returnn.tf.layers.signal_processing.TileFeaturesLayer(repetitions=1, **kwargs)[source]#

This layer tiles the features a given number of times.

Parameters:

repetitions (int) – number of tiling repetitions in feature domain

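Example (a hedged sketch; names and values are illustrative):

    network = {
        # repeats the feature vector 3 times along the feature axis
        "tiled": {"class": "tile_features", "from": "enc", "repetitions": 3},
    }
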
layer_class: Optional[str] = 'tile_features'[source]#
kwargs: Optional[Dict[str]][source]#
output_before_activation: Optional[OutputWithActivation][source]#
output_loss: Optional[tf.Tensor][source]#
rec_vars_outputs: Dict[str, tf.Tensor][source]#
search_choices: Optional[SearchChoices][source]#
params: Dict[str, tf.Variable][source]#
saveable_param_replace: Dict[tf.Variable, Union['tensorflow.python.training.saver.BaseSaverBuilder.SaveableObject', None]][source]#
stats: Dict[str, tf.Tensor][source]#
classmethod get_out_data_from_opts(name, sources, repetitions, n_out=None, **kwargs)[source]#
Parameters:
  • name (str) –

  • sources (list[LayerBase]) –

  • repetitions (int) –

  • n_out (int|None) –

Return type:

Data