Normalization Layers

Generic Normalization Layer

class returnn.tf.layers.basic.NormLayer(axis=<class 'returnn.util.basic.NotSpecified'>, axes=<class 'returnn.util.basic.NotSpecified'>, param_shape=<class 'returnn.util.basic.NotSpecified'>, scale=True, bias=True, epsilon=1e-06, **kwargs)

Normalize over specified axes, e.g. time and/or feature axis.

Note: For calculating a norm, see MathNormLayer instead.

If only the feature axis is normalized (axes="F"), this corresponds to layer normalization (see LayerNormLayer). If time and feature are normalized (axes="TF") for a 3D input, or, more generally, all axes except batch (axes="except_batch"), this corresponds to group normalization with G=1, or non-standard layer normalization. (The definition of layer normalization does not clearly specify which axes to normalize over. In many other frameworks, the default is just the last axis, which is usually the feature axis. However, in certain implementations and models, it is also common to normalize over all axes except batch.)

The statistics are calculated just on the input. There are no running statistics (in contrast to batch normalization, see BatchNormLayer).
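The effect of the axes choice can be illustrated with a minimal NumPy sketch. It is not the layer's actual TensorFlow implementation: the function name `normalize`, the [batch, time, feature] layout, and the placement of epsilon inside the square root are assumptions made for illustration.

```python
import numpy as np


def normalize(x, axes, scale=None, bias=None, epsilon=1e-6):
    """Normalize x to zero mean and unit variance over the given axes.

    Hypothetical helper for illustration; statistics come from x only,
    there are no running statistics.
    """
    mean = x.mean(axis=axes, keepdims=True)
    var = ((x - mean) ** 2).mean(axis=axes, keepdims=True)
    y = (x - mean) / np.sqrt(var + epsilon)
    if scale is not None:
        y = y * scale  # broadcasts, e.g. a [feature]-shaped parameter
    if bias is not None:
        y = y + bias
    return y


rng = np.random.default_rng(0)
x = rng.normal(size=(2, 5, 4))  # assumed layout: [batch, time, feature]

y_f = normalize(x, axes=(2,))     # like axes="F": per frame, over features
y_tf = normalize(x, axes=(1, 2))  # like axes="TF": per sequence, over time and features
```

With axes="F", every frame is normalized on its own. With axes="TF", one mean and one variance are shared by the whole sequence of a batch entry.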

For some discussion on the definition of layer-norm vs group-norm, also see here and here.

Parameters:
  • axis (Dim|str|list[Dim|str]) – axis or axes over which the mean and variance are computed, e.g. "F" or "TF"

  • axes (Dim|str|list[Dim|str]) – axis or axes over which the mean and variance are computed, e.g. "F" or "TF"

  • param_shape (Dim|str|list[Dim|str]|tuple[Dim|str]) – shape of the scale and bias parameters. You can also refer to (static) axes of the input, such as the feature-dim. This is also the default, i.e. a param-shape of [F], independent of the axes to normalize over.

  • scale (bool) – add trainable scale parameters

  • bias (bool) – add trainable bias parameters

  • epsilon (float) – epsilon for numerical stability
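As a usage sketch, the parameters above might appear in a network dict as follows. The layer names `norm_feat` and `norm_all` and the source layer `"data"` are assumptions chosen for illustration; only the `"class"` value and the option names come from this page.

```python
# Hypothetical network fragment; layer names and the "data" source are illustrative.
network = {
    # Normalize over the feature axis only (corresponds to layer normalization).
    "norm_feat": {
        "class": "norm",
        "axes": "F",
        "scale": True,
        "bias": True,
        "epsilon": 1e-6,
        "from": "data",
    },
    # Normalize over all axes except batch (group normalization with G=1).
    "norm_all": {
        "class": "norm",
        "axes": "except_batch",
        "from": "data",
    },
}
```

Since param_shape is left unset in both layers, the scale and bias parameters keep their default shape of [F].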

layer_class: Optional[str] = 'norm'
classmethod get_out_data_from_opts(sources, name, **kwargs)
Parameters:
  • sources (list[LayerBase]) –

  • name (str) –

Return type:

Data

kwargs: Optional[Dict[str]]
output_before_activation: Optional[OutputWithActivation]
output_loss: Optional[tf.Tensor]
rec_vars_outputs: Dict[str, tf.Tensor]
search_choices: Optional[SearchChoices]
params: Dict[str, tf.Variable]
saveable_param_replace: Dict[tf.Variable, Union['tensorflow.python.training.saver.BaseSaverBuilder.SaveableObject', None]]
stats: Dict[str, tf.Tensor]
input_data: Optional[Data]

Batch-Normalization Layer

class returnn.tf.layers.basic.BatchNormLayer(in_dim=None, use_shift=<class 'returnn.util.basic.NotSpecified'>, use_std=<class 'returnn.util.basic.NotSpecified'>, use_sample=<class 'returnn.util.basic.NotSpecified'>, force_sample=<class 'returnn.util.basic.NotSpecified'>, momentum=<class 'returnn.util.basic.NotSpecified'>, epsilon=<class 'returnn.util.basic.NotSpecified'>, update_sample_only_in_training=<class 'returnn.util.basic.NotSpecified'>, delay_sample_update=<class 'returnn.util.basic.NotSpecified'>, param_version=<class 'returnn.util.basic.NotSpecified'>, gamma_init=<class 'returnn.util.basic.NotSpecified'>, beta_init=<class 'returnn.util.basic.NotSpecified'>, masked_time=<class 'returnn.util.basic.NotSpecified'>, **kwargs)

Implements batch-normalization (https://arxiv.org/abs/1502.03167) as a separate layer.

Also see NormLayer.

Parameters:
  • in_dim (returnn.tensor.Dim|None) –

  • use_shift (bool) –

  • use_std (bool) –

  • use_sample (float) – defaults to 0.0, which is used in training

  • force_sample (bool) – even in eval, use the use_sample factor

  • momentum (float) – for the running average of sample_mean and sample_std

  • update_sample_only_in_training (bool) –

  • delay_sample_update (bool) –

  • param_version (int) – 0 or 1 or 2

  • epsilon (float) –

  • gamma_init (str|float) – see returnn.tf.util.basic.get_initializer(), for the scale

  • beta_init (str|float) – see returnn.tf.util.basic.get_initializer(), for the mean

  • masked_time (bool) – flatten and mask input tensor

The default settings for these variables are set in the function batch_norm() of LayerBase. If you do not want to change them, you can leave them undefined here. With our default settings:

  • In training: use_sample=0, i.e. not using running average, using current batch mean/var.

  • Not in training (e.g. eval): use_sample=1, i.e. using running average, not using current batch mean/var.

  • The running average includes the statistics of the current batch.

  • The running average is also updated when not training.
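The default behavior listed above can be sketched in NumPy. This is a simplified illustration, not the code of batch_norm(): the function name `batch_norm_step`, the [batch, time, feature] layout, the linear interpolation by use_sample, the tracking of a running variance, and the momentum and epsilon values are all assumptions. The trainable scale and shift are omitted.

```python
import numpy as np


def batch_norm_step(x, running_mean, running_var, *, train,
                    use_sample=0.0, force_sample=False,
                    momentum=0.1, epsilon=1e-3):
    """One batch-norm call on x with assumed layout [batch, time, feature]."""
    reduce_axes = tuple(range(x.ndim - 1))  # everything except the feature axis
    batch_mean = x.mean(axis=reduce_axes)
    batch_var = x.var(axis=reduce_axes)

    # Update the running statistics first, so they include the current batch.
    # This happens in training and in eval alike.
    running_mean = running_mean + momentum * (batch_mean - running_mean)
    running_var = running_var + momentum * (batch_var - running_var)

    # In training (or with force_sample) use the configured factor, otherwise 1.
    factor = use_sample if (train or force_sample) else 1.0
    mean = (1.0 - factor) * batch_mean + factor * running_mean
    var = (1.0 - factor) * batch_var + factor * running_var

    y = (x - mean) / np.sqrt(var + epsilon)
    return y, running_mean, running_var


rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(8, 6, 4))
rm, rv = np.zeros(4), np.ones(4)

# Training: normalized with the statistics of the current batch.
y_train, rm, rv = batch_norm_step(x, rm, rv, train=True)
# Eval: normalized with the running statistics, which are still updated.
y_eval, rm, rv = batch_norm_step(x, rm, rv, train=False)
```

In this sketch the training output has zero mean per feature, while the eval output does not, because the running mean has only partly moved towards the batch mean after two updates.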

layer_class: Optional[str] = 'batch_norm'
kwargs: Optional[Dict[str]]
output_before_activation: Optional[OutputWithActivation]
output_loss: Optional[tf.Tensor]
rec_vars_outputs: Dict[str, tf.Tensor]
search_choices: Optional[SearchChoices]
params: Dict[str, tf.Variable]
saveable_param_replace: Dict[tf.Variable, Union['tensorflow.python.training.saver.BaseSaverBuilder.SaveableObject', None]]
stats: Dict[str, tf.Tensor]
input_data: Optional[Data]

Layer-Normalization Layer

class returnn.tf.layers.basic.LayerNormLayer(in_dim=None, out_dim=None, epsilon=1e-06, **kwargs)

Applies layer-normalization.

Note that we just normalize over the feature-dim axis here. This is consistent with the default behavior of tf.keras.layers.LayerNormalization and with how it is commonly used in many models, including the Transformer.

However, there are cases where it would be common to normalize over all axes except batch-dim, or all axes except batch and time. For a more generic variant, see NormLayer.

Parameters:
  • in_dim (Dim|None) – axis to normalize over. feature-dim by default

  • out_dim (Dim|None) – just the same as in_dim

  • epsilon (float) –
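A short configuration sketch of this layer next to its generic counterpart. The layer names `ln` and `ln_generic` and the source layer `"data"` are assumptions for illustration; the correspondence between the two entries follows the NormLayer description of axes="F".

```python
# Hypothetical network fragment; layer names and the "data" source are illustrative.
network = {
    # Layer normalization over the feature axis (the default in_dim).
    "ln": {"class": "layer_norm", "epsilon": 1e-6, "from": "data"},
    # Generic variant normalizing over the same axis.
    "ln_generic": {"class": "norm", "axes": "F", "epsilon": 1e-6, "from": "data"},
}
```

To normalize over more than the feature axis, only the generic "norm" entry offers a setting (axes), so that is the variant to extend.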

layer_class: Optional[str] = 'layer_norm'
classmethod get_out_data_from_opts(sources, name, **kwargs)
Parameters:
  • sources (list[LayerBase]) –

  • name (str) –

Return type:

Data

kwargs: Optional[Dict[str]]
output_before_activation: Optional[OutputWithActivation]
output_loss: Optional[tf.Tensor]
rec_vars_outputs: Dict[str, tf.Tensor]
search_choices: Optional[SearchChoices]
params: Dict[str, tf.Variable]
saveable_param_replace: Dict[tf.Variable, Union['tensorflow.python.training.saver.BaseSaverBuilder.SaveableObject', None]]
stats: Dict[str, tf.Tensor]
input_data: Optional[Data]