Normalization Layers

Generic Normalization Layer

class returnn.tf.layers.basic.NormLayer(axes, param_shape='F', scale=True, bias=True, epsilon=1e-06, **kwargs)[source]

Normalize over specified axes, e.g. time and/or feature axis.

Note: For calculating a norm, see MathNormLayer instead.

In case of just feature (axes="F"), this corresponds to layer normalization (see LayerNormLayer). In case of time and feature (axes="TF") for a 3D input, or more general all except batch (axes="except_batch"), this corresponds to group normalization with G=1, or non-standard layer normalization. (The definition of layer-normalization is not clear on what axes should be normalized over. In many other frameworks, the default axis is just the last axis, which is usually the feature axis. However, in certain implementations and models, it is also common to normalize over all axes except batch.)

The statistics are calculated just on the input. There are no running statistics (in contrast to batch normalization, see BatchNormLayer).

For some discussion on the definition of layer-norm vs group-norm, see the references linked in the online documentation.

Parameters:
  • axes (str|list[str]) – axes over which the mean and variance are computed, e.g. “F” or “TF”
  • param_shape (str|list[str]|tuple[str]|int|list[int]|tuple[int]) – shape of the scale and bias parameters. You can also refer to (static) axes of the input, such as the feature-dim. This is also the default, i.e. a param-shape of [F], independent of the axes to normalize over.
  • scale (bool) – add trainable scale parameters
  • bias (bool) – add trainable bias parameters
  • epsilon (float) – epsilon for numerical stability
layer_class = 'norm'[source]
classmethod get_out_data_from_opts(sources, name, **kwargs)[source]
Parameters:
  • sources (list[LayerBase]) –
  • name (str) –
Return type: Data
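As an illustration, the axes variants described above could appear in a RETURNN network config dict like this (a minimal sketch; the layer names and the "data" source are assumptions for the example):

```python
# Sketch of "norm" layer configurations (layer names are illustrative).
network = {
    # axes="F": normalize over the feature axis only,
    # i.e. layer normalization (cf. LayerNormLayer).
    "ln": {"class": "norm", "axes": "F", "from": "data"},
    # axes="TF" on a 3D input: normalize over time and feature,
    # i.e. group normalization with G=1 / non-standard layer norm.
    "gn": {"class": "norm", "axes": "TF", "from": "data"},
    # axes="except_batch": normalize over all axes except batch.
    "nb": {"class": "norm", "axes": "except_batch", "from": "data"},
}
```

In all three cases the scale and bias parameters still have shape [F] by default, independent of the axes normalized over.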

Batch-Normalization Layer

class returnn.tf.layers.basic.BatchNormLayer(use_shift=<class 'returnn.util.basic.NotSpecified'>, use_std=<class 'returnn.util.basic.NotSpecified'>, use_sample=<class 'returnn.util.basic.NotSpecified'>, force_sample=<class 'returnn.util.basic.NotSpecified'>, momentum=<class 'returnn.util.basic.NotSpecified'>, epsilon=<class 'returnn.util.basic.NotSpecified'>, update_sample_only_in_training=<class 'returnn.util.basic.NotSpecified'>, delay_sample_update=<class 'returnn.util.basic.NotSpecified'>, param_version=<class 'returnn.util.basic.NotSpecified'>, gamma_init=<class 'returnn.util.basic.NotSpecified'>, beta_init=<class 'returnn.util.basic.NotSpecified'>, masked_time=<class 'returnn.util.basic.NotSpecified'>, **kwargs)[source]

Implements batch-normalization (http://arxiv.org/abs/1502.03167) as a separate layer.

Also see NormLayer.

Parameters:
  • use_shift (bool) –
  • use_std (bool) –
  • use_sample (float) – fraction of the running-average statistics to use instead of the current batch statistics (0.0 means pure batch statistics); defaults to 0.0, which is used in training
  • force_sample (bool) – even in eval, use the use_sample factor
  • momentum (float) – for the running average of sample_mean and sample_std
  • update_sample_only_in_training (bool) –
  • delay_sample_update (bool) –
  • param_version (int) – 0 or 1
  • epsilon (float) –
  • gamma_init (str|float) – see TFUtil.get_initializer(), for the scale
  • beta_init (str|float) – see TFUtil.get_initializer(), for the mean
  • masked_time (bool) – flatten and mask input tensor

The default settings for these parameters are defined in the batch_norm function of LayerBase. If you do not want to change them, you can leave them undefined here. With our default settings:

  • In training: use_sample=0, i.e. not using running average, using current batch mean/var.
  • Not in training (e.g. eval): use_sample=1, i.e. using running average, not using current batch mean/var.
  • The running average includes the statistics of the current batch.
  • The running average is also updated when not training.
layer_class = 'batch_norm'[source]
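The statistics logic described in the list above can be sketched in plain Python (an illustrative sketch of the semantics, not RETURNN code; the function names are assumptions):

```python
def bn_mean(batch_mean, running_mean, use_sample):
    # use_sample interpolates between the current batch statistics and
    # the running average:
    #   use_sample=0.0 -> pure batch mean (training default),
    #   use_sample=1.0 -> pure running average (eval default).
    return (1.0 - use_sample) * batch_mean + use_sample * running_mean

def update_running(running_mean, batch_mean, momentum):
    # The running average is moved toward the current batch statistics
    # by the momentum factor (assumed update rule for this sketch).
    return running_mean + momentum * (batch_mean - running_mean)
```

The same interpolation applies to the variance. Note that, per the defaults above, update_running is also applied when not training.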

Layer-Normalization Layer

class returnn.tf.layers.basic.LayerNormLayer(epsilon=1e-06, **kwargs)[source]

Applies layer-normalization.

Note that we normalize over the feature-dim axis only here. This is consistent with the default behavior of tf.keras.layers.LayerNormalization and with how layer normalization is commonly used in many models, including the Transformer.

However, there are cases where it would be common to normalize over all axes except batch-dim, or all axes except batch and time. For a more generic variant, see NormLayer.

Parameters: epsilon (float) – epsilon for numerical stability
layer_class = 'layer_norm'[source]
classmethod get_out_data_from_opts(sources, name, **kwargs)[source]
Parameters:
  • sources (list[LayerBase]) –
  • name (str) –
Return type: Data
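The computation performed per feature vector can be sketched in plain Python (for illustration only; RETURNN operates on batched tensors, and gamma/beta here stand in for the trained scale and bias parameters):

```python
import math

def layer_norm(x, gamma, beta, epsilon=1e-6):
    # Normalize a single feature vector over its feature axis, then
    # apply the per-feature scale (gamma) and bias (beta).
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    return [g * (v - mean) / math.sqrt(var + epsilon) + b
            for v, g, b in zip(x, gamma, beta)]
```

For a [B, T, F] input, this normalization is applied independently at every batch and time position, which is exactly what distinguishes it from batch normalization: no statistics are shared across the batch, and there are no running averages.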