Normalization Layers
Generic Normalization Layer

class returnn.tf.layers.basic.NormLayer(axes, param_shape='F', scale=True, bias=True, epsilon=1e-06, **kwargs)
Normalize over specified axes, e.g. time and/or feature axis.
Note: For calculating a norm, see MathNormLayer instead.

In case of just feature (axes="F"), this corresponds to layer normalization (see LayerNormLayer). In case of time and feature (axes="TF") for a 3D input, or more generally all axes except batch (axes="except_batch"), this corresponds to group normalization with G=1, or non-standard layer normalization. (The definition of layer-normalization is not clear on which axes should be normalized over. In many other frameworks, the default axis is just the last axis, which is usually the feature axis. However, in certain implementations and models, it is also common to normalize over all axes except batch.)

The statistics are calculated just on the input. There are no running statistics (in contrast to batch normalization, see BatchNormLayer).

For some discussion on the definition of layer-norm vs group-norm, also see here and here.
Parameters:
 axes (str|list[str]) – axes over which the mean and variance are computed, e.g. "F" or "TF"
 param_shape (str|list[str]|tuple[str]|int|list[int]|tuple[int]) – shape of the scale and bias parameters. You can also refer to (static) axes of the input, such as the feature-dim. This is also the default, i.e. a param-shape of [F], independent of the axes to normalize over.
 scale (bool) – add trainable scale parameters
 bias (bool) – add trainable bias parameters
 epsilon (float) – epsilon for numerical stability
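
The difference between axes="F" and axes="TF" can be illustrated with a small NumPy sketch. This is not the RETURNN implementation: the helper norm_over_axes is hypothetical, the input layout [batch, time, feature] is an assumption, and the formula (x - mean) / sqrt(var + epsilon) is the usual normalization, assumed here to match what the layer computes. The scale and bias have shape [F] in both cases, mirroring the default param_shape='F'.

```python
import numpy as np


def norm_over_axes(x, axes, scale=None, bias=None, epsilon=1e-6):
    """Hypothetical helper: normalize x to zero mean and unit variance over `axes`."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    y = (x - mean) / np.sqrt(var + epsilon)
    if scale is not None:
        y = y * scale  # broadcasts over the last (feature) axis
    if bias is not None:
        y = y + bias
    return y


rng = np.random.default_rng(0)
# Assumed layout: [batch=2, time=5, feature=4].
x = rng.normal(loc=3.0, scale=2.0, size=(2, 5, 4))
gamma = np.ones(4)  # scale parameter of shape [F]
beta = np.zeros(4)  # bias parameter of shape [F]

# Like axes="F": statistics per (batch, time) position, i.e. layer normalization.
y_f = norm_over_axes(x, axes=(2,), scale=gamma, bias=beta)

# Like axes="TF": statistics per batch entry over time and feature (group norm with G=1).
y_tf = norm_over_axes(x, axes=(1, 2), scale=gamma, bias=beta)
```

In both cases the statistics come from the current input only, so the result for one batch entry does not depend on the other entries in the batch.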
Batch-Normalization Layer

class returnn.tf.layers.basic.BatchNormLayer(use_shift=NotSpecified, use_std=NotSpecified, use_sample=NotSpecified, force_sample=NotSpecified, momentum=NotSpecified, epsilon=NotSpecified, update_sample_only_in_training=NotSpecified, delay_sample_update=NotSpecified, param_version=NotSpecified, gamma_init=NotSpecified, beta_init=NotSpecified, masked_time=NotSpecified, **kwargs)
Implements batch-normalization (http://arxiv.org/abs/1502.03167) as a separate layer. NotSpecified refers to returnn.util.basic.NotSpecified.
Also see NormLayer.

Parameters:
 use_shift (bool) –
 use_std (bool) –
 use_sample (float) – defaults to 0.0 which is used in training
 force_sample (bool) – even in eval, use the use_sample factor
 momentum (float) – for the running average of sample_mean and sample_std
 update_sample_only_in_training (bool) –
 delay_sample_update (bool) –
 param_version (int) – 0 or 1
 epsilon (float) –
 gamma_init (str|float) – see TFUtil.get_initializer(), for the scale
 beta_init (str|float) – see TFUtil.get_initializer(), for the mean
 masked_time (bool) – flatten and mask input tensor
The default settings for these variables are set in the function batch_norm of LayerBase. If you do not want to change them, you can leave them undefined here. With our default settings:
 In training: use_sample=0, i.e. not using running average, using current batch mean/var.
 Not in training (e.g. eval): use_sample=1, i.e. using running average, not using current batch mean/var.
 The running average includes the statistics of the current batch.
 The running average is also updated when not training.
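
The behaviour described in the list above can be sketched with plain scalars. This is a hedged illustration, not the RETURNN code: the function names are hypothetical, the linear interpolation between batch statistics and running statistics via use_sample is inferred from the parameter descriptions, and the update rule (move the running value towards the batch value by a fraction momentum) is an assumed convention.

```python
def mean_used_for_normalization(batch_mean, running_mean, train,
                                use_sample=0.0, force_sample=False):
    """Hypothetical helper: pick the mean that normalizes the current batch.

    factor=0 means only the current batch statistics are used,
    factor=1 means only the running average is used.
    """
    if train or force_sample:
        factor = use_sample  # in eval, only applied when force_sample is set
    else:
        factor = 1.0  # eval default: running average only
    return (1.0 - factor) * batch_mean + factor * running_mean


def update_running_average(running, batch_value, momentum):
    """Assumed update rule: move `running` towards `batch_value` by `momentum`."""
    return running + momentum * (batch_value - running)


train_mean = mean_used_for_normalization(2.0, 10.0, train=True)
eval_mean = mean_used_for_normalization(2.0, 10.0, train=False)
forced_mean = mean_used_for_normalization(
    2.0, 10.0, train=False, use_sample=0.25, force_sample=True)
new_running = update_running_average(10.0, 2.0, momentum=0.5)
```

With the defaults, training normalizes with the batch mean and evaluation with the running mean. The value momentum=0.5 is chosen only to make the arithmetic easy to follow and is not the library default.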
Layer-Normalization Layer

class returnn.tf.layers.basic.LayerNormLayer(epsilon=1e-06, **kwargs)
Applies layer-normalization.
Note that we just normalize over the feature-dim axis here. This is consistent with the default behavior of tf.keras.layers.LayerNormalization and also with how it is commonly used in many models, including the Transformer.

However, there are cases where it would be common to normalize over all axes except batch-dim, or all axes except batch and time. For a more generic variant, see NormLayer.

Parameters:
 epsilon (float) –
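
As a rough illustration of how the specialized and the generic variant relate, a network config fragment might look as follows. The layer class names "layer_norm" and "norm", the "from" key and the source name "data" are assumptions about the usual RETURNN config conventions and are not stated in this section. The axes values are taken from the NormLayer description.

```python
# Hypothetical RETURNN-style network dict; names marked below are assumptions.
network = {
    # LayerNormLayer: normalizes over the feature-dim axis only.
    "ln": {"class": "layer_norm", "from": "data"},  # class name assumed
    # NormLayer with axes="F": described above as corresponding to layer normalization.
    "ln_generic": {"class": "norm", "axes": "F", "from": "data"},  # class name assumed
    # NormLayer over all axes except batch: the more generic variant.
    "norm_all": {"class": "norm", "axes": "except_batch", "from": "data"},
}
```

The first two entries are meant to express the same normalization. The third covers the case where statistics are taken over all axes except batch.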