Normalization functions, such as batch norm.

returnn.frontend.normalization.moments(x: Tensor, axis: Dim | Sequence[Dim], *, use_mask: bool = True, correction: int | float | Tensor = 0) → Tuple[Tensor, Tensor][source]
  • x – input

  • axis – the axis (or axes) to be reduced, to calculate statistics over

  • use_mask – whether to use a mask for dynamic spatial dims in the reduction

  • correction

    The variance is estimated as sum((x - mean)**2) / (n - correction), where n is the number of elements in the axis (or axes); with use_mask=True, masking is taken into account via num_elements_of_shape(). The default correction=0 returns the biased variance estimate. correction=1 applies the Bessel correction and returns the unbiased variance estimate. PyTorch previously had an argument unbiased for this, which was recently changed to correction (PyTorch issue #61492).

    In PyTorch, the default is correction=1, which is the unbiased variance estimation, while in most other frameworks, the default is correction=0, which is the biased variance estimation.


Returns a tuple (mean, variance). Both have the same shape as the input with the given axis (or axes) removed.
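As a rough illustration of the correction semantics (using plain NumPy rather than the RETURNN Tensor API; the numbers are made up):

```python
import numpy as np

# Illustration of the correction semantics with NumPy, not the RETURNN API.
x = np.array([1.0, 2.0, 4.0, 7.0])
n = x.size
mean = x.mean()

# correction=0: biased estimate, sum((x - mean)**2) / n
var_biased = np.sum((x - mean) ** 2) / (n - 0)
# correction=1: Bessel correction, unbiased estimate
var_unbiased = np.sum((x - mean) ** 2) / (n - 1)

# These match NumPy's ddof argument, which plays the same role.
assert np.isclose(var_biased, np.var(x, ddof=0))
assert np.isclose(var_unbiased, np.var(x, ddof=1))
```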

class returnn.frontend.normalization.LayerNorm(in_dim: Dim | Sequence[Dim], *, eps: float = 1e-06)[source]

Layer normalization.

Note that we just normalize over the feature-dim axis here. This is consistent with the default behavior of tf.keras.layers.LayerNormalization and with how it is commonly used in many models, including the Transformer.

However, there are cases where it would be common to normalize over all axes except batch-dim, or all axes except batch and time. For a more generic variant, see norm().

By convention, any options to the module are passed to __init__, and potential changing inputs (other tensors) are passed to __call__().
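A minimal NumPy sketch of the computation described above, normalizing over the feature axis only (assuming a [batch, time, feature] layout; the function and parameter names here are illustrative, not the actual RETURNN API):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    # Normalize over the last (feature) axis only.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)  # biased variance (correction=0)
    y = (x - mean) / np.sqrt(var + eps)
    return gamma * y + beta

x = np.random.randn(2, 3, 8)  # [batch, time, feature]
y = layer_norm(x, gamma=np.ones(8), beta=np.zeros(8))
# Each position now has ~zero mean and ~unit variance over the feature axis.
assert np.allclose(y.mean(axis=-1), 0.0, atol=1e-5)
```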

class returnn.frontend.normalization.BatchNorm(in_dim: Dim, *, affine: bool = True, momentum: float = 0.1, eps: float = 0.001, track_running_stats: bool = True, use_mask: bool | None = None)[source]

Batch normalization.

Note that the default arguments differ from the corresponding batch norm layer in RETURNN.

We calculate statistics over all axes except the given in_dim. I.e. all other axes are reduced for the statistics.

To compensate for the normalization, there are learnable parameters gamma and beta (optional, used when the option affine is True).

The usual behavior depends on whether this is used in training or evaluation, although this is often configurable in other frameworks. The usual behavior in training:

# Using statistics from current batch.
mean_cur_batch, variance_cur_batch = moments(x, reduce_dims)
y = (x - mean_cur_batch) / sqrt(variance_cur_batch + epsilon)
y = gamma * y + beta

# Updating running statistics for later use.
mean = (1 - momentum) * mean + momentum * mean_cur_batch
variance = (1 - momentum) * variance + momentum * variance_cur_batch

The usual behavior in evaluation (i.e. not in training):

# Using collected statistics. Not using statistics from current batch.
y = (x - mean) / sqrt(variance + epsilon)
y = gamma * y + beta
  • in_dim – the feature dimension of the input

  • affine – whether to use learnable parameters gamma and beta

  • momentum – momentum for the running mean and variance

  • eps – epsilon for the variance

  • track_running_stats – If True, uses statistics of the current batch for normalization during training, and the tracked statistics (running mean and variance) during evaluation. If False, uses statistics of the current batch for normalization during both training and evaluation.

  • use_mask – whether to use a mask for dynamic spatial dims. This must be specified if the input has dynamic spatial dims. True uses the correct masking in the statistics; however, that is inconsistent with all other frameworks (which ignore the masking), is slower, and prevents the use of the fused op. False is consistent with all other frameworks and potentially allows the use of an efficient fused op internally.
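The training-time behavior sketched above, including the running-statistics update, can be written out in plain NumPy (illustrative only: no masking, and the names here are not the actual RETURNN parameter names):

```python
import numpy as np

def batch_norm_train(x, gamma, beta, running_mean, running_var,
                     momentum=0.1, eps=1e-3):
    # Reduce over all axes except the feature (last) axis.
    reduce_axes = tuple(range(x.ndim - 1))
    mean_cur = x.mean(axis=reduce_axes)
    var_cur = x.var(axis=reduce_axes)

    # Normalize with the statistics of the current batch.
    y = (x - mean_cur) / np.sqrt(var_cur + eps)
    y = gamma * y + beta

    # Update the running statistics for later use in evaluation.
    running_mean = (1 - momentum) * running_mean + momentum * mean_cur
    running_var = (1 - momentum) * running_var + momentum * var_cur
    return y, running_mean, running_var

x = np.random.randn(4, 5, 8)  # [batch, time, feature]
y, rm, rv = batch_norm_train(x, np.ones(8), np.zeros(8),
                             running_mean=np.zeros(8), running_var=np.ones(8))
assert y.shape == x.shape
```

In evaluation, one would instead normalize with the stored running_mean and running_var, without updating them.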

returnn.frontend.normalization.normalize(a: Tensor, *, axis: Dim | Sequence[Dim], epsilon: float = 1e-06) → Tensor[source]

Mean- and variance-normalize some input in the given input dimension(s), such that the resulting tensor has mean 0 and variance 1.

If you want the result to be shifted and scaled again, you need additional parameters; cf. Normalize.

  • a – input

  • axis – axis over which the mean and variance are computed

  • epsilon – epsilon for numerical stability


Returns (a - mean) / sqrt(variance + epsilon).
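The returned expression can be sketched in NumPy (illustrative only; the real function operates on RETURNN Tensors and Dims, not arrays and integer axes):

```python
import numpy as np

def normalize(a, axis, epsilon=1e-6):
    # (a - mean) / sqrt(variance + epsilon) over the given axis.
    mean = a.mean(axis=axis, keepdims=True)
    variance = a.var(axis=axis, keepdims=True)
    return (a - mean) / np.sqrt(variance + epsilon)

a = np.array([[1.0, 2.0, 3.0],
              [4.0, 8.0, 12.0]])
out = normalize(a, axis=1)
# Each row now has mean 0 and variance ~1 (up to epsilon).
assert np.allclose(out.mean(axis=1), 0.0, atol=1e-6)
```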

class returnn.frontend.normalization.Normalize(*, param_dims: Dim | Sequence[Dim], epsilon: float = 1e-06, scale: bool = True, bias: bool = True)[source]

normalize() with additional scale and bias

  • param_dims – shape of the scale and bias parameters

  • epsilon – epsilon for numerical stability

  • scale – whether to include a trainable scale

  • bias – whether to include a trainable bias