returnn.frontend.normalization

Normalization functions such as batch norm

returnn.frontend.normalization.moments(x: Tensor, axis: Dim | Sequence[Dim]) → Tuple[Tensor, Tensor][source]

Calculate the mean and variance of x over the given axis or axes.
Parameters:
  • x – input

  • axis – the axis to be reduced, to calculate statistics over

Returns:

mean and variance. Each has the same shape as the input, with the reduced axis (or axes) removed.
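
A minimal usage sketch; static dims are used so the example stands alone, and rf.random_normal is assumed here merely as a convenience for creating a test input:

import returnn.frontend as rf
from returnn.tensor import Dim

batch_dim = Dim(8, name="batch")
feat_dim = Dim(32, name="feature")
x = rf.random_normal([batch_dim, feat_dim])  # rf.random_normal assumed as test-input helper

# Reduce over the batch axis; mean and variance keep only the feature axis.
mean, variance = rf.moments(x, axis=batch_dim)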

class returnn.frontend.normalization.LayerNorm(in_dim: Dim | Sequence[Dim], *, eps: float = 1e-06)[source]

Layer normalization.

Note that we just normalize over the feature-dim axis here. This is consistent with the default behavior of tf.keras.layers.LayerNormalization and with how it is commonly used in many models, including the Transformer.

However, there are cases where it would be common to normalize over all axes except batch-dim, or all axes except batch and time. For a more generic variant, see norm().

By convention, any options to the module are passed to __init__, and potential changing inputs (other tensors) are passed to __call__().
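
Following that convention, a minimal sketch of construction and application (calling the module directly on the input tensor is an assumption based on the description above):

import returnn.frontend as rf
from returnn.tensor import Dim

batch_dim = Dim(8, name="batch")
model_dim = Dim(512, name="model")

layer_norm = rf.LayerNorm(model_dim)  # options go to __init__
x = rf.random_normal([batch_dim, model_dim])  # rf.random_normal assumed
y = layer_norm(x)  # input goes to __call__; normalized over model_dim only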

class returnn.frontend.normalization.BatchNorm(in_dim: Dim, *, affine: bool = True, momentum: float = 0.1, eps: float = 0.001, track_running_stats: bool = True, use_mask: bool | None = None)[source]

Batch normalization. https://arxiv.org/abs/1502.03167

Note that the default arguments differ from the corresponding batch norm in RETURNN. See here for a discussion of the defaults: https://github.com/rwth-i6/returnn/issues/522

We calculate statistics over all axes except the given in_dim. I.e. all other axes are reduced for the statistics.

To compensate for the normalization, there are the learnable parameters gamma and beta (optional, used when the option affine is True).

The usual behavior depends on whether this is used in training or evaluation, although this is often configurable in other frameworks. The usual behavior, in training:

# Using statistics from current batch.
mean_cur_batch, variance_cur_batch = moments(x, reduce_dims)
y = (x - mean_cur_batch) / sqrt(variance_cur_batch + epsilon)
y = gamma * y + beta

# Updating running statistics for later use.
mean = (1 - momentum) * mean + momentum * mean_cur_batch
variance = (1 - momentum) * variance + momentum * variance_cur_batch

The usual behavior in evaluation (i.e. not in training):

# Using collected statistics. Not using statistics from current batch.
y = (x - mean) / sqrt(variance + epsilon)
y = gamma * y + beta
Parameters:
  • in_dim – the feature dimension of the input

  • affine – whether to use learnable parameters gamma and beta

  • momentum – momentum for the running mean and variance

  • eps – epsilon for the variance

  • track_running_stats – If True, uses statistics of the current batch for normalization during training, and the tracked statistics (running mean and variance) during evaluation. If False, uses statistics of the current batch for normalization during both training and evaluation.

  • use_mask – whether to use a mask for dynamic spatial dims. This must be specified if the input has dynamic spatial dims. True would use the correct masking then. However, that is inconsistent with all other frameworks, which ignore the masking; it is also slower, and the fused op would not be used. False would be consistent with all other frameworks and potentially allows for the use of an efficient fused op internally.
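
A minimal usage sketch under the semantics described above; the toy input has no dynamic spatial dims, and use_mask=False is set explicitly to match the behavior of other frameworks:

import returnn.frontend as rf
from returnn.tensor import Dim

batch_dim = Dim(8, name="batch")
feat_dim = Dim(64, name="feature")

# No dynamic spatial dims in this toy input, so masking is not needed;
# use_mask=False matches the behavior of other frameworks.
batch_norm = rf.BatchNorm(feat_dim, use_mask=False)

x = rf.random_normal([batch_dim, feat_dim])  # rf.random_normal assumed
y = batch_norm(x)  # statistics over all axes except feat_dim

Whether the current-batch statistics or the tracked running statistics are used then follows the training/evaluation distinction described above.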

returnn.frontend.normalization.normalize(a: Tensor, *, axis: Dim | Sequence[Dim], epsilon: float = 1e-06) → Tensor[source]

Mean- and variance-normalize some input in the given input dimension(s), such that the resulting tensor has mean 0 and variance 1.

If you want the result to be shifted and scaled again by learnable parameters, you need additional parameters; cf. Normalize.

Parameters:
  • a – input

  • axis – axis over which the mean and variance are computed

  • epsilon – epsilon for numerical stability

Returns:

(a - mean) / sqrt(variance + epsilon)
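
A minimal usage sketch (rf.random_normal is again assumed only as a convenience for creating a test input):

import returnn.frontend as rf
from returnn.tensor import Dim

batch_dim = Dim(8, name="batch")
feat_dim = Dim(16, name="feature")

x = rf.random_normal([batch_dim, feat_dim])  # rf.random_normal assumed
y = rf.normalize(x, axis=feat_dim)  # mean 0 and variance 1 over feat_dim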

class returnn.frontend.normalization.Normalize(*, param_dims: Dim | Sequence[Dim], epsilon: float = 1e-06, scale: bool = True, bias: bool = True)[source]

normalize() with additional scale and bias

Parameters:
  • param_dims – shape of the scale and bias parameters

  • epsilon – epsilon for numerical stability

  • scale – whether to include a trainable scale

  • bias – whether to include a trainable bias
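
A minimal usage sketch; since axis is not an __init__ option here, it is assumed to be passed to __call__:

import returnn.frontend as rf
from returnn.tensor import Dim

batch_dim = Dim(8, name="batch")
feat_dim = Dim(16, name="feature")

norm = rf.Normalize(param_dims=feat_dim)  # learnable scale and bias over feat_dim
x = rf.random_normal([batch_dim, feat_dim])  # rf.random_normal assumed
y = norm(x, axis=feat_dim)  # axis assumed to be passed at call time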