Label smoothing

returnn.frontend.label_smoothing.label_smoothing(prob: Tensor, smoothing: Tensor | float, *, axis: Dim | None = None) → Tensor

Label smoothing, often used for cross entropy.

In case of sparse data, it will become dense (via smooth_one_hot()) and the target label will get probability (1 - smoothing).
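For dense input, a scale and shift that agrees with the sparse case described above can be sketched in NumPy. This is only an illustration and not the RETURNN implementation: the helper name label_smoothing_dense and the exact scale and shift are assumptions, chosen so that a one-hot input ends up with (1 - smoothing) on the target label and the remaining mass spread evenly over the other labels.

```python
import numpy as np


def label_smoothing_dense(prob: np.ndarray, smoothing: float, axis: int = -1) -> np.ndarray:
    """Hypothetical NumPy stand-in for the dense case (not the RETURNN code)."""
    dim = prob.shape[axis]
    floor = smoothing / (dim - 1)  # probability mass added to every label
    scale = 1.0 - dim * floor      # keeps the total mass at 1 after adding the floor
    return prob * scale + floor


one_hot = np.array([0.0, 0.0, 1.0, 0.0])
smoothed = label_smoothing_dense(one_hot, smoothing=0.1)
print(smoothed)  # target entry is 1 - smoothing, the others share `smoothing` equally
```

Because the map is affine and preserves the sum over the label axis, it can also be applied to a target distribution that is already soft.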

returnn.frontend.label_smoothing.smooth_one_hot(source: Tensor, *, label_prob: Tensor | float) → Tensor

Smooth variant of one_hot(). Uses label_prob for the labels and (1 - label_prob) / (dim - 1) for the remaining values. This is used for label smoothing.
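The formula above can be illustrated with a small NumPy sketch. The helper smooth_one_hot_np and its explicit dim argument are hypothetical and exist only for this example; the RETURNN function takes a sparse Tensor and reads the dimension from it.

```python
import numpy as np


def smooth_one_hot_np(labels: np.ndarray, dim: int, label_prob: float) -> np.ndarray:
    """Hypothetical NumPy stand-in: integer labels [...] -> dense array [..., dim]."""
    off_value = (1.0 - label_prob) / (dim - 1)  # value for all non-target labels
    out = np.full(labels.shape + (dim,), off_value)
    # Write label_prob at the target index of each row.
    np.put_along_axis(out, labels[..., None], label_prob, axis=-1)
    return out


labels = np.array([2, 0])
dense = smooth_one_hot_np(labels, dim=4, label_prob=0.9)
print(dense)  # each row sums to 1, with 0.9 at the label position
```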

returnn.frontend.label_smoothing.label_smoothed_log_prob_gradient(log_prob: Tensor, smoothing: Tensor | float, *, axis: Dim | None = None, exclude_labels: Sequence[int] | None = None) → Tensor
  • log_prob – shape […,D] (not necessarily the same as loss)

  • smoothing – smoothing factor, for label_smoothing()

  • axis – label axis; defaults to feature_dim

  • exclude_labels – list of labels to exclude from smoothing (e.g. blank)

Assume some cross-entropy-like loss:

loss = - sum_i target_prob[i] * log_prob[i] .

The sum is over the label indices i (corresponding to the axis argument). Then the gradient of loss w.r.t. log_prob[i] is:

grad_logprob[i] loss = -target_prob[i] .

We assume that the negative gradient is a probability distribution, and apply label_smoothing() to it. More specifically, scaled_gradient() applies to the gradient the same scale and shift that label_smoothing() uses.
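The effect on the gradient can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the RETURNN code: smooth_log_prob_grad is a hypothetical helper, the scale and shift are the ones that map a one-hot target to (1 - smoothing) on the label, and exclude_labels is not handled.

```python
import numpy as np


def smooth_log_prob_grad(grad: np.ndarray, smoothing: float, axis: int = -1) -> np.ndarray:
    """Hypothetical helper: smooth the gradient of the loss w.r.t. log_prob."""
    dim = grad.shape[axis]
    floor = smoothing / (dim - 1)
    scale = 1.0 - dim * floor
    # -grad is treated as a distribution: smoothed = (-grad) * scale + floor.
    # Negating again gives the smoothed gradient.
    return grad * scale - floor


target_prob = np.array([0.0, 1.0, 0.0, 0.0])
grad = -target_prob  # gradient of the cross-entropy-like loss w.r.t. log_prob
smoothed_grad = smooth_log_prob_grad(grad, smoothing=0.1)
print(-smoothed_grad)  # the implied smoothed target distribution
```

The forward value of log_prob is left unchanged in this picture; only the gradient flowing back through it is modified, so the loss itself does not need to know about the smoothing.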

Just as a side remark: assume

log_prob = log_softmax(z) .

The gradient of log_softmax is:

grad_z[j] log_prob[i] = delta(i==j) - softmax(z)[j] .

Then the gradient w.r.t. z[j] is:

grad_z[j] loss = sum_i (grad_logprob[i] loss) (grad_z[j] log_prob[i])
               = sum_i (-target_prob[i] delta(i==j) + target_prob[i] softmax(z)[j])
               = -target_prob[j] + (sum_i target_prob[i]) softmax(z)[j]
               = softmax(z)[j] - target_prob[j]    # assuming (sum_i target_prob[i]) == 1
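The final identity can be checked numerically. The following NumPy sketch compares softmax(z) - target_prob against central finite differences of the loss; the helper names, the random inputs, and the step size eps are arbitrary choices made for this example.

```python
import numpy as np


def log_softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max()  # shift for numerical stability, does not change the result
    return z - np.log(np.exp(z).sum())


def loss_fn(z: np.ndarray, target_prob: np.ndarray) -> float:
    # loss = - sum_i target_prob[i] * log_prob[i]
    return -(target_prob * log_softmax(z)).sum()


rng = np.random.default_rng(0)
z = rng.normal(size=5)
target_prob = rng.random(5)
target_prob /= target_prob.sum()  # normalize so that sum_i target_prob[i] == 1

analytic = np.exp(log_softmax(z)) - target_prob  # softmax(z) - target_prob

eps = 1e-6
numeric = np.zeros_like(z)
for j in range(z.size):
    step = np.zeros_like(z)
    step[j] = eps
    numeric[j] = (loss_fn(z + step, target_prob) - loss_fn(z - step, target_prob)) / (2 * eps)

max_abs_diff = np.abs(analytic - numeric).max()
print(max_abs_diff)  # should be close to zero
```

If target_prob is not normalized, the factor (sum_i target_prob[i]) in front of softmax(z)[j] no longer drops out, and the comparison has to use the unsimplified expression instead.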