returnn.frontend.label_smoothing

Label smoothing

returnn.frontend.label_smoothing.label_smoothing(prob: Tensor, smoothing: Tensor | float, *, axis: Dim | None = None) → Tensor

Label smoothing, often used for cross entropy.

In case of sparse data, it will become dense (via smooth_one_hot()) and the target label will get probability (1 - smoothing).
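A minimal call sketch for the sparse case (a sketch only; it assumes targets is a sparse Tensor of class indices, and that label_smoothing() is available on the rf namespace like label_smoothed_log_prob_gradient() in the example further below):

import returnn.frontend as rf

# Sparse class indices become a dense smoothed distribution:
# the true label gets probability 1 - 0.1 = 0.9,
# the other classes share the remaining 0.1 uniformly.
smoothed_targets = rf.label_smoothing(targets, 0.1)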

returnn.frontend.label_smoothing.smooth_one_hot(source: Tensor, *, label_prob: Tensor | float) → Tensor

Smooth variant of one_hot(). Uses label_prob for the labels and (1 - label_prob) / (dim - 1) for the remaining values. This is used for label smoothing.
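As a reference, the same distribution written out in NumPy (only a sketch of the described behavior; the actual function operates on RETURNN Tensor/Dim objects):

import numpy as np

def smooth_one_hot_ref(labels: np.ndarray, dim: int, label_prob: float) -> np.ndarray:
    """label_prob at the label index, (1 - label_prob) / (dim - 1) everywhere else."""
    out = np.full(labels.shape + (dim,), (1.0 - label_prob) / (dim - 1))
    np.put_along_axis(out, labels[..., None], label_prob, axis=-1)
    return out

# E.g. 5 classes, label 2, label_prob 0.9 -> [0.025, 0.025, 0.9, 0.025, 0.025]
print(smooth_one_hot_ref(np.array([2]), dim=5, label_prob=0.9))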

returnn.frontend.label_smoothing.label_smoothed_log_prob_gradient(log_prob: Tensor, smoothing: Tensor | float, *, axis: Dim | None = None, exclude_labels: Sequence[int] | None = None) → Tensor

Parameters:
  • log_prob – shape […,D] (not necessarily the same as loss)

  • smoothing – smoothing factor, for label_smoothing()

  • axis – label axis in log_prob (D); uses feature_dim by default

  • exclude_labels – list of labels to exclude from smoothing (e.g. blank)

Returns:

log_prob, but the gradient is smoothed

Assume some cross-entropy-like loss:

loss = - sum_i target_prob[i] * log_prob[i] .

The sum is over the label indices i (corresponding to the axis argument). Then the gradient of loss w.r.t. log_prob[i] is:

grad_logprob[i] loss = -target_prob[i] .

We assume that the negative gradient is a probability distribution (potentially scaled by some factor, e.g. when the loss itself is scaled) and apply label_smoothing() on it. More specifically, we apply the same scale and shift as in the label_smoothing() function, via scaled_gradient_ext().

Note that this also holds for the CTC or RNNT loss: the negative gradient of the loss w.r.t. the log-probabilities is a probability distribution.
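Concretely, for the plain one-hot cross-entropy case, i.e. target_prob[i] = delta(i==c) for target class c and D labels, the smoothed gradient following the definitions above becomes:

grad_logprob[i] loss = -(1 - smoothing)          if i == c
                     = -smoothing / (D - 1)      otherwise

which is exactly the cross-entropy gradient against the smoothed target distribution from smooth_one_hot().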

Common usage example:

# E.g. there was some log_softmax, or anything to get log probs.
log_probs = model.get_log_probs(...)

# Now apply label smoothing on the log prob gradients.
log_probs = rf.label_smoothed_log_prob_gradient(log_probs, 0.1)

# E.g. CE, CTC, or similar, any kind of NLL should work.
loss = loss_func(log_probs, targets)
loss.sum().backward()
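For CTC-like losses, the blank label can be excluded from smoothing via exclude_labels (a sketch; blank_idx stands for whatever index your model uses for the blank label):

# Keep the blank label out of the smoothing.
log_probs = rf.label_smoothed_log_prob_gradient(log_probs, 0.1, exclude_labels=[blank_idx])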

Just as a side remark: assume:

log_prob = log_softmax(z) .

The gradient of log_softmax is:

grad_z[j] log_prob[i] = delta(i==j) - softmax(z)[j] .

Then the gradient w.r.t. z[j] is:

grad_z[j] loss = sum_i (grad_logprob[i] loss) (grad_z[j] logprob[i])
               = sum_i -target_prob[i] delta(i==j) + target_prob[i] softmax(z)[j]
               = -target_prob[j] + (sum_i target_prob[i]) softmax(z)[j]
               = softmax(z)[j] - target_prob[j]    # assuming (sum_i target_prob[i]) == 1
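A quick numeric check of this identity with PyTorch autograd (independent of RETURNN; target_prob here is just some arbitrary distribution summing to 1):

import torch

torch.manual_seed(0)
num_classes = 5
z = torch.randn(num_classes, requires_grad=True)
target_prob = torch.softmax(torch.randn(num_classes), dim=-1)  # sums to 1

log_prob = torch.log_softmax(z, dim=-1)
loss = -(target_prob * log_prob).sum()
loss.backward()

# grad_z loss == softmax(z) - target_prob
print(torch.allclose(z.grad, torch.softmax(z.detach(), dim=-1) - target_prob, atol=1e-6))  # True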