# `returnn.frontend.label_smoothing`¶

Label smoothing

returnn.frontend.label_smoothing.label_smoothing(prob: Tensor, smoothing: Tensor | float, *, axis: Dim | None = None) [source]

Label smoothing, often used for cross entropy.

In case of sparse data, it will become dense (via `smooth_one_hot()`) and the target label will get probability (1 - smoothing).
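As a rough sketch of the sparse case described above (plain NumPy, purely illustrative, not the RETURNN implementation): the target label gets probability (1 - smoothing) and the remaining labels share the smoothing mass uniformly.

```python
import numpy as np

num_classes = 5     # size of the label axis (illustrative)
smoothing = 0.1
target = 2          # sparse (integer) target label

# Sparse case as described above: the target label gets probability (1 - smoothing),
# the remaining labels share the smoothing mass uniformly.
smoothed = np.full(num_classes, smoothing / (num_classes - 1))
smoothed[target] = 1.0 - smoothing
print(smoothed)        # [0.025 0.025 0.9   0.025 0.025]
print(smoothed.sum())  # 1.0
```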

returnn.frontend.label_smoothing.smooth_one_hot(source: Tensor, *, label_prob: Tensor | float) [source]

Smooth variant of `one_hot()`. Uses `label_prob` for the labels and `(1 - label_prob) / (dim - 1)` for the remaining values. This is used for label smoothing.
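A minimal NumPy sketch of that behavior (again illustrative only; the function name `smooth_one_hot_np` and the batched shape are assumptions for the example, not part of the RETURNN API):

```python
import numpy as np

def smooth_one_hot_np(labels: np.ndarray, dim: int, label_prob: float) -> np.ndarray:
    # Sketch of the behavior described above: the given label gets label_prob,
    # every other class gets (1 - label_prob) / (dim - 1).
    out = np.full(labels.shape + (dim,), (1.0 - label_prob) / (dim - 1))
    out[np.arange(len(labels)), labels] = label_prob
    return out

print(smooth_one_hot_np(np.array([0, 3]), dim=4, label_prob=0.9))
# [[0.9        0.03333333 0.03333333 0.03333333]
#  [0.03333333 0.03333333 0.03333333 0.9       ]]
```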

returnn.frontend.label_smoothing.label_smoothed_log_prob_gradient(log_prob: Tensor, smoothing: Tensor | float, *, axis: Dim | None = None, exclude_labels: Sequence[int] | None = None) [source]
Parameters:
• log_prob – shape […,D] (not necessarily the same as loss)

• smoothing – smoothing factor, for `label_smoothing()`

• axis – label axis; uses `feature_dim` by default

• exclude_labels – list of labels to exclude from smoothing (e.g. blank)

Assume some cross-entropy-like loss:

loss = - sum_i target_prob[i] * log_prob[i] .

The sum is over the label indices i (corresponding to the `axis` argument). Then the gradient of the loss w.r.t. log_prob[i] is:

grad_{log_prob[i]} loss = - target_prob[i] .

We assume that the negative gradient is a probability distribution, and apply `label_smoothing()` on it. More specifically, we apply the same scale and shift as in the `label_smoothing()` function via `scaled_gradient()`.
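To see why applying that scale and shift to the gradient is equivalent to smoothing the targets, here is a small NumPy sketch (illustrative values only; it uses the uniform-interpolation form of label smoothing for the dense case, which is an assumption since the negative gradient here is a dense distribution):

```python
import numpy as np

D, s = 4, 0.1                                  # label dim and smoothing factor (illustrative)
target_prob = np.array([0.0, 1.0, 0.0, 0.0])   # one-hot target over the label axis

# Gradient of   loss = -sum_i target_prob[i] * log_prob[i]   w.r.t. log_prob:
grad = -target_prob

# Smoothing the targets first (interpolation with the uniform distribution)
# and then differentiating ...
smoothed_target = (1.0 - s) * target_prob + s / D
grad_from_smoothed_targets = -smoothed_target

# ... gives the same gradient as scaling and shifting the original gradient,
# which is what the scaled-gradient trick achieves without materializing dense targets:
grad_scaled_and_shifted = (1.0 - s) * grad - s / D

print(np.allclose(grad_from_smoothed_targets, grad_scaled_and_shifted))  # True
```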

Just as a side remark: assume

log_prob = log_softmax(z) .

Then the gradient of the loss w.r.t. the logits z works out to

grad_z loss = softmax(z) - target_prob ,

assuming sum_i target_prob[i] = 1, i.e. the (smoothed) target probabilities enter the gradient w.r.t. the logits directly.