returnn.frontend.label_smoothing
Label smoothing
- returnn.frontend.label_smoothing.label_smoothing(prob: Tensor, smoothing: Tensor | float, *, axis: Dim | None = None) → Tensor [source]
Label smoothing, often used for cross entropy.
In case of sparse data, it will become dense (via smooth_one_hot()) and the target label will get probability (1 - smoothing).
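For example, as a plain-NumPy illustration (not the RETURNN API itself), assuming the smoothing mass is spread uniformly over the remaining dim - 1 classes as in smooth_one_hot() below:

```python
# Plain-NumPy sketch: sparse label 2, dim 5, smoothing 0.1.
import numpy as np

dim, smoothing, label = 5, 0.1, 2
target = np.full(dim, smoothing / (dim - 1))  # smoothing mass spread over the other classes
target[label] = 1.0 - smoothing               # the target label gets probability 1 - smoothing
print(target)  # [0.025 0.025 0.9 0.025 0.025], sums to 1.0
```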
- returnn.frontend.label_smoothing.smooth_one_hot(source: Tensor, *, label_prob: Tensor | float) → Tensor [source]
Smooth variant of one_hot(). Uses label_prob for the labels and (1 - label_prob) / (dim - 1) for the remaining values. This is used for label smoothing.
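A minimal NumPy sketch of this formula (purely illustrative, not the RETURNN implementation); label_prob = 1.0 recovers a plain one-hot vector, and the sparse case of label_smoothing() above corresponds to label_prob = 1 - smoothing:

```python
import numpy as np

def smooth_one_hot_sketch(labels: np.ndarray, dim: int, label_prob: float) -> np.ndarray:
    """labels: int array of shape [...]; returns probs of shape [..., dim]."""
    rest = (1.0 - label_prob) / (dim - 1)    # value for all non-label positions
    out = np.full(labels.shape + (dim,), rest)
    np.put_along_axis(out, labels[..., None], label_prob, axis=-1)
    return out

print(smooth_one_hot_sketch(np.array([1]), dim=4, label_prob=1.0))   # plain one-hot
print(smooth_one_hot_sketch(np.array([1]), dim=4, label_prob=0.85))  # [0.05 0.85 0.05 0.05]
```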
- returnn.frontend.label_smoothing.label_smoothed_log_prob_gradient(log_prob: Tensor, smoothing: Tensor | float, *, axis: Dim | None = None, exclude_labels: Sequence[int] | None = None) → Tensor [source]
- Parameters:
  - log_prob – shape […,D] (not necessarily the same as loss)
  - smoothing – smoothing factor, for label_smoothing()
  - axis – label axis. Uses feature_dim by default.
  - exclude_labels – list of labels to exclude from smoothing (e.g. blank)
Assume some cross-entropy-like loss:

    loss = - sum_i target_prob[i] * log_prob[i] .

The sum is over the label indices i (corresponding to the axis argument). Then the gradient of the loss w.r.t. log_prob[i] is:

    grad_logprob[i] loss = -target_prob[i] .
We assume that the negative gradient is a probability distribution, and apply label_smoothing() on it. More specifically, we apply the same scale and shift as in the label_smoothing() function via scaled_gradient().
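As a rough illustration of this scale-and-shift-the-gradient mechanism, here is a minimal PyTorch sketch. The class name and the concrete constants are assumptions chosen to match the sparse label_smoothing() case described above; this is not RETURNN's scaled_gradient() and ignores exclude_labels:

```python
import torch


class _ScaleAndShiftGrad(torch.autograd.Function):
    """Identity in the forward pass; scales and shifts only the incoming gradient."""

    @staticmethod
    def forward(ctx, x: torch.Tensor, scale: float, shift: float) -> torch.Tensor:
        ctx.scale, ctx.shift = scale, shift
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out: torch.Tensor):
        # In the cross-entropy setting above, grad_out w.r.t. log_prob is -target_prob.
        # With e.g. scale = 1 - smoothing - smoothing/(dim-1) and shift = -smoothing/(dim-1)
        # (hypothetical constants), the negative of the transformed gradient is exactly
        # the smoothed target distribution.
        return grad_out * ctx.scale + ctx.shift, None, None
```

Wrapping log_prob with such a function before computing the cross-entropy loss leaves the forward value unchanged, while the backward pass behaves as if the targets had been smoothed.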
Just as a side remark: assume

    log_prob = log_softmax(z) .

The gradient of log_softmax is:

    grad_z[j] log_prob[i] = delta(i==j) - softmax(z)[j] .
Then the gradient w.r.t. z[j] is:

    grad_z[j] loss = sum_i (grad_logprob[i] loss) (grad_z[j] log_prob[i])
                   = sum_i -target_prob[i] delta(i==j) + target_prob[i] softmax(z)[j]
                   = -target_prob[j] + (sum_i target_prob[i]) softmax(z)[j]
                   = softmax(z)[j] - target_prob[j]    # assuming (sum_i target_prob[i]) == 1
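This identity can be checked numerically, e.g. with PyTorch autograd (illustration only; PyTorch is just one possible backend here):

```python
import torch

dim = 5
z = torch.randn(dim, requires_grad=True)
target_prob = torch.softmax(torch.randn(dim), dim=-1)  # any distribution with sum 1

loss = -(target_prob * torch.log_softmax(z, dim=-1)).sum()
loss.backward()

# grad_z loss == softmax(z) - target_prob
assert torch.allclose(z.grad, torch.softmax(z, dim=-1) - target_prob, atol=1e-6)
```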