returnn.frontend.label_smoothing
Label smoothing
- returnn.frontend.label_smoothing.label_smoothing(prob: Tensor, smoothing: Tensor | float, *, axis: Dim | None = None) Tensor
Label smoothing, often used for cross entropy.
In case of sparse data, it will become dense (via smooth_one_hot()) and the target label will get probability (1 - smoothing).
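In other words, for sparse targets this is the same as smooth_one_hot() with label_prob = 1 - smoothing. A tiny plain-NumPy illustration of that (not the rf API; dim = 5, smoothing = 0.1 and label index 2 are arbitrary example values):

    import numpy as np

    dim, smoothing = 5, 0.1
    one_hot = np.eye(dim)[2]  # sparse label 2, made dense: [0, 0, 1, 0, 0]
    # The target label gets (1 - smoothing); the rest share the smoothing mass.
    smoothed = np.where(one_hot > 0, 1.0 - smoothing, smoothing / (dim - 1))
    print(smoothed)        # [0.025 0.025 0.9   0.025 0.025]
    print(smoothed.sum())  # 1.0 (up to floating-point rounding)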
- returnn.frontend.label_smoothing.smooth_one_hot(source: Tensor, *, label_prob: Tensor | float) Tensor
Smooth variant of one_hot(). Uses label_prob for the labels and (1 - label_prob) / (dim - 1) for the remaining values. This is used for label smoothing.
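As a rough NumPy sketch of that rule (illustration only, not the rf API; the function name smooth_one_hot_np and the shapes are made up for this example):

    import numpy as np

    def smooth_one_hot_np(labels: np.ndarray, dim: int, label_prob: float) -> np.ndarray:
        """labels: int array of shape [...]; returns dense probs of shape [..., dim]."""
        out = np.full(labels.shape + (dim,), (1.0 - label_prob) / (dim - 1))
        np.put_along_axis(out, labels[..., None], label_prob, axis=-1)
        return out

    print(smooth_one_hot_np(np.array([2, 0]), dim=5, label_prob=0.9))
    # [[0.025 0.025 0.9   0.025 0.025]
    #  [0.9   0.025 0.025 0.025 0.025]]  (each row sums to 1)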
- returnn.frontend.label_smoothing.label_smoothed_log_prob_gradient(log_prob: Tensor, smoothing: Tensor | float, *, axis: Dim | None = None, exclude_labels: Sequence[int] | None = None) Tensor
- Parameters:
log_prob – shape […,D] (not necessarily the same as the loss)
smoothing – smoothing factor, for label_smoothing()
axis – label axis in log_prob (D); uses feature_dim by default
exclude_labels – list of labels to exclude from smoothing (e.g. blank)
- Returns:
log_prob, but the gradient is smoothed
Assume some cross-entropy-like loss:
    loss = - sum_i target_prob[i] * log_prob[i] .
The sum is over the label indices i (corresponding to the axis argument).
Then the gradient of the loss w.r.t. log_prob[i] is:
    grad_logprob[i] loss = -target_prob[i] .
We assume that the negative gradient is a probability distribution (potentially scaled by some factor, e.g. when you scale the loss by some factor) and apply label_smoothing() on it. More specifically, we apply the same scale and shift as in the label_smoothing() function, via scaled_gradient_ext(). Note that this is also the case for the CTC or RNNT loss: the negative gradient of the loss w.r.t. the log-probabilities is a probability distribution.
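To make the smoothed-gradient idea concrete, here is a rough PyTorch sketch of the backward behaviour. This is an illustration only, not RETURNN's implementation (which goes through scaled_gradient_ext()); it assumes the label axis is the last axis, spreads the smoothing mass as smoothing / (dim - 1) over the non-target labels (consistent with smooth_one_hot() above), and ignores exclude_labels:

    import torch

    class _SmoothedLogProbGrad(torch.autograd.Function):
        """Identity in the forward pass; applies the label-smoothing scale/shift to the gradient."""

        @staticmethod
        def forward(ctx, log_prob: torch.Tensor, smoothing: float) -> torch.Tensor:
            ctx.smoothing = smoothing
            return log_prob.view_as(log_prob)  # identity, but with its own autograd node

        @staticmethod
        def backward(ctx, grad: torch.Tensor):
            # grad is minus a (possibly scaled) probability distribution over the last axis.
            dim = grad.shape[-1]
            floor = ctx.smoothing / (dim - 1)  # prob mass per non-target label
            factor = 1.0 - dim * floor         # chosen so the sum over the axis is preserved
            grad_smoothed = grad * factor + grad.sum(dim=-1, keepdim=True) * floor
            return grad_smoothed, None         # no gradient for the smoothing argument

    # Usage sketch: log_prob = _SmoothedLogProbGrad.apply(log_prob, 0.1)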
Common usage example:
    # E.g. there was some log_softmax, or anything to get log probs.
    log_probs = model.get_log_probs(...)
    # Now apply label smoothing on the log prob gradients.
    log_probs = rf.label_smoothed_log_prob_gradient(log_probs, 0.1)
    # E.g. CE, CTC, or similar, any kind of NLL should work.
    loss = loss_func(log_probs, targets)
    loss.sum().backward()
Just as a side remark, assume:
    log_prob = log_softmax(z) .
The gradient of log_softmax is:
    grad_z[j] log_prob[i] = delta(i==j) - softmax(z)[j] .
Then the gradient w.r.t. z[j] is:
    grad_z[j] loss = sum_i (grad_logprob[i] loss) * (grad_z[j] log_prob[i])
                   = sum_i -target_prob[i] * delta(i==j) + target_prob[i] * softmax(z)[j]
                   = -target_prob[j] + (sum_i target_prob[i]) * softmax(z)[j]
                   = softmax(z)[j] - target_prob[j]   # assuming (sum_i target_prob[i]) == 1
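A quick numeric check of this identity (hypothetical PyTorch snippet; target_prob is an arbitrary distribution that sums to 1):

    import torch

    torch.manual_seed(0)
    dim = 5
    z = torch.randn(dim, requires_grad=True)
    target_prob = torch.softmax(torch.randn(dim), dim=-1)  # sums to 1

    loss = -(target_prob * torch.log_softmax(z, dim=-1)).sum()
    loss.backward()

    # grad_z loss == softmax(z) - target_prob
    print(torch.allclose(z.grad, torch.softmax(z.detach(), dim=-1) - target_prob))  # True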