Loss Functions¶
This is a list of all loss functions that can be used by adding "loss": "<class_name_of_loss>" to a layer.
Additional input parameters to the respective loss classes can be given via loss_opts.
A scale for a loss can be set via loss_scale (also see Defining Layers).
If the output of a loss function is needed as part of the network, the LossLayer can be used in combination with one of the losses.
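For example, a cross-entropy loss with options and a scale attached to an output layer could look like this (a minimal sketch; here we assume "ce" is the registered name of CrossEntropyLoss, see below):
"output": {"class": "softmax", "from": "encoder", "target": "classes", "loss": "ce", "loss_opts": {"label_smoothing": 0.1}, "loss_scale": 0.5}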
LossLayer¶
- class returnn.tf.layers.basic.LossLayer(loss_, target_=None, use_error=False, **kwargs)[source]¶
This layer wraps a Loss calculation as a layer. I.e. the loss will be calculated and returned by the layer, but this loss will not be used as a loss by the updater. If you want to use it as a loss, you can use the AsIsLoss, i.e. write "loss": "as_is".
Note that the loss options for the wrapped loss need to be provided via loss_opts_, and it does not apply any reduce function.
Note
The LossLayer might be deprecated in the future in favor of implementing the losses as actual layers.
If you want to define a loss inside the network, it is recommended to define it explicitly. An example could be:
"se_loss": {"class": "eval", "eval": "(source(0) - source(1)) ** 2", "from": ["output", "data:classes"]}
Followed by e.g. a mean reduce if needed:
"mse_loss": {"class": "reduce", "mode": "mean", "axis": "F", "from": "se_loss"}
loss_ and related params have the postfix _ to distinguish them from the loss options, which are used by the network and updater for training. Some of these (e.g. loss_opts_) are handled in transform_config_dict().
- Parameters:
loss_ (Loss)
target_ (LayerBase|None)
use_error (bool) – if True, use the error value instead of the loss value
- get_sub_layer(layer_name)[source]¶
- Parameters:
layer_name (str) – sub layer name
- Return type:
LayerBase|None
- classmethod get_available_sub_layer_names(parent_layer_kwargs)[source]¶
- Parameters:
parent_layer_kwargs (dict[str])
- Return type:
list[str]
- classmethod get_sub_layer_out_data_from_opts(layer_name, parent_layer_kwargs)[source]¶
- Parameters:
layer_name (str) – sub layer name
parent_layer_kwargs (dict[str])
- Returns:
Data template, class type of sub-layer, layer opts (transformed)
- Return type:
(Data, type, dict[str])|None
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer
- output_before_activation: Optional[OutputWithActivation][source]¶
- search_choices: Optional[SearchChoices][source]¶
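A sketch of wrapping a loss as a layer and then training on its value via AsIsLoss (assuming LossLayer is registered under the layer class name "loss" and CrossEntropyLoss under the loss name "ce"):
"ce_value": {"class": "loss", "loss_": "ce", "target_": "classes", "from": "output"},
"train_loss": {"class": "copy", "from": "ce_value", "loss": "as_is"}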
As-Is Loss¶
- class returnn.tf.layers.basic.AsIsLoss(as_error=False, **kwargs)[source]¶
Use the output as-is as the loss.
Also see ViaLayerLoss, which also allows defining a custom error signal (gradient).
- Parameters:
as_error (bool) – if True, use the output as error, otherwise (default) use the output as loss value. Error is purely for reporting, loss value is used for the optimizer as well (when scale != 0).
- output_with_activation: OutputWithActivation | None[source]¶
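Continuing the example from the LossLayer section above, the explicitly defined loss can then be used for training by adding "loss": "as_is" to the reducing layer:
"mse_loss": {"class": "reduce", "mode": "mean", "axis": "F", "from": "se_loss", "loss": "as_is"}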
Binary Cross-Entropy Loss¶
- class returnn.tf.layers.basic.BinaryCrossEntropyLoss(pos_weight=None, **kwargs)[source]¶
Binary cross entropy. We expect the output as logits, not in probability space! Per frame: -mean(target * log(sigmoid(output)) + (1 - target) * log(1 - sigmoid(output)))
- Parameters:
pos_weight (float|None) – weight of positive labels, see tf.nn.weighted_cross_entropy_with_logits.
- get_error()[source]¶
- Returns:
frame error rate as a scalar value with the default self.reduce_func (see also self.get_value)
- Return type:
tf.Tensor
- output_with_activation: OutputWithActivation | None[source]¶
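A sketch of a binary output layer using this loss (assuming the registered name "bin_ce"); note there is no sigmoid activation, since the loss expects logits:
"output": {"class": "linear", "activation": None, "n_out": 1, "from": "encoder", "target": "classes", "loss": "bin_ce", "loss_opts": {"pos_weight": 3.0}}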
Bleu Loss¶
- class returnn.tf.layers.basic.BleuLoss(**kwargs)[source]¶
Note that this loss is not differentiable, thus it’s only for keeping statistics. Also, BLEU is a score, i.e. the higher, the better. Thus, to interpret it as a loss or error, we take the negative value.
- Parameters:
base_network (returnn.tf.network.TFNetwork)
use_flatten_frames (bool) – will use returnn.tf.util.basic.flatten_with_seq_len_mask()
use_normalized_loss (bool) – the loss used in optimization will be normalized
custom_norm_factor (float|function|None) – The standard norm factor is 1/sum(target_seq_len) if the target has a time-axis, or 1/sum(output_seq_len) if there is no target and the output has a time-axis, or 1 otherwise. (See Loss.init() for details.) This is used for proper normalization of accumulated loss/error per epoch and also proper normalization per batch for reporting, no matter if use_normalized_loss is True or False. If you want to change this norm factor, you can set this. As a function, it takes (self=self, output=output, layer=layer) and returns a float scalar.
custom_inv_norm_factor (LayerBase|None) – inverse of custom_norm_factor. Here we allow to pass a layer. Here we also allow to pass any shape and it will automatically be reduced via sum. So you could simply pass target_seq_len directly here. Basically, for all reporting, it uses sum(loss) * sum(custom_inv_norm_factor).
scale (float) – additional scale factor for the loss
_check_output_before_softmax (bool|None)
- init(output, output_with_activation=None, target=None, **kwargs)[source]¶
- Parameters:
output (Data) – generated output
output_with_activation (OutputWithActivation|None)
target (Data) – reference target from dataset
- output_with_activation: OutputWithActivation | None[source]¶
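Since BLEU is only reported, it is typically attached to the final decision layer during search; a sketch (assuming the registered name "bleu"):
"decision": {"class": "decide", "from": "output", "target": "classes", "loss": "bleu"}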
Cross-Entropy Loss¶
- class returnn.tf.layers.basic.CrossEntropyLoss(input_type='prob', focal_loss_factor=0.0, label_smoothing=0.0, label_smoothing_gaussian=False, debug_dump=False, safe_log_opts=None, use_fused=True, fake_upper_bound=None, **kwargs)[source]¶
Cross-Entropy loss. Basically -sum(target * log(output)).
- Parameters:
input_type (str) – “prob” (default) or “logits”
focal_loss_factor (float) – see https://arxiv.org/abs/1708.02002. 0 means disabled
label_smoothing (float) – 0.1 is a common default. see returnn.tf.util.basic.smoothing_cross_entropy()
label_smoothing_gaussian (bool) – see returnn.tf.util.basic.smoothing_cross_entropy()
debug_dump (bool)
safe_log_opts (dict[str]) – passed to safe_log()
use_fused (bool) – if possible, use fused ops
fake_upper_bound (float|None) – uses returnn.tf.util.basic.minimum_with_identity_grad(). I.e. you will see a finite loss, but we use the original gradient (which should be safe).
- get_output_target_scores()[source]¶
- Returns:
shape (time_flat,), type float32, std-prob space
- Return type:
tf.Tensor
- output_with_activation: OutputWithActivation | None[source]¶
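A sketch passing logits directly and enabling focal loss (assuming the registered name "ce"):
"output": {"class": "linear", "activation": None, "from": "lstm", "target": "classes", "loss": "ce", "loss_opts": {"input_type": "logits", "focal_loss_factor": 2.0}}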
CTC Loss¶
- class returnn.tf.layers.basic.CtcLoss(target_collapse_repeated=False, auto_clip_target_len=False, output_in_log_space=False, beam_width=100, ctc_opts=None, use_native=False, use_viterbi=False, **kwargs)[source]¶
Connectionist Temporal Classification (CTC) loss. Basically a wrapper around tf.nn.ctc_loss.
- Parameters:
target_collapse_repeated (bool) – like preprocess_collapse_repeated option for CTC. used for sparse_labels().
auto_clip_target_len (bool) – see self._get_target_sparse_labels().
output_in_log_space (bool) – False -> output expected in prob space. see self.get_output_logits
beam_width (int) – used in eval
ctc_opts (dict[str]|None) – other kwargs used for tf.nn.ctc_loss
use_native (bool) – use our native implementation (TFNativeOp.ctc_loss())
use_viterbi (bool) – instead of full-sum, use only best path (via ctc_loss_viterbi())
- get_soft_alignment()[source]¶
Also called the Baum-Welch-alignment. This is basically p_t(s|x_1^T,w_1^N), where s are the output labels (including blank), and w are the real target labels.
- Returns:
shape (time, batch, dim)
- Return type:
tf.Tensor
- classmethod get_auto_output_layer_dim(target_dim)[source]¶
- Parameters:
target_dim (returnn.tensor.Dim)
- Return type:
returnn.tensor.Dim
- output_with_activation: OutputWithActivation | None[source]¶
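A sketch of a CTC output layer (assuming the registered name "ctc"). Per get_auto_output_layer_dim() above, the output dimension is derived from the target dimension plus one (for the blank label), so n_out can typically be omitted:
"ctc": {"class": "softmax", "from": "encoder", "target": "classes", "loss": "ctc", "loss_opts": {"use_native": True, "beam_width": 16}}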
Deep Clustering Loss¶
- class returnn.tf.layers.basic.DeepClusteringLoss(embedding_dimension, nr_of_sources, **kwargs)[source]¶
Cost function used for deep clustering as described in [Hershey & Chen+, 2016]: “Deep clustering: Discriminative embeddings for segmentation and separation”
- Parameters:
embedding_dimension (int)
nr_of_sources (int)
- output_with_activation: OutputWithActivation | None[source]¶
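A sketch of an embedding output layer with this loss (assuming the registered name "deep_clustering"; both parameters are required constructor arguments):
"output": {"class": "linear", "activation": "tanh", "from": "blstm", "loss": "deep_clustering", "loss_opts": {"embedding_dimension": 40, "nr_of_sources": 2}}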
Edit Distance Loss¶
- class returnn.tf.layers.basic.EditDistanceLoss(debug_print=False, label_map=None, ctc_decode=False, output_in_log_space=False, **kwargs)[source]¶
Note that this loss is not differentiable, thus it’s only for keeping statistics.
- Parameters:
debug_print (bool) – will tf.Print the sequence
label_map (dict[int,int]|None) – before calculating the edit-distance, will apply this map
ctc_decode (bool) – True -> expects dense output and does CTC decode, False -> expects sparse labels in output
output_in_log_space (bool) – False -> dense output expected in prob space. see self.get_output_logits
- init(output, output_with_activation=None, target=None, **kwargs)[source]¶
- Parameters:
output (Data) – generated output
output_with_activation (OutputWithActivation|None)
target (Data) – reference target from dataset
- output_with_activation: OutputWithActivation | None[source]¶
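Like BleuLoss, this is for reporting only; a sketch attaching it to the search decision layer (assuming the registered name "edit_distance"):
"decision": {"class": "decide", "from": "output", "target": "classes", "loss": "edit_distance"}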
Expected Loss¶
- class returnn.tf.layers.basic.ExpectedLoss(loss, loss_kind, norm_scores=True, norm_scores_stop_gradient=True, divide_beam_size=True, subtract_average_loss=True, loss_correction_grad_only=False, **kwargs)[source]¶
This loss takes the error or value of another loss and, given the search beam scores, calculates the expected loss. Sometimes also called minimum Bayes risk.
- Parameters:
loss (Loss)
loss_kind (str) – “error” or “value”. whether to use loss.get_error() or loss.get_value()
norm_scores (bool)
norm_scores_stop_gradient (bool)
divide_beam_size (bool)
subtract_average_loss (bool)
loss_correction_grad_only (bool)
- search_choices: SearchChoices | None[source]¶
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str])
network (returnn.tf.network.TFNetwork)
get_layer
- output_with_activation: OutputWithActivation | None[source]¶
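A rough sketch (assuming the registered name "expected_loss", and assuming the wrapped loss is given as a dict with a "class" key, which transform_config_dict() above would resolve); this is meant for a layer inside a search beam:
"output_prob": {"class": "softmax", "from": "decoder", "target": "classes", "loss": "expected_loss", "loss_opts": {"loss": {"class": "edit_distance"}, "loss_kind": "error"}}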
Extern Sprint Loss¶
- class returnn.tf.layers.basic.ExternSprintLoss(sprint_opts, **kwargs)[source]¶
The loss is calculated by an extern Sprint instance.
- Parameters:
sprint_opts (dict[str])
- output_with_activation: OutputWithActivation | None[source]¶
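A rough sketch (assuming the registered name "sprint"; the sprint_opts content is setup-specific and the values here are placeholders):
"output": {"class": "softmax", "from": "encoder", "target": "classes", "loss": "sprint", "loss_opts": {"sprint_opts": {"sprintExecPath": "...", "sprintConfigStr": "..."}}}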
Fast Baum Welch Loss¶
- class returnn.tf.layers.basic.FastBaumWelchLoss(sprint_opts, tdp_scale=1.0, **kwargs)[source]¶
The loss is calculated via fast_baum_welch(). The automata are created by an extern Sprint instance.
- Parameters:
sprint_opts (dict[str])
- output_with_activation: OutputWithActivation | None[source]¶
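Similarly, a sketch for full-sum training with this loss (assuming the registered name "fast_bw"; sprint_opts again is setup-specific):
"output": {"class": "softmax", "from": "encoder", "loss": "fast_bw", "loss_opts": {"sprint_opts": {"sprintExecPath": "...", "sprintConfigStr": "..."}, "tdp_scale": 0.0}}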
Generic Cross-Entropy Loss¶
- class returnn.tf.layers.basic.GenericCELoss(**kwargs)[source]¶
Some generalization of cross entropy.
- Parameters:
base_network (returnn.tf.network.TFNetwork)
use_flatten_frames (bool) – will use returnn.tf.util.basic.flatten_with_seq_len_mask()
use_normalized_loss (bool) – the loss used in optimization will be normalized
custom_norm_factor (float|function|None) – The standard norm factor is 1/sum(target_seq_len) if the target has a time-axis, or 1/sum(output_seq_len) if there is no target and the output has a time-axis, or 1 otherwise. (See Loss.init() for details.) This is used for proper normalization of accumulated loss/error per epoch and also proper normalization per batch for reporting, no matter if use_normalized_loss is True or False. If you want to change this norm factor, you can set this. As a function, it takes (self=self, output=output, layer=layer) and returns a float scalar.
custom_inv_norm_factor (LayerBase|None) – inverse of custom_norm_factor. Here we allow to pass a layer. Here we also allow to pass any shape and it will automatically be reduced via sum. So you could simply pass target_seq_len directly here. Basically, for all reporting, it uses sum(loss) * sum(custom_inv_norm_factor).
scale (float) – additional scale factor for the loss
_check_output_before_softmax (bool|None)
- output_with_activation: OutputWithActivation | None[source]¶
Mean-L1 Loss¶
- class returnn.tf.layers.basic.MeanL1Loss(base_network, use_flatten_frames=True, use_normalized_loss=False, custom_norm_factor=None, custom_inv_norm_factor=None, scale=1.0, _check_output_before_softmax=None)[source]¶
Like MSE loss, but with absolute difference
- Parameters:
base_network (returnn.tf.network.TFNetwork)
use_flatten_frames (bool) – will use returnn.tf.util.basic.flatten_with_seq_len_mask()
use_normalized_loss (bool) – the loss used in optimization will be normalized
custom_norm_factor (float|function|None) – The standard norm factor is 1/sum(target_seq_len) if the target has a time-axis, or 1/sum(output_seq_len) if there is no target and the output has a time-axis, or 1 otherwise. (See Loss.init() for details.) This is used for proper normalization of accumulated loss/error per epoch and also proper normalization per batch for reporting, no matter if use_normalized_loss is True or False. If you want to change this norm factor, you can set this. As a function, it takes (self=self, output=output, layer=layer) and returns a float scalar.
custom_inv_norm_factor (LayerBase|None) – inverse of custom_norm_factor. Here we allow to pass a layer. Here we also allow to pass any shape and it will automatically be reduced via sum. So you could simply pass target_seq_len directly here. Basically, for all reporting, it uses sum(loss) * sum(custom_inv_norm_factor).
scale (float) – additional scale factor for the loss
_check_output_before_softmax (bool|None)
- output_with_activation: OutputWithActivation | None[source]¶
Mean-Squared-Error Loss¶
- class returnn.tf.layers.basic.MeanSquaredError(base_network, use_flatten_frames=True, use_normalized_loss=False, custom_norm_factor=None, custom_inv_norm_factor=None, scale=1.0, _check_output_before_softmax=None)[source]¶
The generic mean squared error loss function
- Parameters:
base_network (returnn.tf.network.TFNetwork)
use_flatten_frames (bool) – will use returnn.tf.util.basic.flatten_with_seq_len_mask()
use_normalized_loss (bool) – the loss used in optimization will be normalized
custom_norm_factor (float|function|None) – The standard norm factor is 1/sum(target_seq_len) if the target has a time-axis, or 1/sum(output_seq_len) if there is no target and the output has a time-axis, or 1 otherwise. (See Loss.init() for details.) This is used for proper normalization of accumulated loss/error per epoch and also proper normalization per batch for reporting, no matter if use_normalized_loss is True or False. If you want to change this norm factor, you can set this. As a function, it takes (self=self, output=output, layer=layer) and returns a float scalar.
custom_inv_norm_factor (LayerBase|None) – inverse of custom_norm_factor. Here we allow to pass a layer. Here we also allow to pass any shape and it will automatically be reduced via sum. So you could simply pass target_seq_len directly here. Basically, for all reporting, it uses sum(loss) * sum(custom_inv_norm_factor).
scale (float) – additional scale factor for the loss
_check_output_before_softmax (bool|None)
- output_with_activation: OutputWithActivation | None[source]¶
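A sketch of a regression output layer using this loss (assuming the registered name "mse"):
"output": {"class": "linear", "activation": None, "n_out": 80, "from": "decoder", "target": "data", "loss": "mse"}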
L1 Loss¶
- class returnn.tf.layers.basic.L1Loss(base_network, use_flatten_frames=True, use_normalized_loss=False, custom_norm_factor=None, custom_inv_norm_factor=None, scale=1.0, _check_output_before_softmax=None)[source]¶
L1-distance loss. Basically sum(abs(target - output)).
- Parameters:
base_network (returnn.tf.network.TFNetwork)
use_flatten_frames (bool) – will use returnn.tf.util.basic.flatten_with_seq_len_mask()
use_normalized_loss (bool) – the loss used in optimization will be normalized
custom_norm_factor (float|function|None) – The standard norm factor is 1/sum(target_seq_len) if the target has a time-axis, or 1/sum(output_seq_len) if there is no target and the output has a time-axis, or 1 otherwise. (See Loss.init() for details.) This is used for proper normalization of accumulated loss/error per epoch and also proper normalization per batch for reporting, no matter if use_normalized_loss is True or False. If you want to change this norm factor, you can set this. As a function, it takes (self=self, output=output, layer=layer) and returns a float scalar.
custom_inv_norm_factor (LayerBase|None) – inverse of custom_norm_factor. Here we allow to pass a layer. Here we also allow to pass any shape and it will automatically be reduced via sum. So you could simply pass target_seq_len directly here. Basically, for all reporting, it uses sum(loss) * sum(custom_inv_norm_factor).
scale (float) – additional scale factor for the loss
_check_output_before_softmax (bool|None)
- output_with_activation: OutputWithActivation | None[source]¶
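Assuming the registered names "l1" and "mean_l1", these regression losses are drop-in replacements for each other; the MSE example above becomes an L1 loss by changing only the loss string:
"output": {"class": "linear", "activation": None, "n_out": 80, "from": "decoder", "target": "data", "loss": "l1"}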
Sampling-Based Loss¶
- class returnn.tf.layers.basic.SamplingBasedLoss(num_sampled=128, num_splits=1, sampler='log_uniform', nce_loss=False, use_full_softmax=False, remove_accidental_hits=None, sampler_args=None, nce_log_norm_term=0.0, **kwargs)[source]¶
Implements two sampling-based losses: sampled softmax (default, see https://www.tensorflow.org/api_docs/python/tf/nn/sampled_softmax_loss) and noise contrastive estimation (see https://www.tensorflow.org/api_docs/python/tf/nn/nce_loss).
Must be used in an output linear layer with a weight matrix of shape (num_classes, dim). When using ‘log_uniform’ sampler (default), optimal performance is typically achieved with the vocabulary list sorted in decreasing order of frequency (https://www.tensorflow.org/api_docs/python/tf/random/log_uniform_candidate_sampler).
- Parameters:
num_sampled (int) – Number of classes to be sampled. For sampled softmax, this is the number of classes to be used to estimate the sampled softmax. For noise contrastive estimation, this is the number of noise samples.
num_splits (int) – Number of different samples (each with ‘num_sampled’ classes) to be used per batch.
sampler (str) – Specify sampling distribution (“uniform”, “log_uniform”, “learned_unigram” or “fixed_unigram”).
nce_loss (bool) – If True, use noise contrastive estimation loss. Else (default), use the sampled softmax.
use_full_softmax (bool) – If True, compute the full softmax instead of sampling (can be used for evaluation).
remove_accidental_hits (bool|None) – If True, remove sampled classes that equal one of the target classes. If not specified (None), the value is determined based on the chosen objective: for sampled softmax this should be set to True; for NCE the default is False. Set this to True for NCE training if the objective is equal to the sampled logistic loss.
sampler_args (dict[str]) – additional arguments for the candidate sampler. This is most relevant to the fixed_unigram sampler. See https://www.tensorflow.org/api_docs/python/tf/random/fixed_unigram_candidate_sampler for details.
nce_log_norm_term (float) – The logarithm of the constant normalization term for NCE.
- output_with_activation: OutputWithActivation | None[source]¶
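A sketch of a large-vocabulary output layer trained with sampled softmax (assuming the registered name "sampling_loss"); per the note above, this must be a linear output layer:
"output": {"class": "linear", "activation": None, "from": "lstm", "target": "classes", "loss": "sampling_loss", "loss_opts": {"num_sampled": 8192, "sampler": "log_uniform"}}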
Triplet Loss¶
- class returnn.tf.layers.basic.TripletLoss(margin, multi_view_training=False, **kwargs)[source]¶
Triplet loss: loss = max(margin + d(x_a, x_s) - d(x_a, x_d), 0.0)
Triplet loss is used for metric learning in a siamese/triplet network. It should be used as part of a CopyLayer with 3 inputs corresponding to x_a, x_s and x_d in the loss.
Here we assume that x_a are anchor samples and x_s are samples such that, at each position i in a minibatch, x_ai and x_si belong to the same class, while the pairs x_ai and x_di belong to different classes.
In this implementation the number of training examples is increased by extracting all possible same/different pairs within a minibatch.
- Parameters:
base_network (returnn.tf.network.TFNetwork)
use_flatten_frames (bool) – will use returnn.tf.util.basic.flatten_with_seq_len_mask()
use_normalized_loss (bool) – the loss used in optimization will be normalized
custom_norm_factor (float|function|None) – The standard norm factor is 1/sum(target_seq_len) if the target has a time-axis, or 1/sum(output_seq_len) if there is no target and the output has a time-axis, or 1 otherwise. (See Loss.init() for details.) This is used for proper normalization of accumulated loss/error per epoch and also proper normalization per batch for reporting, no matter if use_normalized_loss is True or False. If you want to change this norm factor, you can set this. As a function, it takes (self=self, output=output, layer=layer) and returns a float scalar.
custom_inv_norm_factor (LayerBase|None) – inverse of custom_norm_factor. Here we allow to pass a layer. Here we also allow to pass any shape and it will automatically be reduced via sum. So you could simply pass target_seq_len directly here. Basically, for all reporting, it uses sum(loss) * sum(custom_inv_norm_factor).
scale (float) – additional scale factor for the loss
_check_output_before_softmax (bool|None)
- output_with_activation: OutputWithActivation | None[source]¶
- init(output, output_with_activation=None, target=None, **kwargs)[source]¶
- Parameters:
output (Data) – generated output
output_with_activation (OutputWithActivation|None)
target (Data) – reference target from dataset
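A sketch following the description above: a CopyLayer over three inputs (the layer names "anchor", "same" and "diff" are hypothetical) with the loss attached (assuming the registered name "triplet_loss"):
"triplet": {"class": "copy", "from": ["anchor", "same", "diff"], "loss": "triplet_loss", "loss_opts": {"margin": 0.2}}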
Via Layer Loss¶
- class returnn.tf.layers.basic.ViaLayerLoss(error_signal_layer=None, align_layer=None, loss_wrt_to_act_in=False, **kwargs)[source]¶
The loss error signal and loss value are defined as the output of another layer. That way, you can define any custom loss. This could e.g. be used together with the fast_bw layer.
This is a more custom variant of AsIsLoss, which simply takes the output of a layer as loss without redefining the error signal (gradient).
- Parameters:
error_signal_layer (LayerBase)
align_layer (LayerBase)
loss_wrt_to_act_in (bool|str) – if True, we expect that the given output_with_activation is set, and the given error signal is w.r.t. the input of the specific activation function. A common example is the input to the softmax function, where the gradient is much more stable to define, e.g. y - z instead of y/z for cross entropy. If you specify a str, e.g. “softmax” or “log_softmax”, there is an additional check that the used activation function is really that one.
- classmethod transform_config_dict(d, network, get_layer)[source]¶
- Parameters:
d (dict[str]) – will modify inplace, the loss_opts
network (returnn.tf.network.TFNetwork)
get_layer (((str) -> LayerBase)) – function to get or construct another layer
- output_with_activation: OutputWithActivation | None[source]¶
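A rough sketch of the fast_bw use case mentioned above (assuming the registered name "via_layer"; the layer name "fast_bw_align" is hypothetical and would be defined elsewhere in the network):
"output": {"class": "softmax", "from": "encoder", "loss": "via_layer", "loss_opts": {"align_layer": "fast_bw_align", "loss_wrt_to_act_in": "softmax"}}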