Loss functions

For classification problems, y is equal to 1 if the example is a positive and 0 if it is a negative. \hat{y} can take on any value (although predicting outside of the (0,1) interval is unlikely to be useful).

Classification

Cross-entropy loss

Loss function for classification.

L(y,\hat{y}) = -\sum_i \sum_c y_{i,c} \log(\hat{y}_{i,c})

where c are the classes. y_{i,c} equals 1 if example i is in class c and 0 otherwise. \hat{y}_{i,c} is the predicted probability that example i is in class c.

_images/crossentropy.png

For discrete distributions (i.e. classification problems rather than regression) this is the same as the negative log-likelihood loss.
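
As a minimal sketch of the formula above in NumPy (assuming y is one-hot encoded and \hat{y} holds per-class probabilities):

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    """Cross-entropy loss summed over examples and classes.

    y     -- one-hot labels, shape (n_examples, n_classes)
    y_hat -- predicted probabilities, shape (n_examples, n_classes)
    """
    y_hat = np.clip(y_hat, eps, 1.0)  # avoid log(0)
    return -np.sum(y * np.log(y_hat))

# Two examples, three classes.
y = np.array([[1, 0, 0],
              [0, 1, 0]])
y_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
print(cross_entropy(y, y_hat))  # -(log 0.7 + log 0.8) ≈ 0.58
```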

Hinge loss

Let positives be encoded as y = 1 and negatives as y = -1. Then the hinge loss is defined as:

L(y,\hat{y}) = \max\{0, m - y \hat{y}\}

The margin m is a hyperparameter that is commonly set to 1.

_images/hinge.png

Used for training SVMs.
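
A minimal sketch, assuming labels in {-1, +1} and a real-valued score \hat{y}:

```python
import numpy as np

def hinge_loss(y, y_hat, m=1.0):
    """Hinge loss with margin m; y in {-1, +1}, y_hat is a raw score."""
    return np.maximum(0.0, m - y * y_hat)

# Confident correct prediction -> no loss; correct but inside the
# margin -> small loss; wrong prediction -> loss grows linearly.
print(hinge_loss(np.array([1, 1, -1]), np.array([2.0, 0.3, 0.5])))
# [0.   0.7  1.5]
```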

Focal loss

Variant of the cross-entropy loss, designed for use on datasets with severe class imbalance. It is defined as:

L(p) = -(1 - p)^\gamma \log(p)

Where p is the predicted probability of the true class and \gamma is a hyperparameter controlling how strongly easy, well-classified examples are down-weighted, focusing training on the hard examples. If \gamma = 0 the focal loss is equivalent to the cross-entropy loss.
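
A minimal sketch for the binary case, where p is the model's predicted probability for the true class:

```python
import numpy as np

def focal_loss(p, gamma=2.0, eps=1e-12):
    """Focal loss given the predicted probability p of the true class."""
    p = np.clip(p, eps, 1.0)
    return -((1.0 - p) ** gamma) * np.log(p)

# Easy examples (p close to 1) are down-weighted heavily compared
# to plain cross-entropy (gamma = 0); hard examples are not.
for p in (0.9, 0.5, 0.1):
    print(p, focal_loss(p, gamma=2.0), focal_loss(p, gamma=0.0))
```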

Noise Contrastive Estimation

Like negative sampling, this is a technique for efficient learning when the number of output classes is large. Useful for language modelling.

A binary classification task is created to disambiguate pairs that are expected to be close to each other from ‘noisy’ examples put together at random.

In essence, rather than estimating P(y|x), NCE estimates P(C=1|x,y) where C = 1 if y has been sampled from the real distribution and C = 0 if y has been sampled from the noise distribution.

NCE makes training time at the output layer independent of the number of classes. Evaluation time remains linear in the number of classes, however.

L(x,y) = -\sum_i \left[ \log P(C=1|x_i,y_i) + \sum_{j = 1}^k \log\left(1 - P(C=1|x_i,y^n_j)\right) \right]

k is a hyperparameter denoting the number of noise samples drawn for each real sample. y_i is a label sampled from the data distribution and y^n_j is one sampled from the noise distribution. C = 1 if a pair (x,y) was drawn from the data distribution and 0 otherwise.
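
One common way to parameterise P(C=1|x,y) (an assumption, not stated above) is via the model's unnormalised score s(x,y) and the noise probability p_n(y), giving P(C=1|x,y) = \sigma(s(x,y) - \log(k \, p_n(y))). A sketch of the per-example loss under that assumption:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nce_loss(s_data, log_pn_data, s_noise, log_pn_noise, k):
    """NCE loss for one example (assumed parameterisation, see above).

    s_data       -- model's unnormalised log-score for the true label
    log_pn_data  -- log-probability of the true label under the noise distribution
    s_noise      -- model scores for the k noise labels, shape (k,)
    log_pn_noise -- noise log-probabilities of those labels, shape (k,)
    """
    # P(C=1 | x, y) = sigmoid(s(x, y) - log(k * p_n(y)))
    p_real_data = sigmoid(s_data - np.log(k) - log_pn_data)
    p_real_noise = sigmoid(s_noise - np.log(k) - log_pn_noise)
    return -(np.log(p_real_data) + np.sum(np.log(1.0 - p_real_noise)))
```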

Embeddings

Contrastive loss

Loss function for learning embeddings, often used in face verification.

The inputs are pairs of examples x_1 and x_2, where y = 1 if the two examples are similar and 0 if not.

L(x_1,x_2,y) = y d(x_1,x_2)^2 + (1 - y) \max\{0, m - d(x_1,x_2)\}^2

Where x_1 and x_2 are the embeddings of the two examples, m is a hyperparameter called the margin and d(\cdot,\cdot) is a distance function, usually the Euclidean distance.

Intuition

If y = 1 the two examples x_1 and x_2 are similar and we want to minimize the distance d(x_1,x_2). Otherwise (y = 0) we wish to maximize it.

The margin

If y = 0 we want to make d(x_1,x_2) as large as possible to minimize the loss. However, beyond the threshold for classifying the pair as a negative, increasing this distance has no effect on the accuracy. The margin reflects this intuition in the loss function: increasing d(x_1,x_2) beyond m does not reduce the loss any further.

There is no margin for when y = 1. This case is naturally bounded by 0 as the Euclidean distance cannot be negative.
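
A minimal sketch of the contrastive loss for a single pair of embeddings, assuming Euclidean distance and an illustrative margin of 1:

```python
import numpy as np

def contrastive_loss(e1, e2, y, m=1.0):
    """Contrastive loss for embeddings e1, e2 with label y (1 = similar)."""
    d = np.linalg.norm(e1 - e2)  # Euclidean distance
    return y * d ** 2 + (1 - y) * max(0.0, m - d) ** 2

a = np.array([0.0, 0.0])
b = np.array([2.0, 0.0])
print(contrastive_loss(a, b, y=1))  # similar pair: loss = d^2 = 4.0
print(contrastive_loss(a, b, y=0))  # dissimilar pair already beyond the margin: 0.0
```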

Negative sampling

The problem is reframed as a binary classification problem.

L(x_0,x_1,y) = -\left[ y\log \sigma(f(x_0) \cdot f(x_1)) + (1-y)\log \sigma(-f(x_0) \cdot f(x_1)) \right]

where x_0 and x_1 are two examples, f is the learned embedding function and y = 1 if the pair (x_0,x_1) are expected to be similar and y = 0 otherwise. The dot product measures the similarity between the two embeddings.
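
A minimal sketch, assuming f(x_0) and f(x_1) have already been computed as embedding vectors:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(f0, f1, y):
    """Negative-sampling loss for embeddings f0, f1 with label y (1 = similar pair)."""
    s = np.dot(f0, f1)
    return -(y * np.log(sigmoid(s)) + (1 - y) * np.log(sigmoid(-s)))
```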

Noise Contrastive Estimation

A binary classification task is created to disambiguate pairs that are expected to be close to each other from ‘noisy’ examples put together at random.

L(x_0,x_1,y) = -\left[ y\log \sigma(f(x_0) \cdot f(x_1)) + (1-y)\log\left(1-\sigma(f(x_0) \cdot f(x_1))\right) \right]

where x_0 and x_1 are two examples, f is the learned embedding function and y = 1 if the pair (x_0,x_1) are expected to be similar and y = 0 if not (because they have been sampled from the noise distribution). The dot product measures the similarity between the two embeddings and the sigmoid function maps it to (0,1) so it can be interpreted as a prediction from a binary classifier.

This means maximising the predicted probability that real pairs come from the data and that noise pairs do not. Parameter update complexity is linear in the number of noise samples rather than in the size of the vocabulary. The model is improved by having more noise samples than training samples, with around 15 times more being optimal.
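
Since 1 - \sigma(s) = \sigma(-s), the per-pair term is identical to the negative sampling loss above. A sketch of how it might be applied to one real pair plus k noise pairs (the helper names are illustrative, not from the text):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pair_loss(f0, f1, y):
    """Binary classification loss for one pair of embeddings."""
    s = np.dot(f0, f1)
    return -(y * np.log(sigmoid(s)) + (1 - y) * np.log(1.0 - sigmoid(s)))

def nce_embedding_loss(f_x, f_y, noise_embeddings):
    """Loss for one real pair (f_x, f_y) plus k noise embeddings paired with f_x."""
    loss = pair_loss(f_x, f_y, y=1)        # real pair, drawn from the data
    for f_noise in noise_embeddings:       # k pairs drawn from the noise distribution
        loss += pair_loss(f_x, f_noise, y=0)
    return loss
```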

Triplet loss

Used for training embeddings with triplet networks. A triplet is composed of an anchor (a), a positive example (p) and a negative example (n). The positive examples are similar to the anchor and the negative examples are dissimilar.

L(a,p,n) = \sum_i \max\{0, m + d(a_i,p_i) - d(a_i,n_i)\}

Where m is a hyperparameter called the margin and d(\cdot,\cdot) is a distance function, usually the Euclidean distance.

The margin

We want to minimize d(a_i,p_i) and maximize d(a_i,n_i). The former is lower-bounded by 0 but the latter has no upper bound, since distances can be arbitrarily large. However, beyond the threshold needed to classify a pair as a negative, increasing this distance does not improve accuracy, and the loss function should reflect that. The margin does so by ensuring there is no gain from increasing d(a_i,n_i) beyond m + d(a_i,p_i), since the max then sets the loss to 0.
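
A minimal sketch for a single triplet, assuming Euclidean distance and a margin of 1:

```python
import numpy as np

def triplet_loss(a, p, n, m=1.0):
    """Triplet loss for anchor a, positive p and negative n embeddings."""
    d_ap = np.linalg.norm(a - p)
    d_an = np.linalg.norm(a - n)
    return max(0.0, m + d_ap - d_an)

a = np.array([0.0, 0.0])
p = np.array([0.5, 0.0])   # close to the anchor
n = np.array([3.0, 0.0])   # far from the anchor
print(triplet_loss(a, p, n))  # max(0, 1 + 0.5 - 3) = 0: the negative is far enough away
```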

Regression

Huber loss

A loss function used for regression. It is less sensitive to outliers than the squared loss since there is only a linear relationship between the size of the error and the loss beyond \delta.

L(y,\hat{y};\delta) =
        \begin{cases}
            \frac{1}{2}(y - \hat{y})^2, & \text{if } |y - \hat{y}| \leq \delta \\
            \delta\left(|y - \hat{y}| - \frac{1}{2}\delta\right), & \text{otherwise}
        \end{cases}

Where \delta is a hyperparameter.

_images/huber.png
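
A minimal vectorised sketch of the piecewise definition above:

```python
import numpy as np

def huber_loss(y, y_hat, delta=1.0):
    """Huber loss: quadratic for small errors, linear beyond delta."""
    err = np.abs(y - y_hat)
    return np.where(err <= delta,
                    0.5 * err ** 2,
                    delta * (err - 0.5 * delta))
```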

Squared loss

A loss function used for regression.

L(y,\hat{y}) = \sum_i (y_i - \hat{y}_i)^2

_images/squared.png

Disadvantages

Squaring the error means this loss function penalises large errors disproportionately more than small ones. This can be particularly harmful when the data contains outliers. One solution is to use the Huber loss.
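
To make the comparison with the Huber loss concrete, a small sketch (with an assumed \delta = 1) showing how a single outlier dominates the squared loss but contributes only linearly to the Huber loss:

```python
import numpy as np

y     = np.array([1.0, 2.0, 3.0, 100.0])  # the last target is an outlier
y_hat = np.array([1.1, 1.9, 3.2, 4.0])

squared = np.sum((y - y_hat) ** 2)

delta = 1.0
err = np.abs(y - y_hat)
huber = np.sum(np.where(err <= delta, 0.5 * err ** 2, delta * (err - 0.5 * delta)))

print(squared)  # ~9216, dominated by the single outlier
print(huber)    # ~95.5, the outlier's contribution is only linear
```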