Loss functions

For classification problems, y is equal to 1 if the example is a positive and 0 if it is a negative. \hat{y} can take on any value (although predicting outside of the (0,1) interval is unlikely to be useful).

Classification

Cross-entropy loss

Loss function for classification.

L(y,\hat{y}) = -\sum_i \sum_c y_{i,c} \log(\hat{y}_{i,c})

where c are the classes. y_{i,c} equals 1 if example i is in class c and 0 otherwise. \hat{y}_{i,c} is the predicted probability that example i is in class c.

_images/crossentropy.png

For discrete distributions (i.e. classification problems rather than regression) this is the same as the negative log-likelihood loss.
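
As a minimal sketch of the formula above in NumPy (assuming y is one-hot encoded and \hat{y} holds per-class probabilities):

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    """Cross-entropy loss summed over examples and classes.

    y     -- one-hot labels, shape (n_examples, n_classes)
    y_hat -- predicted probabilities, shape (n_examples, n_classes)
    """
    y_hat = np.clip(y_hat, eps, 1.0)  # avoid log(0)
    return -np.sum(y * np.log(y_hat))

# Two examples, three classes.
y = np.array([[1, 0, 0],
              [0, 1, 0]])
y_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
print(cross_entropy(y, y_hat))  # -(log 0.7 + log 0.8) ≈ 0.58
```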

Hinge loss

Let positives be encoded as y = 1 and negatives as y = -1. Then the hinge loss is defined as:

L(y,\hat{y}) = \max\{0, m - y \hat{y}\}

The margin m is a hyperparameter that is commonly set to 1.

_images/hinge.png

Used for training SVMs.
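
A minimal sketch, assuming labels in {-1, +1} and a real-valued score \hat{y}:

```python
import numpy as np

def hinge_loss(y, y_hat, m=1.0):
    """Hinge loss with margin m; y in {-1, +1}, y_hat is a raw score."""
    return np.maximum(0.0, m - y * y_hat)

# Confident correct prediction -> no loss; correct but inside the
# margin -> small loss; wrong prediction -> loss grows linearly.
print(hinge_loss(np.array([1, 1, -1]), np.array([2.0, 0.3, 0.5])))
# [0.   0.7  1.5]
```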

Focal loss

Variant of the cross-entropy loss, designed for use on datasets with severe class imbalance. It is defined as:

L(p) = -(1 - p)^\gamma \log(p)

Where p is the predicted probability of the true class and \gamma is a hyperparameter controlling how strongly easy, well-classified examples are down-weighted, focusing training on the hard examples. If \gamma = 0 the focal loss is equivalent to the cross-entropy loss.
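
A minimal sketch for the binary case, where p is the model's predicted probability for the true class:

```python
import numpy as np

def focal_loss(p, gamma=2.0, eps=1e-12):
    """Focal loss given the predicted probability p of the true class."""
    p = np.clip(p, eps, 1.0)
    return -((1.0 - p) ** gamma) * np.log(p)

# Easy examples (p close to 1) are down-weighted heavily compared
# to plain cross-entropy (gamma = 0); hard examples are not.
for p in (0.9, 0.5, 0.1):
    print(p, focal_loss(p, gamma=2.0), focal_loss(p, gamma=0.0))
```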

Noise Contrastive Estimation

Like negative sampling, this is a technique for efficient learning when the number of output classes is large. Useful for language modelling.

A binary classification task is created to disambiguate pairs that are expected to be close to each other from ‘noisy’ examples put together at random.

In essence, rather than estimating P(y|x), NCE estimates P(C=1|x,y) where C = 1 if y has been sampled from the real distribution and C = 0 if y has been sampled from the noise distribution.

NCE makes training time at the output layer independent of the number of classes. Evaluation time remains linear in the number of classes, however.

L(x,y) = -\sum_i \left[ \log P(C=1|x_i,y_i) + \sum_{j = 1}^k \log\left(1 - P(C=1|x_i,y^n_j)\right) \right]

k is a hyperparameter denoting the number of noise samples drawn for each real sample. y_i is a label sampled from the data distribution and y^n_j is one sampled from the noise distribution. C = 1 if a pair (x,y) was drawn from the data distribution and 0 otherwise.
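
One common way to parameterise P(C=1|x,y) (an assumption, not stated above) is via the model's unnormalised score s(x,y) and the noise probability p_n(y), giving P(C=1|x,y) = \sigma(s(x,y) - \log(k \, p_n(y))). A sketch of the per-example loss under that assumption:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nce_loss(s_data, log_pn_data, s_noise, log_pn_noise, k):
    """NCE loss for one example (assumed parameterisation, see above).

    s_data       -- model's unnormalised log-score for the true label
    log_pn_data  -- log-probability of the true label under the noise distribution
    s_noise      -- model scores for the k noise labels, shape (k,)
    log_pn_noise -- noise log-probabilities of those labels, shape (k,)
    """
    # P(C=1 | x, y) = sigmoid(s(x, y) - log(k * p_n(y)))
    p_real_data = sigmoid(s_data - np.log(k) - log_pn_data)
    p_real_noise = sigmoid(s_noise - np.log(k) - log_pn_noise)
    return -(np.log(p_real_data) + np.sum(np.log(1.0 - p_real_noise)))
```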

Embeddings

Contrastive loss

Loss function for learning embeddings, often used in face verification.

The inputs are pairs of examples x_1 and x_2, where y = 1 if the two examples are similar and 0 if not.

L(x_1,x_2,y) = y d(x_1,x_2)^2 + (1 - y) \max\{0, m - d(x_1,x_2)\}^2

Where x_1 and x_2 are the embeddings of the two examples, m is a hyperparameter called the margin and d(\cdot,\cdot) is a distance function, usually the Euclidean distance.

Intuition

If y = 1 the two examples x_1 and x_2 are similar and we want to minimize the distance d(x_1,x_2). Otherwise (y = 0) we wish to maximize it.

The margin

If y = 0 we want to make d(x_1,x_2) as large as possible to minimize the loss. However, beyond the threshold for classifying the pair as a negative, increasing this distance has no effect on the accuracy. The margin reflects this intuition in the loss function: increasing d(x_1,x_2) beyond m does not reduce the loss any further.

There is no margin for when y = 1. This case is naturally bounded by 0 as the Euclidean distance cannot be negative.
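
A minimal sketch of the contrastive loss for a single pair of embeddings, assuming Euclidean distance and an illustrative margin of 1:

```python
import numpy as np

def contrastive_loss(e1, e2, y, m=1.0):
    """Contrastive loss for embeddings e1, e2 with label y (1 = similar)."""
    d = np.linalg.norm(e1 - e2)  # Euclidean distance
    return y * d ** 2 + (1 - y) * max(0.0, m - d) ** 2

a = np.array([0.0, 0.0])
b = np.array([2.0, 0.0])
print(contrastive_loss(a, b, y=1))  # similar pair: loss = d^2 = 4.0
print(contrastive_loss(a, b, y=0))  # dissimilar pair already beyond the margin: 0.0
```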

Negative sampling

The problem is reframed as a binary classification problem.

L(x_0,x_1,y) = -\left[ y\log \sigma(f(x_0) \cdot f(x_1)) + (1-y)\log \sigma(-f(x_0) \cdot f(x_1)) \right]

where x_0 and x_1 are two examples, f is the learned embedding function and y = 1 if the pair (x_0,x_1) are expected to be similar and y = 0 otherwise. The dot product measures the similarity between the two embeddings.
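
A minimal sketch, assuming f(x_0) and f(x_1) have already been computed as embedding vectors:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(f0, f1, y):
    """Negative-sampling loss for embeddings f0, f1 with label y (1 = similar pair)."""
    s = np.dot(f0, f1)
    return -(y * np.log(sigmoid(s)) + (1 - y) * np.log(sigmoid(-s)))
```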

Noise Contrastive Estimation

A binary classification task is created to disambiguate pairs that are expected to be close to each other from ‘noisy’ examples put together at random.

L(x_0,x_1,y) = -\left[ y\log \sigma(f(x_0) \cdot f(x_1)) + (1-y)\log\left(1-\sigma(f(x_0) \cdot f(x_1))\right) \right]

where x_0 and x_1 are two examples, f is the learned embedding function and y = 1 if the pair (x_0,x_1) are expected to be similar and y = 0 if not (because they have been sampled from the noise distribution). The dot product measures the similarity between the two embeddings and the sigmoid function maps it to (0,1) so it can be interpreted as a prediction from a binary classifier.

This means maximising the predicted probability that real pairs come from the data and that noise pairs do not. Parameter update complexity is linear in the number of noise samples rather than in the size of the vocabulary. The model is improved by having more noise samples than training samples, with around 15 times more being optimal.
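
Since 1 - \sigma(s) = \sigma(-s), the per-pair term is identical to the negative sampling loss above. A sketch of how it might be applied to one real pair plus k noise pairs (the helper names are illustrative, not from the text):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pair_loss(f0, f1, y):
    """Binary classification loss for one pair of embeddings."""
    s = np.dot(f0, f1)
    return -(y * np.log(sigmoid(s)) + (1 - y) * np.log(1.0 - sigmoid(s)))

def nce_embedding_loss(f_x, f_y, noise_embeddings):
    """Loss for one real pair (f_x, f_y) plus k noise embeddings paired with f_x."""
    loss = pair_loss(f_x, f_y, y=1)        # real pair, drawn from the data
    for f_noise in noise_embeddings:       # k pairs drawn from the noise distribution
        loss += pair_loss(f_x, f_noise, y=0)
    return loss
```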

Triplet loss

Used for training embeddings with triplet networks. A triplet is composed of an anchor (a), a positive example (p) and a negative example (n). The positive examples are similar to the anchor and the negative examples are dissimilar.

L(a,p,n) = \sum_i \max\{0, m + d(a_i,p_i) - d(a_i,n_i)\}

Where m is a hyperparameter called the margin and d(\cdot,\cdot) is a distance function, usually the Euclidean distance.

The margin

We want to minimize d(a_i,p_i) and maximize d(a_i,n_i). The former is lower-bounded by 0 but the latter has no upper bound, since distances can be arbitrarily large. However, beyond the threshold needed to classify a pair as a negative, increasing this distance does not improve accuracy, and the loss function should reflect that. The margin does so by ensuring there is no gain from increasing d(a_i,n_i) beyond m + d(a_i,p_i), since the max then sets the loss to 0.
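
A minimal sketch for a single triplet, assuming Euclidean distance and a margin of 1:

```python
import numpy as np

def triplet_loss(a, p, n, m=1.0):
    """Triplet loss for anchor a, positive p and negative n embeddings."""
    d_ap = np.linalg.norm(a - p)
    d_an = np.linalg.norm(a - n)
    return max(0.0, m + d_ap - d_an)

a = np.array([0.0, 0.0])
p = np.array([0.5, 0.0])   # close to the anchor
n = np.array([3.0, 0.0])   # far from the anchor
print(triplet_loss(a, p, n))  # max(0, 1 + 0.5 - 3) = 0: the negative is far enough away
```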

Regression

Huber loss

A loss function used for regression. It is less sensitive to outliers than the squared loss since there is only a linear relationship between the size of the error and the loss beyond \delta.

L(y,\hat{y};\delta) =
        \begin{cases}
            \frac{1}{2}(y - \hat{y})^2, & \text{if } |y - \hat{y}| \leq \delta \\
            \delta\left(|y - \hat{y}| - \frac{1}{2}\delta\right), & \text{otherwise}
        \end{cases}

Where \delta is a hyperparameter.

_images/huber.png
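
A minimal vectorised sketch of the piecewise definition above:

```python
import numpy as np

def huber_loss(y, y_hat, delta=1.0):
    """Huber loss: quadratic for small errors, linear beyond delta."""
    err = np.abs(y - y_hat)
    return np.where(err <= delta,
                    0.5 * err ** 2,
                    delta * (err - 0.5 * delta))
```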

Squared loss

A loss function used for regression.

L(y,\hat{y}) = \sum_i (y_i - \hat{y}_i)^2

_images/squared.png

Disadvantages

Squaring the error means this loss function penalises large errors disproportionately more than small ones. This can be particularly harmful when the data contains outliers. One solution is to use the Huber loss.
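
To make the comparison with the Huber loss concrete, a small sketch (with an assumed \delta = 1) showing how a single outlier dominates the squared loss but contributes only linearly to the Huber loss:

```python
import numpy as np

y     = np.array([1.0, 2.0, 3.0, 100.0])  # the last target is an outlier
y_hat = np.array([1.1, 1.9, 3.2, 4.0])

squared = np.sum((y - y_hat) ** 2)

delta = 1.0
err = np.abs(y - y_hat)
huber = np.sum(np.where(err <= delta, 0.5 * err ** 2, delta * (err - 0.5 * delta)))

print(squared)  # ~9216, dominated by the single outlier
print(huber)    # ~95.5, the outlier's contribution is only linear
```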