Evaluation metrics

Bits per character (BPC)

Used for assessing character-level language models.

Identical to the cross-entropy loss, but uses base 2 for the logarithm.
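
As a minimal sketch of the definition above, BPC can be computed from the probabilities a model assigned to the characters that actually occurred. The function names and the example probabilities are illustrative assumptions, not part of any particular library.

```python
import math

def bits_per_character(true_char_probs):
    """Average negative log2 probability assigned to each true character."""
    return -sum(math.log2(p) for p in true_char_probs) / len(true_char_probs)

def nats_to_bits(cross_entropy_nats):
    """Convert a cross-entropy measured with the natural log into bits."""
    return cross_entropy_nats / math.log(2)

# Hypothetical probabilities the model gave to four observed characters.
probs = [0.5, 0.25, 0.125, 0.5]
bpc = bits_per_character(probs)  # (1 + 2 + 3 + 1) / 4 bits

# The same quantity obtained from a natural-log cross-entropy.
cross_entropy_nats = -sum(math.log(p) for p in probs) / len(probs)
bpc_from_nats = nats_to_bits(cross_entropy_nats)
```

Most deep learning frameworks report the cross-entropy in nats, so dividing by ln 2 is usually all that is needed.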


BLEU

A score for assessing translation tasks. Also used for image captioning. Stands for BiLingual Evaluation Understudy.

Ranges from 0 to 1, where 1 corresponds to being identical to the reference translation. Often uses multiple reference translations.

BLEU: a Method for Automatic Evaluation of Machine Translation, Papineni et al. (2002)


F1-score

The F1-score is the harmonic mean of the precision and the recall.

Using the harmonic mean has the effect that a good F1-score requires both a good precision and a good recall.

F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
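
The effect of the harmonic mean can be seen with a small sketch. The function name and the zero-division convention (returning 0 when both inputs are 0) are assumptions made for illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention when both are zero
    return 2 * precision * recall / (precision + recall)

# Both pairs have an arithmetic mean of 0.5, but the lopsided
# pair is penalised heavily by the harmonic mean.
balanced = f1_score(0.5, 0.5)
lopsided = f1_score(0.9, 0.1)
```

Here `balanced` is 0.5 while `lopsided` is roughly 0.18.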

Intersection over Union (IoU)

An accuracy score for two bounding boxes, where one is the prediction and the other is the target. It is equal to the area of their intersection divided by the area of their union.
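
A minimal sketch of the calculation, assuming axis-aligned boxes given as `(x_min, y_min, x_max, y_max)` tuples. That box format is an assumption; other conventions such as `(x, y, width, height)` are also common.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0

# Two 2x2 boxes overlapping in a 1x1 square: intersection 1, union 7.
overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
```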

Mean Average Precision

The main evaluation metric for object detection.

To calculate it, first define the overlap criterion, for example that the IoU of a predicted and a target bounding box must be greater than 0.5. Since the ground truth is always that the class is present, each predicted box is either a true positive or a false positive, so the precision can be calculated as TP/(TP+FP). The average precision for a class summarises the resulting precision-recall curve, and the mean average precision is its mean over all classes.
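
The true-positive and false-positive counting step can be sketched as follows. This is a simplified greedy matching for a single class and a single image, assuming boxes are `(x_min, y_min, x_max, y_max)` tuples and predictions are already sorted by descending confidence. Benchmark implementations differ in details, so treat the function names and matching rule as illustrative only.

```python
def iou(a, b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - intersection)
    return intersection / union if union > 0 else 0.0

def detection_precision(predictions, targets, iou_threshold=0.5):
    """Precision TP / (TP + FP) under an IoU overlap criterion."""
    matched = set()  # indices of targets already claimed by a prediction
    tp = fp = 0
    for pred in predictions:  # assumed sorted by descending confidence
        best_iou, best_idx = 0.0, None
        for idx, target in enumerate(targets):
            if idx in matched:
                continue
            overlap = iou(pred, target)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou > iou_threshold:
            matched.add(best_idx)
            tp += 1
        else:
            fp += 1
    return tp / (tp + fp) if tp + fp else 0.0

targets = [(0, 0, 10, 10)]
predictions = [(0, 0, 10, 9), (20, 20, 30, 30)]  # one good box, one stray box
precision = detection_precision(predictions, targets)
```

Repeating this count at each confidence cut-off gives the points of the precision-recall curve.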


Perplexity

Used to measure how well a probabilistic model predicts a sample. It is equivalent to the exponential of the cross-entropy loss.
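
A minimal sketch of that relationship, assuming the cross-entropy is measured in nats and averaged over tokens. The function name and inputs are illustrative.

```python
import math

def perplexity(true_token_probs):
    """Exponential of the average negative log-probability of the true tokens."""
    cross_entropy = -sum(math.log(p) for p in true_token_probs) / len(true_token_probs)
    return math.exp(cross_entropy)

# A model that spreads its probability uniformly over 4 options
# is as uncertain as a fair 4-sided die.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
```

Here `uniform` comes out at approximately 4, which is why perplexity is often read as an effective number of equally likely choices.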


Precision

The probability that an example is in fact a positive, given that it was classified as one.

\text{precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}

Where TP is the number of true positives and FP is the number of false positives.


Recall

The probability of classifying an example as a positive, given that it is in fact a positive.

\text{recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}

Where TP is the number of true positives and FN is the number of false negatives.
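
Both quantities can be computed from the same counts. The sketch below assumes binary labels encoded as 0 and 1 and returns 0 when a denominator is empty. The encoding and that convention are assumptions.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0]
# TP = 2, FP = 1, FN = 2
precision, recall = precision_recall(y_true, y_pred)
```

In this example the precision is 2/3 and the recall is 1/2.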


RMSE

Root Mean Squared Error.

\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2}
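
The formula translates directly into a few lines of code. The function name and sample values are illustrative.

```python
import math

def rmse(y_true, y_pred):
    """Square root of the mean squared difference between targets and predictions."""
    n = len(y_true)
    return math.sqrt(sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / n)

# Every prediction is off by exactly 2, so the RMSE is 2.
error = rmse([1, 2, 3, 4], [3, 4, 1, 2])
```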

ROC curve

Plots the true positive rate against the false positive rate for different values of the threshold in a binary classifier.

ROC stands for Receiver Operating Characteristic.

AUC (Area Under the Curve)

Summarises the ROC curve with one number, equal to the integral of the curve.
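
A minimal sketch of both steps: sweep the threshold to trace the ROC curve, then integrate it with the trapezoidal rule. It assumes binary labels encoded as 0 and 1, at least one example of each class, and no tied scores. The function names are illustrative.

```python
def roc_points(y_true, scores):
    """(false positive rate, true positive rate) pairs as the threshold is lowered."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Lowering the threshold admits examples in order of descending score.
    for score, label in sorted(zip(scores, y_true), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under a piecewise-linear curve via the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
area = auc(roc_points(y_true, scores))
```

In this example 8 of the 9 positive-negative pairs are ranked correctly, and the area is 8/9.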


R-squared

A common metric for evaluating regression algorithms that is easier to interpret than the RMSE, but is only strictly valid for linear models.

Intuitively, it is the proportion of the variance in the y variable that is explained by the model. As long as the model contains an intercept term, the R-squared on the training data should be between 0 and 1.

R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}

where \bar{y} = \frac{1}{n} \sum_{i=1}^n y_i is the mean of y.
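
A minimal sketch of the calculation, with illustrative names and values. It assumes the targets are not all identical, since the denominator would otherwise be zero.

```python
def r_squared(y_true, y_pred):
    """1 minus the ratio of residual to total sum of squares."""
    n = len(y_true)
    y_mean = sum(y_true) / n
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))
    ss_tot = sum((y - y_mean) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot

y = [1, 2, 3, 4]
# Residual sum of squares 1, total sum of squares 5.
score = r_squared(y, [1, 2, 3, 5])

# Always predicting the mean explains none of the variance.
baseline = r_squared(y, [2.5, 2.5, 2.5, 2.5])
```

Here `score` is approximately 0.8 and `baseline` is 0.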