Arithmetic mean

The arithmetic mean of a set of inputs \{x_1,x_2,...,x_n\} is:

A(x_1,x_2,...,x_n) = \frac{1}{n}\sum_{i=1}^n x_i
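
As a minimal sketch of the formula above (the function name and the data are illustrative assumptions, not part of the original entry):

```python
def arithmetic_mean(xs):
    """Return (1/n) * sum(x_i) for a non-empty sequence of numbers."""
    if not xs:
        raise ValueError("the arithmetic mean of an empty sequence is undefined")
    return sum(xs) / len(xs)


values = [2.0, 4.0, 9.0]  # illustrative data
mean_value = arithmetic_mean(values)  # (2 + 4 + 9) / 3 = 5.0
```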


Correlation

The correlation between two random variables X and Y is:

\text{Corr}(X,Y) = \frac{\text{Cov}(X,Y)}{\sqrt{V(X)V(Y)}}


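Covariance

The covariance between two random variables X and Y is defined as:

\text{Cov}(X,Y) = E[(X - \mu_x)(Y - \mu_y)]

where \mu_x and \mu_y are the means of X and Y. For n equally weighted paired observations (x_i, y_i) this becomes:

\text{Cov}(X,Y) = \frac{1}{n}\sum_{i=1}^n (x_i - \mu_x)(y_i - \mu_y)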
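
The covariance formula over n paired observations, and the correlation defined earlier, can be sketched as follows. The function names and the data are illustrative assumptions. The variance is computed as Cov(X, X), and the correlation is undefined when either variance is zero.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def covariance(xs, ys):
    """Population covariance: (1/n) * sum((x_i - mu_x) * (y_i - mu_y))."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    mu_x, mu_y = mean(xs), mean(ys)
    return sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / len(xs)


def correlation(xs, ys):
    """Cov(X, Y) / sqrt(V(X) * V(Y)), using V(X) = Cov(X, X)."""
    denominator = math.sqrt(covariance(xs, xs) * covariance(ys, ys))
    if denominator == 0:
        raise ValueError("correlation is undefined when a variance is zero")
    return covariance(xs, ys) / denominator


xs = [1.0, 2.0, 3.0, 4.0]  # illustrative data
ys = [2.0, 4.0, 6.0, 8.0]  # exactly 2 * xs, so the correlation is 1
cov_xy = covariance(xs, ys)    # 2.5
corr_xy = correlation(xs, ys)  # 1.0 up to floating-point rounding
```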

Covariance matrix

A square matrix \Sigma where \Sigma_{ij} = \text{Cov}(X_i,X_j) is the covariance between the variables X_i and X_j.

There are three types of covariance matrix:

  • Full - All entries are specified. Has O(n^2) parameters for n variables.
  • Diagonal - The matrix is diagonal, meaning all off-diagonal entries are zero. Variances can differ across dimensions but there is no interplay between the dimensions. Has O(n) parameters.
  • Spherical - The matrix is equal to the identity matrix multiplied by a constant. This means the variance is the same in all dimensions. Has O(1) parameters.

A valid covariance matrix is always symmetric and positive semi-definite.
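
The three types can be illustrated with a small pure-Python sketch. The helper names and the data are assumptions made for illustration. In particular, using the average of the per-dimension variances as the shared constant of the spherical matrix is one possible choice, not something the definition above prescribes.

```python
def covariance(xs, ys):
    """Population covariance of two equal-length sequences."""
    mu_x = sum(xs) / len(xs)
    mu_y = sum(ys) / len(ys)
    return sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / len(xs)


def full_covariance(columns):
    """columns[i] holds the observations of variable X_i."""
    n = len(columns)
    return [[covariance(columns[i], columns[j]) for j in range(n)] for i in range(n)]


def diagonal_covariance(columns):
    """Keep the per-variable variances and zero every off-diagonal entry."""
    full = full_covariance(columns)
    n = len(full)
    return [[full[i][j] if i == j else 0.0 for j in range(n)] for i in range(n)]


def spherical_covariance(columns):
    """Identity matrix times one shared variance (assumed: the mean variance)."""
    full = full_covariance(columns)
    n = len(full)
    shared = sum(full[i][i] for i in range(n)) / n
    return [[shared if i == j else 0.0 for j in range(n)] for i in range(n)]


def is_symmetric(matrix, tol=1e-12):
    n = len(matrix)
    return all(abs(matrix[i][j] - matrix[j][i]) <= tol
               for i in range(n) for j in range(n))


columns = [[1.0, 2.0, 3.0, 4.0], [2.0, 0.0, 6.0, 4.0]]  # illustrative data
full = full_covariance(columns)            # [[1.25, 1.5], [1.5, 5.0]]
diagonal = diagonal_covariance(columns)    # [[1.25, 0.0], [0.0, 5.0]]
spherical = spherical_covariance(columns)  # [[3.125, 0.0], [0.0, 3.125]]
```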

Geometric mean

The geometric mean of a set of inputs \{x_1,x_2,...,x_n\} is:

G(x_1,x_2,...,x_n) = \sqrt[\leftroot{-2}\uproot{2}n]{x_1x_2...x_n}

Only applicable to positive numbers since otherwise it may involve taking the root of a negative number.
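
A minimal sketch of the geometric mean, assuming strictly positive inputs. It averages the logarithms and exponentiates the result, which is mathematically equivalent to the n-th root of the product but avoids overflow or underflow when many values are multiplied. The function name and the data are illustrative.

```python
import math


def geometric_mean(xs):
    """n-th root of the product of xs, computed via the mean of the logs."""
    if not xs:
        raise ValueError("the geometric mean of an empty sequence is undefined")
    if any(x <= 0 for x in xs):
        raise ValueError("this sketch only accepts strictly positive inputs")
    return math.exp(sum(math.log(x) for x in xs) / len(xs))


values = [1.0, 3.0, 9.0]    # illustrative data, product = 27
g = geometric_mean(values)  # cube root of 27, i.e. 3 up to rounding
```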

Harmonic mean

The harmonic mean for a set of inputs \{x_1,x_2,...,x_n\} is:

H(x_1,x_2,...,x_n) = \frac{n}{\sum_{i=1}^n \frac{1}{x_i}}

Cannot be computed if one of the numbers is zero since that would necessitate dividing by zero.

Used for the F1-score, which is the harmonic mean of the precision and the recall.
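
The harmonic mean and its use for the F1-score can be sketched as below. The function name and the precision and recall values are illustrative assumptions. The sketch raises an error on a zero input rather than choosing a convention for that case.

```python
def harmonic_mean(xs):
    """n divided by the sum of the reciprocals of xs."""
    if not xs:
        raise ValueError("the harmonic mean of an empty sequence is undefined")
    if any(x == 0 for x in xs):
        raise ValueError("the harmonic mean is undefined when an input is zero")
    return len(xs) / sum(1.0 / x for x in xs)


precision, recall = 0.5, 1.0             # illustrative values
f1 = harmonic_mean([precision, recall])  # 2 / (1/0.5 + 1/1.0) = 2/3
```

Note that the result (about 0.667) sits closer to the smaller of the two inputs than their arithmetic mean (0.75) does.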


When the error of a model is correlated with one or more of the features.


Moving average

A moving average smooths a sequence of observations.

Exponential moving average (EMA)

A type of moving average in which the influence of past observations on the current average diminishes exponentially with time.

m_t = \alpha m_{t-1} + (1 - \alpha) x_t

m_t is the moving average at time t, x_t is the input at time t and 0 < \alpha < 1 is a hyperparameter. As \alpha decreases, the moving average weights recent observations more strongly.
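
The update rule can be sketched as follows, assuming the average is initialised to m_0 = 0. The function name and the data are illustrative.

```python
def exponential_moving_average(xs, alpha, m0=0.0):
    """Return [m_1, ..., m_n] with m_t = alpha * m_{t-1} + (1 - alpha) * x_t."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must satisfy 0 < alpha < 1")
    m = m0
    averages = []
    for x in xs:
        m = alpha * m + (1.0 - alpha) * x
        averages.append(m)
    return averages


observations = [1.0, 1.0, 1.0]  # illustrative constant signal
ema = exponential_moving_average(observations, alpha=0.5)
# ema == [0.5, 0.75, 0.875]: the average climbs towards 1 from m_0 = 0
```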

Bias correction

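If we initialise the EMA to zero (m_0 = 0) it will be heavily biased towards zero for the first few steps, because the weights placed on the observations up to time t sum to only 1 - \alpha^t. To correct this we can divide the average by that sum, keeping m_t itself as the uncorrected running value:

\hat{m}_t = \frac{m_t}{1 - \alpha^t} = \frac{1}{1 - \alpha^t}(\alpha m_{t-1} + (1 - \alpha) x_t)

This has the same effect as starting with \alpha close to 0 and gradually increasing it to its final value.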

See Adam: A Method for Stochastic Optimization, Kingma and Ba (2015) for an example of this bias correction being used in practice.
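
A minimal sketch of the correction in the style used by Adam: the uncorrected average is kept as the running state, and the division by 1 - \alpha^t is applied only when the value is read out. The function name and the data are illustrative assumptions.

```python
def bias_corrected_ema(xs, alpha):
    """Return the bias-corrected averages m_t / (1 - alpha**t), with m_0 = 0."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must satisfy 0 < alpha < 1")
    m = 0.0  # uncorrected running average; the recursion uses this value
    corrected = []
    for t, x in enumerate(xs, start=1):
        m = alpha * m + (1.0 - alpha) * x
        corrected.append(m / (1.0 - alpha ** t))
    return corrected


observations = [1.0, 1.0, 1.0]  # illustrative constant signal
corrected = bias_corrected_ema(observations, alpha=0.5)
# Uncorrected values would be 0.5, 0.75, 0.875; after the correction each
# entry equals 1.0, so the initial pull towards zero is removed.
```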

Point estimate

An estimate of a parameter, such as the mean of a population. It summarises the belief about this quantity with a single number, in contrast with a distribution, which describes the belief over a range of possible values.


Skewness

Measures the asymmetry of a probability distribution. For a random variable X with mean \mu and standard deviation \sigma it is:

\text{Skew}(X) = E\bigg[\bigg(\frac{X - \mu}{\sigma}\bigg)^3\bigg]
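
For a finite set of equally likely values the expectation becomes an average, which can be sketched as below. The function name and the data are illustrative assumptions, and \sigma is taken to be the population standard deviation (dividing by n).

```python
import math


def skewness(xs):
    """Average of ((x - mu) / sigma)**3 over equally weighted values."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    if sigma == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return sum(((x - mu) / sigma) ** 3 for x in xs) / n


symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]     # illustrative symmetric data
right_tailed = [1.0, 1.0, 1.0, 1.0, 6.0]  # illustrative data with a long right tail
skew_symmetric = skewness(symmetric)      # 0 up to floating-point rounding
skew_right = skewness(right_tailed)       # 1.5: positive for a right tail
```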

Standard deviation

The square root of the variance. The formula is:

\sigma = \sqrt{E[(X-\mu)^2]}

where \mu is the mean of X.

Sample standard deviation

s = \sqrt{\frac{1}{n-1} \sum_{i=1}^n(x_i-\bar{x})^2}

where \bar{x} is the sample mean. Note that s is a biased estimator of the population standard deviation, even though s^2 is an unbiased estimator of the variance. Unbiased estimators of the standard deviation exist, but each applies only to particular population distributions.
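
The bias can be seen exactly on a tiny example. The sketch below assumes an illustrative two-point population in which each value is equally likely, and enumerates every sample of size 2 drawn with replacement. Averaged over all samples, s^2 recovers the population variance, while s falls short of the population standard deviation.

```python
import itertools
import statistics

population = [0.0, 2.0]                # illustrative population, values equally likely
sigma = statistics.pstdev(population)  # population standard deviation = 1.0

# Every equally likely sample of size 2, drawn with replacement.
samples = list(itertools.product(population, repeat=2))

# statistics.variance and statistics.stdev use the 1 / (n - 1) normalisation.
mean_s_squared = sum(statistics.variance(s) for s in samples) / len(samples)
mean_s = sum(statistics.stdev(s) for s in samples) / len(samples)

# mean_s_squared is 1.0 (equal to sigma ** 2), whereas mean_s is about 0.707,
# which is below sigma = 1.0.
```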


Variance

The variance of X=\{x_1, ..., x_n\} is:

V(X) = E[(X-\mu)^2]

where \mu is the mean of X.

For a population of n equally likely values the formula can also be written as:

V(X) = \frac{1}{n}\sum_{i=1}^n (x_i - \mu)^2

Sample variance

When it is impractical to compute the variance over the entire population, we can take a sample instead and compute the sample variance. The formula for the unbiased sample variance is:

s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2

where \bar{x} is the mean of the sample.
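
The difference between the two normalisations can be sketched as follows. The function names and the data are illustrative assumptions. Python's standard library offers the same two quantities as statistics.pvariance and statistics.variance.

```python
import statistics


def population_variance(xs):
    """(1/n) * sum((x_i - mu)**2), treating xs as the whole population."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)


def sample_variance(xs):
    """(1/(n-1)) * sum((x_i - x_bar)**2), treating xs as a sample."""
    if len(xs) < 2:
        raise ValueError("the sample variance needs at least two observations")
    x_bar = sum(xs) / len(xs)
    return sum((x - x_bar) ** 2 for x in xs) / (len(xs) - 1)


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative data, mean = 5
pop_var = population_variance(data)  # 32 / 8 = 4.0
samp_var = sample_variance(data)     # 32 / 7, roughly 4.571

# The standard library functions use the same two normalisations.
lib_pop_var = statistics.pvariance(data)
lib_samp_var = statistics.variance(data)
```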