Information theory and complexity

Akaike Information Criterion (AIC)

A measure of the quality of a model that combines goodness of fit with a penalty on the number of parameters. Smaller AIC values indicate a better model. The formula is:

\text{AIC}(x,\theta) = 2|\theta| - 2 \ln L(\theta,x)

Where x is the data, |\theta| is the number of parameters and L is the likelihood function evaluated at the maximum likelihood estimate.
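As a minimal sketch with hypothetical log-likelihood values, two candidate models can be compared by AIC:

```python
def aic(num_params, log_likelihood):
    """Akaike Information Criterion: 2|theta| - 2 ln L."""
    return 2 * num_params - 2 * log_likelihood

# Hypothetical numbers: model B fits slightly better but uses many
# more parameters, so model A wins on AIC (smaller is better).
aic_a = aic(3, -120.0)   # 246.0
aic_b = aic(10, -118.0)  # 256.0
best = "A" if aic_a < aic_b else "B"
```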


Capacity

The capacity of a machine learning model describes the complexity of the functions it can learn. A model that can learn highly complex functions is said to have a high capacity; one that can only learn simple functions has a low capacity.


Entropy

The entropy of a discrete probability distribution p is:

H(X) = -\sum_{x \in X} p(x) \log p(x)
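The formula can be computed directly for a small distribution (a sketch, using base-2 logs so the result is in bits):

```python
import math

def entropy(p, base=2):
    """Shannon entropy of a discrete distribution, skipping zero-probability outcomes."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

h_fair = entropy([0.5, 0.5])     # a fair coin: 1 bit
h_certain = entropy([1.0, 0.0])  # a certain outcome: 0 bits
```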

Joint entropy

H(X,Y) = -\sum_{x \in X} \sum_{y \in Y} p(x,y) \log p(x,y)

Conditional entropy

H(X|Y) = -\sum_{x \in X} \sum_{y \in Y} p(x,y) \log p(x|y)
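A sketch with a hypothetical joint distribution, computing both quantities and the marginal needed for the conditional entropy:

```python
import math

def h(ps):
    """Entropy in bits of a list of probabilities."""
    return -sum(p * math.log(p, 2) for p in ps if p > 0)

# Hypothetical joint distribution p(x, y) as {(x, y): probability}.
pxy = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.4, (1, 1): 0.1}

joint = h(pxy.values())  # H(X, Y)

# Marginal p(y), then H(X|Y) = -sum p(x,y) log p(x|y) with p(x|y) = p(x,y)/p(y).
py = {}
for (x, y), p in pxy.items():
    py[y] = py.get(y, 0.0) + p
cond = -sum(p * math.log(p / py[y], 2) for (x, y), p in pxy.items() if p > 0)
```

The chain rule H(X,Y) = H(Y) + H(X|Y) then holds up to floating-point error.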

Finite-sample expressivity

The ability of a model to memorize the training set.

Fisher Information Matrix

An N \times N matrix, where N is the number of parameters in a model, given by the expected outer product of the score (the vector of first-order partial derivatives of the log-likelihood).

The matrix is defined as:

I(\theta)_{ij} = E[\frac{\partial \log f(X;\theta)}{\partial \theta_i} \frac{\partial \log f(X;\theta)}{\partial \theta_j}|\theta]

Under mild regularity conditions, the Fisher Information Matrix is equal to the negative expected Hessian of the log-likelihood.
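A sketch for a Bernoulli(p) model, where the Fisher information is known in closed form to be 1/(p(1-p)); here it is computed as the expected squared score:

```python
def score(x, p):
    """d/dp of log f(x; p) = x log p + (1 - x) log(1 - p)."""
    return x / p - (1 - x) / (1 - p)

def fisher_info(p):
    """Expectation of the squared score over x in {0, 1}."""
    return p * score(1, p) ** 2 + (1 - p) * score(0, p) ** 2

fi = fisher_info(0.3)  # equals 1 / (0.3 * 0.7)
```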

Information bottleneck

\min_{p(t|x)} I(X,T) - \beta I(T,Y)

Where I(X,T) and I(T,Y) represent the mutual information between their respective arguments. X is the input features, Y is the labels and T is a representation of the input such as the activations of a hidden layer in a neural network.

When the expression is minimised, the representation T retains little information about the input (small I(X,T), i.e. strong compression) while keeping as much information about the labels as possible (large I(T,Y)), so it remains useful for prediction. The trade-off between the two terms is controlled by \beta.
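A sketch evaluating the objective for a hypothetical discrete joint p(x,y) and a hand-picked deterministic encoder T = t(x):

```python
import math

def mi(pab):
    """Mutual information (bits) from a joint distribution {(a, b): p}."""
    pa, pb = {}, {}
    for (a, b), p in pab.items():
        pa[a] = pa.get(a, 0.0) + p
        pb[b] = pb.get(b, 0.0) + p
    return sum(p * math.log(p / (pa[a] * pb[b]), 2)
               for (a, b), p in pab.items() if p > 0)

# Hypothetical joint p(x, y) and an encoder mapping each x to a cluster t.
pxy = {(0, 0): 0.4, (1, 0): 0.1, (2, 1): 0.1, (3, 1): 0.4}
encoder = {0: 0, 1: 0, 2: 1, 3: 1}  # T keeps exactly the label-relevant bit

pxt, pty = {}, {}
for (x, y), p in pxy.items():
    t = encoder[x]
    pxt[(x, t)] = pxt.get((x, t), 0.0) + p
    pty[(t, y)] = pty.get((t, y), 0.0) + p

beta = 2.0
objective = mi(pxt) - beta * mi(pty)  # compression term minus beta * prediction term
```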

Jensen-Shannon divergence

Symmetric version of the KL-divergence.

JS(P,Q) = \frac{1}{2}(D_{KL}(P||M) + D_{KL}(Q||M))

where M is a mixture distribution equal to \frac{1}{2}(P + Q)
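A sketch computing the Jensen-Shannon divergence from the KL-divergence via the mixture M:

```python
import math

def kl(p, q):
    """D_KL(P||Q) in bits for discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi, 2) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture M = (P + Q) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p, q = [0.8, 0.2], [0.3, 0.7]
```

Unlike the KL-divergence, js(p, q) equals js(q, p).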

See also: Wasserstein distance

Kullback-Leibler divergence

A measure of the difference between two probability distributions. Also known as the relative entropy. In the usual use case one distribution is the true distribution of the data and the other is a model of it.

For discrete distributions it is given as:

D_{KL}(P||Q) = -\sum_i P_i \log \frac{Q_i}{P_i}

Note that if a point with P_i > 0 lies outside the support of Q (Q_i = 0), the KL-divergence is infinite, since \log \frac{Q_i}{P_i} diverges. This can be dealt with by adding some random noise to Q. However, this introduces a degree of error, and a lot of noise is often needed for convergence when using the KL-divergence for MLE. The Wasserstein distance, which also measures the distance between two distributions, does not have this problem.

The KL-divergence is not symmetric.

A KL-divergence of 0 means the distributions are identical. As the distributions become more different the divergence grows; it is always non-negative.
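A sketch for discrete distributions with full support, also showing the asymmetry:

```python
import math

def kl(p, q):
    """D_KL(P||Q) in nats; assumes Q_i > 0 wherever P_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p, q = [0.9, 0.1], [0.5, 0.5]
forward = kl(p, q)  # D_KL(P||Q)
reverse = kl(q, p)  # D_KL(Q||P), generally a different value
```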

Mutual information

Measures the dependence between two random variables.

I(X,Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}

If the variables are independent, I(X,Y) = 0. If they are completely dependent (each is a deterministic function of the other), I(X,Y) = H(X) = H(Y).
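A sketch checking the two limiting cases with hypothetical joint distributions:

```python
import math

def mutual_info(pxy):
    """I(X;Y) in bits from a joint distribution {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in pxy.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * math.log(p / (px[x] * py[y]), 2)
               for (x, y), p in pxy.items() if p > 0)

# Independent: the joint factorises, so I(X,Y) = 0.
indep = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
# Completely dependent (Y = X): I(X,Y) = H(X) = 1 bit.
dep = {(0, 0): 0.5, (1, 1): 0.5}
```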

Rademacher complexity

A measure of the capacity of a class of functions: roughly, the expected maximum correlation between the outputs of functions in the class and random \pm 1 labels. A class that can fit random noise well has a high Rademacher complexity.

Total variation distance

Like the Kullback-Leibler divergence, a way of measuring the difference between two probability distributions. For discrete distributions it is half the L^1 distance between them:

TV(P,Q) = \frac{1}{2} \sum_i |P_i - Q_i|

Unlike the KL-divergence it is symmetric and bounded between 0 and 1.
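A sketch for discrete distributions, computed as half the L^1 distance between the probability vectors:

```python
def total_variation(p, q):
    """TV(P, Q) = (1/2) * sum_i |P_i - Q_i|."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

tv = total_variation([0.8, 0.2], [0.3, 0.7])  # 0.5, up to floating point
```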

See also: Wasserstein distance

VC dimension

The Vapnik–Chervonenkis dimension is a measure of capacity: the size of the largest set of points that the model can shatter, i.e. realise every possible assignment of labels to those points.
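As a sketch, threshold classifiers h_t(x) = 1 if x >= t on the real line shatter any single point but no pair of points, so their VC dimension is 1. A brute-force check over a grid of candidate thresholds:

```python
def shatters(points, thresholds):
    """True if threshold classifiers realise every labelling of the points."""
    labellings = {tuple(1 if x >= t else 0 for x in points) for t in thresholds}
    return len(labellings) == 2 ** len(points)

thresholds = [i / 10 for i in range(-20, 21)]  # grid of candidate t values

one_point = shatters([0.5], thresholds)        # True: VC dimension >= 1
two_points = shatters([0.3, 0.7], thresholds)  # False: labelling (1, 0) is unrealisable
```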