Information theory and complexity¶

Akaike Information Criterion (AIC)¶

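A measure of the quality of a model that combines accuracy with the number of parameters. Smaller AIC values mean the model is better. The formula is:

$$\mathrm{AIC} = 2k - 2 \ln L(\hat{\theta}; X)$$

Where $X$ is the data, $L$ is the likelihood function, $\hat{\theta}$ is the maximum likelihood estimate of the parameters and $k$ is the number of parameters.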
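As a minimal sketch, assuming the standard form AIC = 2k - 2 ln L evaluated at the fitted parameters, the snippet below compares two Gaussian models on a small made-up sample. The data values and the helper names `gaussian_log_likelihood` and `aic` are illustrative assumptions, not part of any library.

```python
import math

def gaussian_log_likelihood(data, mean, variance):
    """Log likelihood of i.i.d. data under a normal distribution."""
    n = len(data)
    squared_error = sum((x - mean) ** 2 for x in data)
    return -0.5 * n * math.log(2 * math.pi * variance) - squared_error / (2 * variance)

def aic(log_likelihood, num_params):
    """AIC = 2k - 2 ln L, with ln L evaluated at the fitted parameters."""
    return 2 * num_params - 2 * log_likelihood

data = [2.1, 1.9, 2.4, 2.0, 1.6, 2.2, 1.8, 2.0]  # made-up sample
n = len(data)

# Model A: mean and variance both fitted by maximum likelihood (k = 2).
mle_mean = sum(data) / n
mle_var = sum((x - mle_mean) ** 2 for x in data) / n
aic_fitted = aic(gaussian_log_likelihood(data, mle_mean, mle_var), num_params=2)

# Model B: mean fixed at 0, only the variance fitted (k = 1).
var_zero_mean = sum(x ** 2 for x in data) / n
aic_zero_mean = aic(gaussian_log_likelihood(data, 0.0, var_zero_mean), num_params=1)

print(aic_fitted, aic_zero_mean)
```

Model B has fewer parameters, but its fit is so much worse on this sample that Model A still gets the smaller AIC.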

Capacity¶

The capacity of a machine learning model describes the complexity of the functions it can learn. If the model can learn highly complex functions it is said to have a high capacity. If it can only learn simple functions it has a low capacity.

Entropy¶

The entropy of a discrete probability distribution $P$ is:

$$H(P) = -\sum_{x} P(x) \log P(x)$$
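A short sketch of the usual Shannon entropy, H = -sum of P(x) log P(x), for a distribution given as a list of probabilities. The function name `entropy` and the choice of base 2 (bits) as the default are assumptions made for illustration.

```python
import math

def entropy(probs, base=2):
    """Shannon entropy of a discrete distribution; 0 * log(0) is treated as 0."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

fair_coin = entropy([0.5, 0.5])        # maximal uncertainty over two outcomes
biased_coin = entropy([0.9, 0.1])      # more predictable, so lower entropy
certain = entropy([1.0, 0.0])          # no uncertainty at all

print(fair_coin, biased_coin, certain)
```

The more evenly the probability mass is spread, the higher the entropy. A uniform distribution over n outcomes reaches the maximum of log n.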

Finite-sample expressivity¶

The ability of a model to memorize the training set.

Fisher Information Matrix¶

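An $N \times N$ matrix, where $N$ is the number of parameters in a model, that can be written in terms of second-order partial derivatives of the log likelihood.

The matrix is defined as:

$$I(\theta)_{ij} = \mathbb{E}_{x \sim f(x;\theta)}\left[\frac{\partial \log f(x;\theta)}{\partial \theta_i} \frac{\partial \log f(x;\theta)}{\partial \theta_j}\right]$$

where $f(x;\theta)$ is the likelihood of a data point $x$ under the model with parameters $\theta$.

Under standard regularity conditions, the Fisher Information Matrix is equal to the negative expected Hessian of the log likelihood:

$$I(\theta)_{ij} = -\mathbb{E}_{x \sim f(x;\theta)}\left[\frac{\partial^2 \log f(x;\theta)}{\partial \theta_i \, \partial \theta_j}\right]$$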
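One way to make this concrete is a Monte Carlo estimate of the expected outer product of the score (the gradient of the log likelihood) for a model where the answer is known. The sketch below assumes a univariate normal model with parameters (mu, sigma), whose Fisher Information Matrix has the closed form [[1/sigma^2, 0], [0, 2/sigma^2]]. The function names, sample count and seed are illustrative choices.

```python
import random

def score(x, mu, sigma):
    """Gradient of log N(x; mu, sigma^2) with respect to (mu, sigma)."""
    d_mu = (x - mu) / sigma**2
    d_sigma = -1.0 / sigma + (x - mu) ** 2 / sigma**3
    return (d_mu, d_sigma)

def estimate_fisher(mu, sigma, num_samples=100_000, seed=0):
    """Monte Carlo estimate of E[score * score^T] with x drawn from the model."""
    rng = random.Random(seed)
    fisher = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(num_samples):
        g = score(rng.gauss(mu, sigma), mu, sigma)
        for i in range(2):
            for j in range(2):
                fisher[i][j] += g[i] * g[j] / num_samples
    return fisher

sigma = 2.0
fisher = estimate_fisher(mu=1.0, sigma=sigma)
exact = [[1 / sigma**2, 0.0], [0.0, 2 / sigma**2]]  # closed form for this model

print(fisher)
print(exact)
```

The estimate only approaches the closed form as the number of samples grows, so the two printed matrices should agree to a couple of decimal places rather than exactly.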

Information bottleneck¶

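An objective for learning a compressed representation of the input that remains useful for prediction:

$$\min_{T} \; I(X, T) - \beta I(T, Y)$$

Where $I(X, T)$ and $I(T, Y)$ represent the mutual information between their respective arguments and $\beta$ is a hyperparameter that controls the trade-off between the two terms. $X$ is the input features, $Y$ is the labels and $T$ is a representation of the input such as the activations of a hidden layer in a neural network.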

When the expression is minimised there is very little mutual information between the compressed representation and the input. There is a lot of mutual information between the representation and the output, meaning it is useful for prediction.

Jensen-Shannon divergence¶

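Symmetric version of the KL-divergence.

$$\mathrm{JSD}(P \| Q) = \frac{1}{2} D_{KL}(P \| M) + \frac{1}{2} D_{KL}(Q \| M)$$

where $M$ is a mixture distribution equal to $\frac{1}{2}(P + Q)$.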
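A minimal sketch of the Jensen-Shannon divergence for discrete distributions, assuming the standard definition as the average of the KL-divergences from each distribution to their equal-weight mixture M = (P + Q) / 2. The function names and example distributions are illustrative.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in nats; assumes Q(x) > 0 wherever P(x) > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Average KL-divergence of P and Q to their equal-weight mixture M."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

p = [0.9, 0.1, 0.0]
q = [0.0, 0.2, 0.8]

print(js_divergence(p, q), js_divergence(q, p))   # same value in both directions
print(js_divergence([1.0, 0.0], [0.0, 1.0]))      # disjoint supports give log(2)
```

Because the mixture M is positive wherever either P or Q is, the divergence stays finite even when the two distributions have different supports. With natural logarithms it is bounded above by log 2.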

Kullback-Leibler divergence¶

A measure of the difference between two probability distributions. Also known as the relative entropy. In the usual use case one distribution is the true distribution of the data and the other is a model of it.

For discrete distributions it is given as:

$$D_{KL}(P \| Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$

Note that if a point $x$ with $P(x) > 0$ is outside the support of $Q$ ($Q(x) = 0$), the KL-divergence will explode since $\log Q(x) = \log 0$ is undefined. This can be dealt with by adding some random noise to $Q$. However, this introduces a degree of error and a lot of noise is often needed for convergence when using the KL-divergence for MLE. The Wasserstein distance, which also measures the distance between two distributions, does not have this problem.

The KL-divergence is not symmetric.

A KL-divergence of 0 means the distributions are identical. The KL-divergence is never negative, and as the distributions become more different the divergence becomes larger.
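The properties above can be checked with a small sketch for discrete distributions, assuming the usual definition D_KL(P || Q) = sum of P(x) log(P(x) / Q(x)). The function name and example distributions are illustrative, and returning infinity when P has mass outside the support of Q is one common convention.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in nats for two discrete distributions over the same outcomes."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # terms with P(x) = 0 contribute nothing
        if qi == 0:
            return math.inf  # P puts mass outside the support of Q
        total += pi * math.log(pi / qi)
    return total

p = [0.5, 0.4, 0.1]
q = [0.8, 0.1, 0.1]

print(kl_divergence(p, q), kl_divergence(q, p))  # differ: not symmetric
print(kl_divergence(p, p))                       # identical distributions give 0
print(kl_divergence(p, [0.5, 0.5, 0.0]))         # Q(x) = 0 where P(x) > 0 gives inf
```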

Mutual information¶

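Measures the dependence between two random variables.

$$I(X, Y) = \sum_{x} \sum_{y} P(x, y) \log \frac{P(x, y)}{P(x) P(y)}$$

If the variables are independent $I(X, Y) = 0$. If they are completely dependent $I(X, Y) = H(X) = H(Y)$, where $H$ is the entropy.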
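The two limiting cases can be illustrated with a small sketch that computes mutual information from a joint probability table, assuming the standard discrete definition I(X, Y) = sum of P(x, y) log(P(x, y) / (P(x) P(y))). The function name and the two example tables are illustrative.

```python
import math

def mutual_information(joint):
    """I(X, Y) in bits from a joint probability table joint[x][y]."""
    px = [sum(row) for row in joint]        # marginal distribution of X
    py = [sum(col) for col in zip(*joint)]  # marginal distribution of Y
    return sum(
        pxy * math.log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

independent = [[0.25, 0.25], [0.25, 0.25]]  # P(x, y) = P(x) P(y)
dependent = [[0.5, 0.0], [0.0, 0.5]]        # Y is always equal to X

print(mutual_information(independent))  # 0 bits
print(mutual_information(dependent))    # 1 bit, the entropy of a fair coin
```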