Information theory and complexity¶
Akaike Information Criterion (AIC)¶
A measure of the quality of a model that combines accuracy with the number of parameters. Smaller AIC values mean the model is better. The formula is:
Where is the data and is the likelihood function.
The capacity of a machine learning model describes the complexity of the functions it can learn. If the model can learn highly complex functions it is said to have a high capacity. If it can only learn simple functions it has a low capacity.
The entropy of a discrete probability distribution is:
Measures the entropy of a joint distribution .
The joint entropy for a discrete distribution over two variables, X and Y is:
Measures the entropy of the distribution of one random variable given that the value of another is known.
The conditional entropy for a discrete distribution is given by:
The ability of a model to memorize the training set.
Fisher Information Matrix¶
An matrix of second-order partial derivatives where is the number of parameters in a model.
The matrix is defined as:
The Fisher Information Matrix is equal to the negative expected Hessian of the log likelihood.
An objective for training compressed representations that have predictive power.
Where and represent the mutual information between their respective arguments. is the input features, is the labels and is a representation of the input such as the activations of a hidden layer in a neural network. is a hyperparameter controlling the trade-off between compression and predictive power.
When the expression is minimised there is very little mutual information between the compressed representation and the input. At the same time, there is a lot of mutual information between the representation and the output, meaning the representation is useful for prediction.
A symmetric version of the KL-divergence. This means that , which is not true for the KL-divergence.
where is a mixture distribution equal to
A measure of the difference between two probability distributions. Also known as the KL-divergence and the relative entropy. In the usual use case one distribution is the true distribution of the data and the other is an approximation of it.
For discrete distributions it is given as:
If a point is outside the support of Q (), the KL-divergence will explode since is undefined. This can be dealt with by adding some random noise to Q. However, this introduces a degree of error and a lot of noise is often needed for convergence when using the KL-divergence for MLE. The Wasserstein distance, which also measures the distance between two distributions, does not have this problem.
Measures the dependence between two random variables.
If the variables are independent . If they are completely dependent .
Total variation distance¶
Like the Kullback-Leibler divergence, it is also a way of measuring the difference between two different probability distributions.
For two discrete distributions the total variation distance is given by: