Information theory and complexity

Akaike Information Criterion (AIC)

A measure of the quality of a model that trades off goodness of fit against the number of parameters. When comparing models fitted to the same data, smaller AIC values mean the model is better. The formula is:

\text{AIC}(x,\theta) = 2|\theta| - 2 \ln L(\theta,x)

Where x is the data, L is the likelihood function and |\theta| is the number of parameters in the model.
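
As a minimal sketch of how the formula can be applied, the snippet below fits two toy Gaussian models to the same data and compares their AIC values. NumPy is assumed, and the data and models are illustrative assumptions, not part of the original definition.

```python
import numpy as np

def aic(log_likelihood, n_params):
    # AIC = 2|theta| - 2 ln L
    return 2 * n_params - 2 * log_likelihood

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=500)  # toy data

# Model 1: Gaussian with unknown mean and fixed unit variance (1 parameter).
mu = x.mean()
ll1 = np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (x - mu) ** 2)

# Model 2: Gaussian with unknown mean and variance (2 parameters).
var = x.var()
ll2 = np.sum(-0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mu) ** 2 / var)

print(aic(ll1, 1), aic(ll2, 2))  # the smaller AIC indicates the preferred model
```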

Capacity

The capacity of a machine learning model describes the complexity of the functions it can learn. If the model can learn highly complex functions it is said to have a high capacity. If it can only learn simple functions it has a low capacity.

Entropy

Measures the average uncertainty of a random variable. For a discrete probability distribution p(x) the entropy is:

H(X) = -\sum_{x \in X} p(x) \log p(x)
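
A minimal sketch of the formula, assuming NumPy and using the natural logarithm (so the result is in nats); the probability vectors are illustrative.

```python
import numpy as np

def entropy(p):
    # H(X) = -sum_x p(x) log p(x); terms with p(x) = 0 contribute nothing.
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

print(entropy([0.5, 0.5]))  # a fair coin: log 2 ≈ 0.693
print(entropy([1.0, 0.0]))  # a certain outcome: 0
```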

Joint entropy

Measures the entropy of a joint distribution p(x,y).

The joint entropy for a discrete distribution over two variables, X and Y is:

H(X,Y) = -\sum_{x \in X} \sum_{y \in Y} p(x,y) \log p(x,y)
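
A short sketch of the same computation for a joint distribution stored as a 2-D probability table (NumPy assumed; the table is an illustrative assumption).

```python
import numpy as np

def joint_entropy(pxy):
    # H(X,Y) = -sum_{x,y} p(x,y) log p(x,y)
    pxy = np.asarray(pxy, dtype=float)
    nz = pxy > 0
    return -np.sum(pxy[nz] * np.log(pxy[nz]))

pxy = np.array([[0.25, 0.25],
                [0.25, 0.25]])  # two independent fair coins
print(joint_entropy(pxy))       # log 4 ≈ 1.386
```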

Conditional entropy

Measures the entropy of the distribution of one random variable given that the value of another is known.

The conditional entropy for a discrete distribution is given by:

H(X|Y) = -\sum_{x \in X} \sum_{y \in Y} p(x,y) \log p(x|y)
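
Using the same joint-table representation, a sketch of the conditional entropy: p(x|y) is obtained from the joint table and weighted by p(x,y). The table is an illustrative assumption.

```python
import numpy as np

def conditional_entropy(pxy):
    # H(X|Y) = -sum_{x,y} p(x,y) log p(x|y), where p(x|y) = p(x,y) / p(y).
    # Rows of pxy are indexed by x, columns by y.
    pxy = np.asarray(pxy, dtype=float)
    py = pxy.sum(axis=0, keepdims=True)  # marginal p(y)
    px_given_y = np.divide(pxy, py, out=np.zeros_like(pxy), where=py > 0)
    nz = pxy > 0
    return -np.sum(pxy[nz] * np.log(px_given_y[nz]))

pxy = np.array([[0.5, 0.0],
                [0.0, 0.5]])     # X and Y are always equal
print(conditional_entropy(pxy))  # 0: knowing Y determines X
```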

Finite-sample expressivity

The ability of a model to memorize the training set.

Fisher Information Matrix

An N \times N matrix, where N is the number of parameters in a model, that measures how much information the observed data carries about the parameters. Each entry is the expected product of first-order partial derivatives of the log-likelihood.

The matrix is defined as:

I(\theta)_{ij} = E[\frac{\partial \log f(X;\theta)}{\partial \theta_i} \frac{\partial \log f(X;\theta)}{\partial \theta_j}|\theta]

The Fisher Information Matrix is equal to the negative expected Hessian of the log likelihood.
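
As an illustrative sketch, the expectation in the definition can be estimated by Monte Carlo. For a Bernoulli(θ) model the matrix has a single entry with the known closed form 1/(θ(1-θ)), which the estimate should approach; the model and sample size are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3
x = rng.binomial(1, theta, size=200_000)

# Score: d/dtheta log f(x; theta) for f(x; theta) = theta^x (1 - theta)^(1 - x)
score = x / theta - (1 - x) / (1 - theta)

fisher_mc = np.mean(score ** 2)            # E[(d log f / d theta)^2]
fisher_exact = 1.0 / (theta * (1 - theta))
print(fisher_mc, fisher_exact)             # the two values should be close
```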

Information bottleneck

An objective for training compressed representations that have predictive power.

\min I(X,T) - \beta I(T,Y)

Where I(X,T) and I(T,Y) represent the mutual information between their respective arguments. X is the input features, Y is the labels and T is a representation of the input such as the activations of a hidden layer in a neural network. \beta is a hyperparameter controlling the trade-off between compression and predictive power.

When the expression is minimised there is very little mutual information between the compressed representation and the input. At the same time, there is a lot of mutual information between the representation and the output, meaning the representation is useful for prediction.
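
A minimal sketch of evaluating the objective for fully discrete variables, given a joint distribution p(x,y) and a stochastic encoder q(t|x). Both tables and the value of β are illustrative assumptions, and the sketch only scores a fixed encoder rather than optimising one.

```python
import numpy as np

def mutual_information(pab):
    # I(A,B) = sum_{a,b} p(a,b) log( p(a,b) / (p(a) p(b)) )
    pa = pab.sum(axis=1, keepdims=True)
    pb = pab.sum(axis=0, keepdims=True)
    nz = pab > 0
    return np.sum(pab[nz] * np.log(pab[nz] / (pa * pb)[nz]))

pxy = np.array([[0.4, 0.1],           # joint distribution p(x, y)
                [0.1, 0.4]])
qt_given_x = np.array([[0.9, 0.1],    # encoder q(t | x), rows indexed by x
                       [0.1, 0.9]])

px = pxy.sum(axis=1)                  # p(x)
pxt = px[:, None] * qt_given_x        # p(x, t) = p(x) q(t | x)
pty = qt_given_x.T @ pxy              # p(t, y) = sum_x q(t | x) p(x, y)

beta = 2.0
objective = mutual_information(pxt) - beta * mutual_information(pty)
print(objective)
```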

Jensen-Shannon divergence

A symmetric version of the KL-divergence. This means that JS(P,Q) = JS(Q,P), which is not true for the KL-divergence.

JS(P,Q) = \frac{1}{2}(D_{KL}(P||M) + D_{KL}(Q||M))

where M is a mixture distribution equal to \frac{1}{2}(P + Q).
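
A sketch for two discrete distributions, built from a small KL helper (NumPy assumed; the probability vectors are illustrative).

```python
import numpy as np

def kl(p, q):
    # D_KL(P||Q) = sum_i P_i log(P_i / Q_i), skipping terms where P_i = 0.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / q[nz]))

def js(p, q):
    m = 0.5 * (np.asarray(p, dtype=float) + np.asarray(q, dtype=float))
    return 0.5 * (kl(p, m) + kl(q, m))

p = [0.1, 0.4, 0.5]
q = [0.3, 0.3, 0.4]
print(js(p, q), js(q, p))  # identical values: the divergence is symmetric
```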

Kullback-Leibler divergence

A measure of the difference between two probability distributions. Also known as the KL-divergence and the relative entropy. In the usual use case one distribution is the true distribution of the data and the other is an approximation of it.

For discrete distributions it is given as:

D_{KL}(P||Q) = -\sum_i P_i \log \frac{Q_i}{P_i}
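
A short sketch of the discrete formula (NumPy assumed; the vectors are illustrative), which also shows that swapping the arguments changes the result.

```python
import numpy as np

def kl_divergence(p, q):
    # D_KL(P||Q) = -sum_i P_i log(Q_i / P_i); terms with P_i = 0 contribute nothing.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(q[nz] / p[nz]))

p = [0.7, 0.2, 0.1]  # "true" distribution
q = [0.5, 0.3, 0.2]  # approximation of it
print(kl_divergence(p, q), kl_divergence(q, p))  # different values: not symmetric
```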

Disadvantages

If a point is outside the support of Q (Q_i = 0 while P_i > 0), the KL-divergence becomes infinite, since the ratio inside the logarithm diverges. This can be dealt with by adding some random noise to Q. However, this introduces a degree of error, and a lot of noise is often needed for convergence when using the KL-divergence for maximum likelihood estimation. The Wasserstein distance, which also measures the distance between two distributions, does not have this problem.

Properties

  • The KL-divergence is not symmetric.
  • A KL-divergence of 0 means the distributions are identical. The divergence is always non-negative and grows as the distributions become more different.

Mutual information

Measures the dependence between two random variables.

I(X,Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(x)p(y)}

If the variables are independent I(X,Y) = 0. If they are completely dependent I(X,Y) = H(X) = H(Y).
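
A sketch that computes the sum directly from a joint probability table and checks the two limiting cases described above (NumPy assumed; the tables are illustrative).

```python
import numpy as np

def mutual_information(pxy):
    # I(X,Y) = sum_{x,y} p(x,y) log( p(x,y) / (p(x) p(y)) )
    pxy = np.asarray(pxy, dtype=float)
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return np.sum(pxy[nz] * np.log(pxy[nz] / (px * py)[nz]))

independent = np.array([[0.25, 0.25],
                        [0.25, 0.25]])
dependent = np.array([[0.5, 0.0],
                      [0.0, 0.5]])
print(mutual_information(independent))  # 0
print(mutual_information(dependent))    # log 2, which equals H(X) = H(Y) here
```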

Rademacher complexity

A measure of the capacity of a class of functions. It is the expected maximum correlation between the outputs of functions in the class on a sample and random noise (independent Rademacher variables that take the values +1 and -1 with equal probability). A class with a high Rademacher complexity can fit random labels easily.

Total variation distance

Like the Kullback-Leibler divergence, the total variation distance measures the difference between two probability distributions.

For two discrete distributions the total variation distance is given by:

\delta(P,Q) = \frac{1}{2} \sum_i |P_i - Q_i|

VC dimension

Vapnik–Chervonenkis dimension is a measure of the capacity of a model. It is the size of the largest set of points that the model can shatter, that is, classify correctly under every possible assignment of labels to those points.