Probability

“Admits a density/distribution”

If a variable ‘admits a density’, that means it can be described by a probability density function. Contrast with

P(X=a) =
  \begin{cases}
    1 ,& \text{if } a = 0 \\
    0 ,& \text{otherwise}
  \end{cases}

which cannot be described by a pdf, so we would say that X does not admit a density.

Bayes’ rule

P(A|B) = \frac{P(B|A)P(A)}{P(B)}
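
A minimal numeric sketch in Python; the event probabilities below are made up purely for illustration:

# Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B)
# All numbers are hypothetical, chosen only to illustrate the calculation.
p_a = 0.3             # prior P(A)
p_b_given_a = 0.8     # likelihood P(B|A)
p_b_given_not_a = 0.2 # P(B|not A)

# P(B) computed via the law of total probability.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)    # posterior P(A|B), roughly 0.63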

Bayesian inference

The use of Bayes’ rule to update a probability distribution as the amount of evidence changes.

Chain rule of probability

Expresses the joint probability of a set of variables as a product of conditional probabilities (and a single marginal).

P(A_n, ..., A_1) = \prod_{i=1}^{n}P(A_i|A_1,...,A_{i-1})

For three variables this looks like:

P(A_3,A_2,A_1) = P(A_3|A_2,A_1) \cdot P(A_2|A_1) \cdot P(A_1)
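
A small sketch verifying the factorisation numerically on a made-up joint distribution over three binary variables:

import itertools, random

random.seed(0)
# A hypothetical joint distribution over three binary variables A1, A2, A3.
outcomes = list(itertools.product([0, 1], repeat=3))
weights = [random.random() for _ in outcomes]
total = sum(weights)
joint = {o: w / total for o, w in zip(outcomes, weights)}

def marginal(fixed):
    # Probability that the leading variables take the given values.
    return sum(p for o, p in joint.items() if o[:len(fixed)] == tuple(fixed))

# Chain rule: P(A1, A2, A3) = P(A1) * P(A2|A1) * P(A3|A1, A2)
for a1, a2, a3 in outcomes:
    product = (marginal([a1])
               * marginal([a1, a2]) / marginal([a1])
               * marginal([a1, a2, a3]) / marginal([a1, a2]))
    assert abs(product - joint[(a1, a2, a3)]) < 1e-12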

Change of variables

In the context of probability densities, the change of variables formula describes how the density p(y) of a transformed variable can be written in terms of another density, p(x):

p(y) = {|\frac{\partial f(x)}{\partial x}|}^{-1} p(x)

Where y = f(x) and f is an invertible function.
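
A sketch for the particular choice y = f(x) = exp(x) with x standard normal, where the formula recovers the log-normal density:

import math

# Change of variables for y = exp(x): |df/dx| = exp(x) = y,
# so p_Y(y) = p_X(log y) / y, which is the log-normal density.

def p_x(x):                        # standard normal density
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def p_y(y):                        # density of y = exp(x) via change of variables
    x = math.log(y)
    return p_x(x) / y              # divide by |df/dx| evaluated at x

def lognormal_pdf(y):              # closed-form log-normal density for comparison
    return math.exp(-math.log(y) ** 2 / 2) / (y * math.sqrt(2 * math.pi))

print(p_y(2.0), lognormal_pdf(2.0))   # the two values should agree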

Conjugate prior

If the posterior p(\theta|X) belongs to the same family of distributions as the prior p(\theta) (eg both Beta), the prior is said to be conjugate to the likelihood p(X|\theta). Some common likelihood–conjugate prior pairs are given in the table below:

Likelihood          Conjugate prior
Bernoulli           Beta
Binomial            Beta
Negative binomial   Beta
Categorical         Dirichlet
Multinomial         Dirichlet
Poisson             Gamma
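
As a minimal sketch of what conjugacy buys, a Beta prior combined with Bernoulli observations gives a Beta posterior in closed form; the hyperparameters and data below are made up:

# Conjugacy of the Beta prior with the Bernoulli likelihood.
alpha, beta = 2.0, 2.0         # Beta(2, 2) prior on the success probability
data = [1, 0, 1, 1, 0, 1]      # hypothetical observed Bernoulli trials

successes = sum(data)
failures = len(data) - successes

# Because the prior is conjugate, the posterior is again a Beta distribution
# with updated parameters rather than some new, intractable distribution.
posterior_alpha = alpha + successes
posterior_beta = beta + failures
print(posterior_alpha, posterior_beta)   # Beta(6.0, 4.0)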

Distributions

Bernoulli

Distribution for a random variable which is 1 with probability p and zero with probability 1-p.

Special case of the Binomial distribution, which generalizes the Bernoulli to multiple trials.

P(x = k;p) =
\begin{cases}
  p, & \text{if } k = 1\\
  1-p, & \text{if } k = 0
\end{cases}

Beta

Family of distributions defined over [0,1]. This makes them particularly useful for defining the distribution of probabilities.

Binomial

Distribution for the number of successes in n trials, each with probability p of success and 1-p of failure. The probability mass function is:

P(x = k;n,p) = {n\choose k} p^k (1-p)^{n-k}

Is closely approximated by the Poisson distribution when n is large and p is small.
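
A sketch comparing the two pmfs for a made-up choice of large n and small p:

import math

# Binomial pmf versus its Poisson approximation with lambda = n * p.
# The values of n and p are arbitrary, chosen only for illustration.
n, p = 1000, 0.003
lam = n * p

def binomial_pmf(k):
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k):
    return lam ** k * math.exp(-lam) / math.factorial(k)

for k in range(6):
    print(k, round(binomial_pmf(k), 5), round(poisson_pmf(k), 5))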

Boltzmann

P(x_i;T,\epsilon) = \frac{1}{Q} e^{-\epsilon_i / T}

where \epsilon_i is the energy of x_i, T is the temperature of the system and Q is a normalising constant.
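
A small sketch of the distribution over some made-up energy levels, showing how the temperature controls how concentrated it is:

import math

# Boltzmann distribution over a few hypothetical energy levels.
# Q is the normalising constant (the partition function).
energies = [1.0, 2.0, 5.0]

def boltzmann(energies, T):
    weights = [math.exp(-e / T) for e in energies]
    Q = sum(weights)
    return [w / Q for w in weights]

print(boltzmann(energies, T=1.0))   # low temperature: mass concentrates on low energies
print(boltzmann(energies, T=10.0))  # high temperature: closer to uniform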

Categorical

Generalizes the Bernoulli distribution to more than two categories.

P(x = k;p) = p_k

Dirichlet

Multivariate version of the Beta distribution.

Conjugate prior of the categorical and multinomial distributions.

Gamma

Can be used to model the amount of something in a particular period, area or volume. For example, the amount of rainfall in an area in a month. This is in contrast to the Poisson, which models the number of discrete events.

Geometric

Special case of the Negative Binomial distribution in which the given number of failures is one, i.e. the distribution of the number of successes before the first failure.

Gumbel

Used to model the distribution of the maximum (or the minimum) of a number of samples of various distributions.
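
A rough simulation sketch using exponential samples, for which the maximum of n draws is approximately Gumbel with location log(n); the sample sizes are arbitrary:

import random, math

random.seed(0)
n = 1000                      # samples per maximum
# Collect many maxima, each over n Exponential(1) draws.
maxima = [max(random.expovariate(1.0) for _ in range(n)) for _ in range(2000)]

# For Exponential(1), the maximum of n samples is roughly Gumbel with
# location log(n) and scale 1, so its mean is about log(n) + 0.5772
# (the Euler-Mascheroni constant).
print(sum(maxima) / len(maxima), math.log(n) + 0.5772)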

Hypergeometric

Models the probability of k successes in n draws without replacement from a population of size N, where K of the objects in the population have the desired characteristic. Similar to the Binomial, except that the draws are made without replacement which means they are no longer independent.
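
A sketch of the pmf using the standard counting formula; the population sizes below are made up:

import math

# Hypergeometric pmf: probability of k successes in n draws without
# replacement from a population of N items, K of which are successes.
def hypergeometric_pmf(k, N, K, n):
    return math.comb(K, k) * math.comb(N - K, n - k) / math.comb(N, n)

# Hypothetical numbers: 5 draws from a population of 50 containing 10 successes.
print(hypergeometric_pmf(2, N=50, K=10, n=5))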

Multinomial

The distribution of the number of times each of k possible outcomes occurs over n independent trials.

When n and k take on specific values or ranges the Multinomial distribution has specific names.

          k = 2       k \geq 2
n = 1     Bernoulli   Categorical
n \geq 1  Binomial    Multinomial

Multivariate

This section summarises some univariate distributions and their multivariate versions:

Univariate    Multivariate
Bernoulli     Binomial
Categorical   Multinomial
Beta          Dirichlet
Gamma         Wishart

Negative Binomial

Distribution of the number of successes before a given number of failures occur.

Poisson

Used to model the number of events which occur within a particular period, area or volume.

Zipf

A distribution that has been observed to be a good model for things like the frequency of words in a language, where there are a few very popular words and a long tail of lesser-known ones.

For a population of size n, the frequency of the kth most frequent item is:

\frac{1/{k^s}}{\sum_{i=1}^n 1/i^s}

where s \geq 0 is a hyperparameter.
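
A small sketch computing these frequencies for made-up values of n and s:

# Zipf frequencies for a population of n items with exponent s.
def zipf_frequency(k, n, s):
    normaliser = sum(1 / i ** s for i in range(1, n + 1))
    return (1 / k ** s) / normaliser

# The most frequent items dominate; n and s are chosen purely for illustration.
print([round(zipf_frequency(k, n=1000, s=1.0), 4) for k in range(1, 6)])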

Inference

Probabilistic inference is the task of computing the probability of a particular outcome or hypothesis, typically given some observed evidence.

Law of total probability

P(X) = \sum_i P(X|Y=y_i)P(Y=y_i)
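
A minimal numeric sketch with a three-way partition of Y; all probabilities are made up:

# Law of total probability over a three-way partition of Y.
p_y = [0.5, 0.3, 0.2]             # P(Y = y_i), hypothetical values
p_x_given_y = [0.9, 0.4, 0.1]     # P(X | Y = y_i), hypothetical values

p_x = sum(px * py for px, py in zip(p_x_given_y, p_y))
print(p_x)                        # 0.5*0.9 + 0.3*0.4 + 0.2*0.1 = 0.59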

Likelihood

The likelihood of the parameters \theta given the observed data X is equal to the probability of the data given the parameters.

L(\theta|X) = P(X|\theta)

Marginal distribution

The distribution of a variable on its own, P(x), obtained from a joint distribution by summing or integrating out the other variables. Contrast with the conditional distribution P(x|y) or the joint P(x,y).

Marginal likelihood

A likelihood function in which some variable has been marginalised out (removed by summation or integration).

MAP estimation

Maximum a posteriori estimation. A point estimate for the parameters \theta, given the observations X. Can be seen as a regularized form of MLE, since it also incorporates a prior distribution over the parameters. Uses Bayes’ rule to find the parameters that are most probable given the data, rather than the parameters under which the data is most probable (which is what MLE finds). The most probable parameters given the data are usually what we actually want.

\hat{\theta}_{MAP}(X) = \arg \max_\theta p(\theta|X) = \arg \max_\theta \frac{p(X|\theta)q(\theta)}{\int_{\theta'} p(X|\theta')q(\theta') d\theta'} = \arg \max_\theta p(X|\theta)q(\theta)

Where q(\theta) is the prior for the parameters.

Note that in the equation above the denominator can be dropped, since it does not depend on \theta and so does not affect the arg max.
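
A sketch for the Bernoulli likelihood with a Beta prior, where a grid-search MAP estimate can be checked against the closed-form mode of the posterior; the data and hyperparameters are made up:

# MAP estimation for a Bernoulli parameter with a Beta(a, b) prior.
data = [1, 1, 0, 1, 0, 1, 1]   # hypothetical observations
a, b = 2.0, 2.0                # hypothetical prior hyperparameters
k, n = sum(data), len(data)

def unnormalised_posterior(theta):
    # p(X|theta) * q(theta), the quantity MAP maximises; the evidence term
    # in the denominator is constant in theta and can be ignored.
    likelihood = theta ** k * (1 - theta) ** (n - k)
    prior = theta ** (a - 1) * (1 - theta) ** (b - 1)
    return likelihood * prior

# Grid search over theta.
grid = [i / 1000 for i in range(1, 1000)]
theta_map = max(grid, key=unnormalised_posterior)

# Closed form: the mode of the Beta(a + k, b + n - k) posterior.
closed_form = (a + k - 1) / (a + b + n - 2)
print(theta_map, closed_form)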

Maximum likelihood estimation (MLE)

Finds the set of parameters \theta under which the observed data X is most probable. Unlike MAP estimation, no prior over the parameters is taken into account, so this amounts to maximizing the probability of the data given the parameters:

\hat{\theta}_{MLE}(X) = \arg \max_\theta p(X|\theta)
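
A sketch for Bernoulli data, where a grid-search MLE can be checked against its closed form, the sample mean; the data is made up:

import math

# Maximum likelihood estimation for a Bernoulli parameter.
data = [1, 1, 0, 1, 0, 1, 1]   # hypothetical observations

def log_likelihood(theta):
    k, n = sum(data), len(data)
    return k * math.log(theta) + (n - k) * math.log(1 - theta)

grid = [i / 1000 for i in range(1, 1000)]
theta_mle = max(grid, key=log_likelihood)
print(theta_mle, sum(data) / len(data))   # both approximately 0.714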

Normalizing flow

An invertible function that can be used to transform one random variable (often one with a simple base distribution) into another. The function must have a tractable Jacobian determinant, so that the density of the transformed variable can be computed with the change of variables formula.

Extensively used for density estimation.
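
A minimal one-dimensional sketch using a single affine transform of a standard normal base variable; real flows compose many richer invertible transforms, but the density computation follows the same change of variables formula:

import math, random

# A one-dimensional "flow": the invertible affine transform y = a*x + b
# applied to a standard normal base variable x. Parameters are made up.
a, b = 2.0, 1.0

def base_density(x):                       # standard normal base density
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def flow_density(y):
    x = (y - b) / a                        # invert the flow
    return base_density(x) / abs(a)        # divide by |Jacobian| = |a|

def flow_sample():
    return a * random.gauss(0, 1) + b      # push a base sample through the flow

print(flow_density(1.0), flow_sample())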

Prior

A probability distribution before any evidence is taken into account. For example, the probability that it will rain when no observations, such as cloud cover, have been made.

Improper prior

A prior whose density does not integrate (or sum) to one. For example, the distribution for picking an integer uniformly at random, which would have to place infinitesimal mass on each of infinitely many values.

Informative and uninformative priors

Below are some examples for each:

Informative

  • The temperature is normally distributed with mean 20 and variance 3.

Uninformative

  • The temperature is positive.
  • The temperature is less than 200.
  • All temperatures are equally likely.

‘Uninformative’ can be a misnomer. ‘Not very informative’ would be more accurate.

Posterior

A conditional probability distribution that takes evidence into account. For example, the probability that it will rain, given that it is cloudy.