Regularization techniques are used to reduce overfitting and improve generalization to data not seen during training.

General principles

  • Small changes in the inputs should not produce large changes in the outputs.
  • Sparsity. Most features should be inactive most of the time.
  • It should be possible to model the data well using a relatively low dimensional distribution of independent latent factors.


Dropout

For each training case, omit each hidden unit independently with some constant probability. This yields a different thinned network for each training case; at test time the outputs of these networks are combined through averaging. A unit's weights are shared across all the networks in which it is not omitted. This prevents units from co-adapting too much.
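The per-unit masking can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `dropout` and its parameters are hypothetical, and the rescaling of kept units by 1/(1 - p_drop) ("inverted dropout") is an assumption of this sketch, commonly used so that a single unmasked pass at test time approximates the average over thinned networks.

```python
import random

def dropout(activations, p_drop, rng, training=True):
    """Inverted dropout on a list of hidden-unit activations (illustrative sketch)."""
    if not 0.0 <= p_drop < 1.0:
        raise ValueError("p_drop must be in [0, 1)")
    if not training or p_drop == 0.0:
        # Test time: every unit is kept and no rescaling is needed.
        return list(activations)
    keep = 1.0 - p_drop
    # Each unit is dropped independently; kept units are scaled by 1/keep
    # so the expected activation matches the unmasked value (an assumption
    # of this sketch, not something stated in the text above).
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
hidden = [0.5, -1.2, 2.0, 0.3]
train_out = dropout(hidden, p_drop=0.5, rng=rng)                  # some units zeroed
test_out = dropout(hidden, p_drop=0.5, rng=rng, training=False)   # unchanged
```

A new mask is drawn on every call, so each training case effectively sees a different thinned network.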

Dropout’s effectiveness could be due to:

  • An ensembling effect. ‘Training a neural network with dropout can be seen as training a collection of 2^n thinned networks with extensive weight sharing’ - Srivastava et al. (2014)
  • Restricting the network’s ability to co-adapt weights. The idea is that if a node is not reliably included, it would be ineffective for nodes in the next layer to rely on its output. Weights that depend strongly on each other correspond to a sharp local minimum, as a small change in the weights is likely to damage accuracy significantly. Conversely, nodes that take input from a variety of sources will be more resilient and reside in a shallower local minimum.

Can be interpreted as injecting noise inside the network.

Variational dropout

Applied to RNNs. Unlike normal dropout, the same dropout mask is retained over all timesteps, rather than sampling a new one each time the cell is called. Compared to normal dropout, this is less likely to disrupt the RNN’s ability to learn long-term dependencies.
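The difference between the two masking schemes can be sketched as follows. Only the masking step is shown; the recurrent cell itself is omitted, and all names (`sample_mask`, `run_sequence`, the `variational` flag) are hypothetical. The 1/keep rescaling is an assumption carried over from inverted dropout, and `p_drop` is assumed to be below 1.

```python
import random

def sample_mask(size, p_drop, rng):
    """One dropout mask: 0 for dropped units, 1/keep for kept units."""
    keep = 1.0 - p_drop
    return [1.0 / keep if rng.random() < keep else 0.0 for _ in range(size)]

def run_sequence(hidden_states, p_drop, rng, variational=True):
    """Mask a list of per-timestep hidden states (each a list of floats)."""
    size = len(hidden_states[0])
    # Variational dropout: sample the mask once and reuse it at every timestep.
    shared = sample_mask(size, p_drop, rng) if variational else None
    masked = []
    for h in hidden_states:
        # Normal dropout: sample a fresh mask at every timestep instead.
        mask = shared if variational else sample_mask(size, p_drop, rng)
        masked.append([x * m for x, m in zip(h, mask)])
    return masked

states = [[1.0] * 6 for _ in range(5)]   # 5 timesteps, 6 hidden units
out = run_sequence(states, p_drop=0.5, rng=random.Random(1))
```

With the shared mask, the same units are dropped at every timestep, so the surviving units carry information across the whole sequence without interruption.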

Generalization error

The difference between the test error and the training error, also called the generalization gap.

Label smoothing

Replaces the labels with a weighted average of the true labels and the uniform distribution.
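As a minimal sketch, assuming one-hot labels over k classes and a smoothing weight named `epsilon` (both the name and the function `smooth_labels` are illustrative choices, not from the text above):

```python
def smooth_labels(one_hot, epsilon):
    """Weighted average of a one-hot label and the uniform distribution."""
    k = len(one_hot)
    # Weight (1 - epsilon) on the true label, epsilon on the uniform 1/k.
    return [(1.0 - epsilon) * y + epsilon / k for y in one_hot]

smoothed = smooth_labels([0.0, 0.0, 1.0, 0.0], epsilon=0.1)
# The true class gets 0.9 + 0.1/4 and every other class gets 0.1/4;
# the result still sums to 1.
```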


Overfitting

Occurs when the network fails to generalize well, performing well on the training set but worse on the test set. Caused by the model fitting noise that arises because the dataset is only a finite sample from the true distribution.

Weight decay

L1 weight decay

Adds the following term to the loss function:

C \sum_{i=1}^k |\theta_i|

C > 0 is a hyperparameter.

L1 weight decay is mathematically equivalent to MAP estimation with a Laplacian prior on the parameters.
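The penalty term and the contribution it makes to the gradient can be sketched as below. The function names are hypothetical, and the choice of 0 as the subgradient at θ_i = 0 is an assumption (any value in [-C, C] is valid there).

```python
def l1_penalty(theta, c):
    """C * sum_i |theta_i|, the term added to the loss."""
    return c * sum(abs(t) for t in theta)

def l1_subgradient(theta, c):
    """Subgradient of the penalty: C * sign(theta_i), taking sign(0) = 0."""
    return [c * ((t > 0) - (t < 0)) for t in theta]

theta = [1.0, -2.0, 0.0]
penalty = l1_penalty(theta, c=0.1)     # 0.1 * (1 + 2 + 0)
grad = l1_subgradient(theta, c=0.1)    # same magnitude for every nonzero weight
```

Because the pull towards zero has the same magnitude regardless of how small a weight already is, L1 weight decay tends to drive unimportant weights exactly to zero.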

L2 weight decay

Adds the following term to the loss function:

C \sum_{i=1}^k {\theta_i}^2

C > 0 is a hyperparameter.

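L2 weight decay is mathematically equivalent to MAP estimation with a zero-mean Gaussian prior on the parameters:

q(\theta) = N(0,C^{-1})

The exact variance depends on how the loss is scaled. If the loss is the negative log-likelihood, a variance of C^{-1} corresponds to a penalty of (C/2) \sum_{i=1}^k {\theta_i}^2, while the penalty C \sum_{i=1}^k {\theta_i}^2 corresponds to a variance of (2C)^{-1}.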
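A minimal sketch of the penalty and of one plain gradient-descent step that includes it. The names `l2_penalty`, `gradient_step`, `grad_loss` and `lr` are hypothetical; `grad_loss` stands for the gradient of the unregularized loss, which is assumed to be computed elsewhere.

```python
def l2_penalty(theta, c):
    """C * sum_i theta_i^2, the term added to the loss."""
    return c * sum(t * t for t in theta)

def gradient_step(theta, grad_loss, c, lr):
    """One gradient-descent step on loss + C * sum_i theta_i^2."""
    # The gradient of C * theta_i^2 is 2 * C * theta_i, so the update is
    # theta_i <- (1 - 2 * lr * C) * theta_i - lr * grad_loss_i.
    return [t - lr * (g + 2.0 * c * t) for t, g in zip(theta, grad_loss)]

theta = [1.0, -2.0]
# With a zero loss gradient, each weight simply shrinks by (1 - 2 * lr * C).
decayed = gradient_step(theta, grad_loss=[0.0, 0.0], c=0.5, lr=0.1)
```

The multiplicative shrinkage of every weight at each step is where the name "weight decay" comes from.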


Weight decay works by making large parameters costly. Therefore during optimisation the most important parameters will tend to have the largest magnitude. The unimportant ones will be close to zero.

L2 weight decay is sometimes referred to as ridge regression or Tikhonov regularisation in statistics.


Zoneout

Method for regularizing RNNs. At each timestep, a random subset of the hidden units is kept at its previous value (h_t = h_{t-1}) instead of being updated.
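The training-time update can be sketched as follows. The function name `zoneout` and the parameter `p_zone` (the probability that a unit keeps its previous value) are illustrative; `h_new` stands for the candidate state produced by the recurrent cell, which is assumed to be computed elsewhere.

```python
import random

def zoneout(h_prev, h_new, p_zone, rng):
    """Training-time zoneout: each unit keeps h_{t-1} with probability p_zone."""
    return [
        hp if rng.random() < p_zone else hn   # keep old value, or accept update
        for hp, hn in zip(h_prev, h_new)
    ]

rng = random.Random(0)
h_prev = [0.1, 0.2, 0.3, 0.4]
h_new = [1.1, 1.2, 1.3, 1.4]
h_t = zoneout(h_prev, h_new, p_zone=0.25, rng=rng)
```

Unlike dropout, a unit that is not updated is not zeroed: it carries its previous value forward unchanged.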