Initialization

He initialization

The weights are drawn from a zero-mean normal distribution with standard deviation \sqrt{2/n_i} (i.e. variance 2/n_i):

\theta^{(i)} \sim N(0, \sqrt{2/n_i})

where \theta^{(i)} are the parameters for layer i of the network and n_i is the size of layer i, i.e. the fan-in of the weights \theta^{(i)}.

The biases are initialized to zero as usual.
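
A minimal sketch for a fully connected layer (the function name and the fan_in/fan_out sizes are illustrative assumptions, not from the paper):

    import numpy as np

    def he_init(fan_in, fan_out, rng=None):
        # He initialization: weights ~ N(0, 2 / fan_in), biases set to zero.
        rng = np.random.default_rng() if rng is None else rng
        std = np.sqrt(2.0 / fan_in)
        W = rng.normal(loc=0.0, scale=std, size=(fan_in, fan_out))
        b = np.zeros(fan_out)
        return W, b

    W, b = he_init(fan_in=256, fan_out=128)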

Effectiveness

He initialization was used to improve the state of the art for image classification (He et al., 2015), but the improvement over ReLU activations with Xavier initialization was very small, reducing top-1 error on ImageNet from 33.9% to 33.8%.

Initialization with zeros

All of the values are initialised to zero. In practice this is used only for the bias vectors, since the weight matrix, which is initialised with random values, provides the symmetry breaking.
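
For example (sizes and the weight scale are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(256, 128))   # random weights break the symmetry
    b = np.zeros(128)                             # biases can safely start at zero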

Orthogonal initialization

Initializes the weights as an orthogonal matrix. Useful for training very deep networks. Can be used to help with vanishing and exploding gradients in RNNs.

The procedure is as follows:

1. Generate a matrix of random numbers, X (e.g. drawn from the normal distribution).
2. Perform the QR decomposition X = QR, resulting in an orthogonal matrix Q and an upper triangular matrix R.
3. Initialise the weights with Q.
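
A minimal NumPy sketch (shapes are illustrative; the sign correction using the diagonal of R is an optional refinement that makes Q uniformly distributed over orthogonal matrices):

    import numpy as np

    def orthogonal_init(fan_in, fan_out, rng=None):
        # Orthogonal initialization via the QR decomposition of a random matrix.
        rng = np.random.default_rng() if rng is None else rng
        X = rng.normal(size=(max(fan_in, fan_out), min(fan_in, fan_out)))
        Q, R = np.linalg.qr(X)            # Q has orthonormal columns
        Q = Q * np.sign(np.diag(R))       # optional: makes the distribution uniform
        return Q if Q.shape == (fan_in, fan_out) else Q.T

    W = orthogonal_init(256, 128)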

LSUV initialization

Layer-sequential unit-variance initialization (Mishkin and Matas, 2015). An iterative procedure that rescales each layer's weights until the layer's output has approximately unit variance (see the sketch after this list):

1. set t_max = 10 and tol_var = 0.05
2. pre-initialize the layers with orthonormal matrices as proposed in Saxe et al. (2013)
3. for each layer, from the first to the last:
4.    let w be the weights of the layer
5.    for i in range(t_max):
6.        do a forward pass on a mini-batch and let b be the output of the layer
7.        if abs(var(b) - 1) < tol_var:
8.            break
9.        w = w / sqrt(var(b))
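
A minimal NumPy sketch for a plain ReLU multilayer perceptron, assuming the layers have already been pre-initialized with orthonormal matrices; the network structure, the helper name lsuv_init and the mini-batch x are illustrative assumptions, not taken from the paper:

    import numpy as np

    def lsuv_init(weights, x, t_max=10, tol_var=0.05):
        # weights: list of pre-initialized (orthonormal) weight matrices,
        #          applied in order with a ReLU between layers.
        # x:       a mini-batch of inputs, shape (batch_size, n_features).
        h = x
        for W in weights:
            for _ in range(t_max):
                b = h @ W                          # output of this layer on the mini-batch
                if abs(b.var() - 1.0) < tol_var:   # variance is close enough to 1
                    break
                W /= np.sqrt(b.var())              # rescale the weights in place
            h = np.maximum(h @ W, 0.0)             # apply the ReLU and move on
        return weights

    rng = np.random.default_rng(0)
    x = rng.normal(size=(128, 64))
    weights = [np.linalg.qr(rng.normal(size=(64, 64)))[0] for _ in range(3)]
    weights = lsuv_init(weights, x)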

Orthonormal initialization

  1. Draw a matrix X with entries from a standard normal distribution: X_{jk} \sim N(0, 1).
  2. Perform the QR decomposition X = QR and use Q as the initialization matrix. Alternatively, perform the SVD X = U \Sigma V^T and use U or V as the initialization matrix.
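
A sketch of the SVD variant, assuming a square weight matrix for simplicity:

    import numpy as np

    def orthonormal_init(n, rng=None):
        # Draw a standard normal matrix and keep an orthogonal factor of its SVD.
        rng = np.random.default_rng() if rng is None else rng
        X = rng.normal(size=(n, n))
        U, _, Vt = np.linalg.svd(X)
        return U                          # U (or Vt.T) is an orthogonal matrix

    W = orthonormal_init(64)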

Symmetry breaking

An essential property of good initialization for fully connected layers. In a fully connected layer every hidden node receives exactly the same set of inputs, so if all of the weights are initialised to the same value the nodes compute identical outputs and receive identical gradients. Thus they will never take on different values, and the layer effectively behaves as if it had a single hidden unit.
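
A small demonstration of the failure mode, using a one-hidden-layer network with both hidden units initialised to the same constant (all names and sizes are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=(32, 4))                  # mini-batch of inputs
    y = rng.normal(size=(32, 1))                  # targets

    W1 = np.full((4, 2), 0.5)                     # both hidden units start identical
    W2 = np.full((2, 1), 0.5)

    h = np.tanh(x @ W1)                           # hidden activations
    err = h @ W2 - y                              # dL/dy_hat for 0.5 * squared error

    dW2 = h.T @ err                               # gradient w.r.t. output weights
    dW1 = x.T @ ((err @ W2.T) * (1 - h ** 2))     # gradient w.r.t. input weights

    # Both hidden units receive exactly the same gradient column, so gradient
    # descent can never make them different.
    print(np.allclose(dW1[:, 0], dW1[:, 1]))      # True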

Xavier initialization

Sometimes referred to as Glorot initialization.

\theta^{(i)} \sim U(-\frac{\sqrt{6}}{\sqrt{n_i+n_{i+1}}},\frac{\sqrt{6}}{\sqrt{n_i+n_{i+1}}})

where \theta^{(i)} are the parameters for layer i of the network, and n_i and n_{i+1} are the sizes of layers i and i+1.

Xavier initialization’s derivation assumes linear activations. Despite this, it has been observed to work well in practice for networks whose activations are nonlinear.
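
A minimal sketch for a fully connected layer, with n_in and n_out playing the roles of n_i and n_{i+1} (names are illustrative assumptions):

    import numpy as np

    def xavier_init(n_in, n_out, rng=None):
        # Xavier/Glorot uniform initialization: U(-limit, limit) with
        # limit = sqrt(6 / (n_in + n_out)); biases set to zero.
        rng = np.random.default_rng() if rng is None else rng
        limit = np.sqrt(6.0 / (n_in + n_out))
        W = rng.uniform(low=-limit, high=limit, size=(n_in, n_out))
        b = np.zeros(n_out)
        return W, b

    W, b = xavier_init(n_in=256, n_out=128)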