Initialization

He initialization

The weights are drawn from the following normal distribution:

\theta^{(i)} \sim N(0, 2/n_i)

where \theta^{(i)} are the parameters for layer i of the network, n_i is the size of layer i (the fan-in of its weights), and 2/n_i is the variance of the distribution, i.e. the standard deviation is \sqrt{2/n_i}.

The biases are initialized to zero as usual.
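
A minimal NumPy sketch of this scheme for a single fully connected layer; the function name and the (fan_in, fan_out) weight shape are illustrative assumptions:

    import numpy as np

    def he_init(fan_in, fan_out, seed=0):
        # Zero-mean Gaussian with standard deviation sqrt(2 / fan_in);
        # NumPy's normal() takes the standard deviation, not the variance.
        rng = np.random.default_rng(seed)
        W = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))
        b = np.zeros(fan_out)  # biases start at zero, as above
        return W, b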

He et al. (2015) used this initialization to improve the state of the art for image classification, but the improvement over ReLU activations with Xavier initialization was very small, reducing top-1 error on ImageNet from 33.9% to 33.8%.

Orthogonal initialization

Useful for training very deep networks. Can be used to help with vanishing and exploding gradients in RNNs.

See “Explaining and illustrating orthogonal initialization for recurrent neural networks”, Merity (2016).

LSUV initialization

Layer-sequential unit-variance initialization (Mishkin & Matas, 2015). An iterative initialization procedure (a runnable sketch follows the steps below):

1. t_max = 10
2. tol_var = 0.05
3. pre-initialize the layers with orthonormal matrices as proposed in Saxe et al. (2013)
4. for each layer:
5.    let w be the weights of the layer
6.    for i in range(t_max):
7.        let b be the output of the layer on a minibatch
8.        if abs(var(b) - 1) < tol_var:
9.            break
10.        w = w / sqrt(var(b))
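
A rough NumPy sketch of this loop for a stack of fully connected ReLU layers, assuming the network is represented as a list of weight matrices and that the variance is estimated on a single minibatch x_batch; the function and variable names are mine, not from the LSUV paper:

    import numpy as np

    def orthonormal(shape, rng):
        # Saxe et al. (2013): take Q from the QR decomposition of a Gaussian
        # matrix (assumes shape[0] >= shape[1] so Q has the requested shape).
        q, _ = np.linalg.qr(rng.normal(size=shape))
        return q

    def lsuv_init(layer_shapes, x_batch, t_max=10, tol_var=0.05, seed=0):
        rng = np.random.default_rng(seed)
        weights = [orthonormal(s, rng) for s in layer_shapes]
        h = x_batch
        for w in weights:
            for _ in range(t_max):
                b = np.maximum(h @ w, 0.0)   # layer output (ReLU activation)
                if abs(b.var() - 1.0) < tol_var:
                    break
                w /= np.sqrt(b.var())        # rescale the weights in place
            h = np.maximum(h @ w, 0.0)       # feed the rescaled output forward
        return weights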

Orthonormal initialization

  1. Initialise the weights from a standard normal distribution: \theta_i \sim N(0, 1).
  2. Perform a QR decomposition and use Q as the initialization matrix. Alternatively, do SVD and pick U or V as the initialization matrix.
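
A minimal NumPy sketch of these two steps; the function name and the assumption that fan_in >= fan_out are mine:

    import numpy as np

    def orthonormal_init(fan_in, fan_out, seed=0):
        rng = np.random.default_rng(seed)
        # 1. Sample from a standard normal distribution.
        a = rng.normal(size=(fan_in, fan_out))
        # 2. QR decomposition; Q has orthonormal columns when fan_in >= fan_out.
        q, _ = np.linalg.qr(a)
        # Alternative: SVD, taking U (or V) as the initialization matrix:
        # u, _, vt = np.linalg.svd(a, full_matrices=False)
        return q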

Xavier initialization

Sometimes referred to as Glorot initialization.

\theta^{(i)} \sim U(-\frac{\sqrt{6}}{\sqrt{n_i+n_{i+1}}},\frac{\sqrt{6}}{\sqrt{n_i+n_{i+1}}})

where \theta^{(i)} are the parameters for layer i of the network and n_i is the size of layer i, so n_i and n_{i+1} are the fan-in and fan-out of layer i’s weights.

Xavier initialization’s derivation assumes linear activations. Despite this, it has been observed to work well in practice.
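
A minimal NumPy sketch of the uniform Xavier/Glorot scheme above, where n_in and n_out stand for n_i and n_{i+1}; the function name is mine:

    import numpy as np

    def xavier_init(n_in, n_out, seed=0):
        rng = np.random.default_rng(seed)
        # U(-limit, limit) with limit = sqrt(6) / sqrt(n_in + n_out).
        limit = np.sqrt(6.0) / np.sqrt(n_in + n_out)
        return rng.uniform(-limit, limit, size=(n_in, n_out))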