# Initialization¶

## He initialization¶

The weights are drawn from the following normal distribution:

$$W^{[l]} \sim \mathcal{N}\left(0,\ \frac{2}{n^{[l-1]}}\right)$$

where $W^{[l]}$ are the parameters for layer $l$ of the network and $n^{[l-1]}$ is the size of layer $l-1$ of the network.

The biases are initialized to zero as usual.

He initialization was used to improve the state of the art for image classification (He et al., 2015), although the improvement over ReLU activations with Xavier initialization was very small, reducing top-1 error on ImageNet from 33.9% to 33.8%.
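The scheme above can be sketched in numpy; the layer sizes here are arbitrary placeholders:

```python
import numpy as np

def he_init(n_in, n_out, rng=np.random.default_rng(0)):
    """Draw weights from N(0, 2 / n_in) as in He et al. (2015)."""
    std = np.sqrt(2.0 / n_in)
    w = rng.normal(0.0, std, size=(n_in, n_out))
    b = np.zeros(n_out)  # biases are initialized to zero as usual
    return w, b

# example: a 512 -> 256 fully connected layer
w, b = he_init(512, 256)
```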

## Orthogonal initialization¶

Orthogonal initialization is useful for training very deep networks and can help mitigate vanishing and exploding gradients in RNNs.

See *Explaining and illustrating orthogonal initialization for recurrent neural networks*, Merity (2016).

### LSUV initialization¶

Layer-sequential unit-variance initialization. An iterative initialization procedure:

```
t_max = 10
tol_var = 0.05
pre-initialize each layer with an orthonormal matrix as proposed in Saxe et al. (2013)
for each layer:
    let w be the weights of the layer
    for i in range(t_max):
        let b be the output of the layer on a minibatch
        if abs(var(b) - 1) < tol_var:
            break
        w = w / sqrt(var(b))
```
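The procedure above can be sketched for a single linear layer; the minibatch and layer size are hypothetical, and the orthonormal pre-initialization uses a QR decomposition:

```python
import numpy as np

def lsuv_layer(w, x, t_max=10, tol_var=0.05):
    """Rescale w until the layer output has unit variance (LSUV).
    w is assumed pre-initialized with an orthonormal matrix; x is a minibatch."""
    for _ in range(t_max):
        b = x @ w                  # output of the (linear) layer on the minibatch
        v = float(b.var())
        if abs(v - 1.0) < tol_var:
            break
        w = w / np.sqrt(v)         # scale weights toward unit output variance
    return w

rng = np.random.default_rng(0)
q, _ = np.linalg.qr(rng.normal(size=(256, 256)))  # orthonormal pre-init (Saxe et al., 2013)
x = rng.normal(size=(64, 256))                    # hypothetical data minibatch
w = lsuv_layer(q, x)
```

For a purely linear layer a single rescaling already brings the output variance to 1 exactly; the iteration matters once a nonlinearity sits between the scaling and the measured output.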

### Orthonormal initialization¶

- Initialise the weights from a standard normal distribution: $W_{ij} \sim \mathcal{N}(0, 1)$.
- Perform a QR decomposition and use Q as the initialization matrix. Alternatively, do SVD and pick U or V as the initialization matrix.
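The QR route can be sketched as follows; the sign correction is an extra detail (not mentioned above) that makes the sampled orthogonal matrices uniformly distributed:

```python
import numpy as np

def orthonormal_init(n, rng=np.random.default_rng(0)):
    """Sample W ~ N(0, 1) elementwise, then take Q from its QR decomposition."""
    a = rng.normal(size=(n, n))
    q, r = np.linalg.qr(a)
    # fix the signs of Q's columns so the result is uniform over orthogonal matrices
    q = q * np.sign(np.diag(r))
    return q

w = orthonormal_init(128)
```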

## Xavier initialization¶

Sometimes referred to as Glorot initialization.

$$W^{[l]} \sim \mathcal{N}\left(0,\ \frac{1}{n^{[l-1]}}\right)$$

where $W^{[l]}$ are the parameters for layer $l$ of the network and $n^{[l-1]}$ is the size of layer $l-1$ of the network.

Xavier initialization’s derivation assumes linear activations. Despite this it has been observed to work well in practice.
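A minimal sketch of the fan-in variant described above (some formulations instead use $2/(n_{in}+n_{out})$ or a uniform distribution); the layer sizes are placeholders:

```python
import numpy as np

def xavier_init(n_in, n_out, rng=np.random.default_rng(0)):
    """Draw weights from N(0, 1 / n_in) (Glorot & Bengio, 2010, fan-in form)."""
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))

# example: a 512 -> 256 fully connected layer
w = xavier_init(512, 256)
```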