# Initialization

## He initialization

The weights are drawn from the following normal distribution:

$$W^{[l]} \sim \mathcal{N}\!\left(0, \frac{2}{n^{[l-1]}}\right)$$

where $W^{[l]}$ are the parameters for layer $l$ of the network and $n^{[l-1]}$ is the size of layer $l-1$ of the network.

The biases are initialized to zero as usual.
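A minimal numpy sketch of this scheme for a single fully connected layer (the function name and shapes are illustrative, not from the source):

```python
import numpy as np

def he_init(n_in, n_out, rng):
    # Weights ~ N(0, 2 / n_in), where n_in is the size of the previous layer;
    # biases start at zero as usual.
    W = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))
    b = np.zeros(n_out)
    return W, b

rng = np.random.default_rng(0)
W, b = he_init(512, 256, rng)
print(W.std())  # close to sqrt(2/512) ~ 0.0625
```

With this many samples the empirical standard deviation lands very close to the target $\sqrt{2/512}$.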

### Effectiveness

He initialization was used to improve the state of the art for image classification (He et al., 2015), but the improvement over ReLU activations with Xavier initialization was very small, reducing top-1 error on ImageNet from 33.9% to 33.8%.

## Initialization with zeros

All of the weights are initialized to zero. This is typically used only for bias vectors, since the weight matrix, which is initialized with random values, provides the symmetry breaking.

## Orthogonal initialization

Initializes the weights as an orthogonal matrix. Useful for training very deep networks. Can be used to help with vanishing and exploding gradients in RNNs.

The procedure is as follows:

```
1. Generate a matrix of random numbers, X (e.g. from the normal distribution).
2. Perform the QR decomposition X = QR, resulting in an orthogonal matrix Q and an upper triangular matrix R.
3. Initialize with Q.
```
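The steps above can be sketched in numpy as follows (the sign correction using the diagonal of R is a standard extra step, not part of the source's procedure, that makes the result uniformly distributed over orthogonal matrices):

```python
import numpy as np

def orthogonal_init(n, rng):
    X = rng.normal(size=(n, n))   # step 1: random matrix
    Q, R = np.linalg.qr(X)        # step 2: QR decomposition
    # Flip column signs so the diagonal of R is positive (standard correction).
    Q = Q * np.sign(np.diag(R))
    return Q                      # step 3: initialize with Q

rng = np.random.default_rng(0)
Q = orthogonal_init(4, rng)
print(np.allclose(Q.T @ Q, np.eye(4)))  # True: Q is orthogonal
```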


### LSUV initialization

Layer-sequential unit-variance initialization. An iterative initialization procedure:

```
t_max = 10
tol_var = 0.05
pre-initialize each layer with an orthonormal matrix as proposed in Saxe et al. (2013)
for each layer:
    let W be the weights of the layer
    for i in range(t_max):
        let b be the output of the layer
        if abs(var(b) - 1) < tol_var:
            break
        W = W / sqrt(var(b))
```
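A rough numpy sketch of this loop for a single linear layer (the toy data and names are my own; real LSUV iterates over every layer of the network and uses that layer's actual outputs):

```python
import numpy as np

def lsuv_layer(W, x, t_max=10, tol_var=0.05):
    # Rescale W until the layer output has approximately unit variance.
    for _ in range(t_max):
        b = x @ W                # recompute the layer output each iteration
        v = b.var()
        if abs(v - 1) < tol_var:
            break
        W = W / np.sqrt(v)
    return W

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 64))
# Orthonormal pre-initialization (Saxe et al., 2013), deliberately mis-scaled
# here so the variance-normalizing loop has work to do.
Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))
W = lsuv_layer(3.0 * Q, x)
print(abs((x @ W).var() - 1) < 0.05)  # True after rescaling
```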

### Orthonormal initialization

- Initialize the weights from a standard normal distribution: $W_{ij} \sim \mathcal{N}(0, 1)$.
- Perform a QR decomposition and use Q as the initialization matrix. Alternatively, perform an SVD and use U or V as the initialization matrix.
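The SVD alternative can be sketched as follows (a minimal illustration, assuming a square weight matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))       # standard normal draw
U, _, Vt = np.linalg.svd(W)         # W = U @ diag(s) @ Vt
# Either U or Vt.T can serve as the orthonormal initialization matrix.
print(np.allclose(U @ U.T, np.eye(64)))  # True: U is orthonormal
```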

## Symmetry breaking

An essential property of good initialization for fully connected layers. In a fully connected layer, every hidden node has exactly the same set of inputs, so if all nodes are initialized to the same value their gradients will also be identical. Thus they will never take on different values.
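A toy numpy demonstration of the failure mode (my own construction, not from the source): two hidden units initialized identically receive identical gradients, so gradient descent can never tell them apart.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))        # batch of 4 inputs with 3 features
W1 = np.full((3, 2), 0.5)          # both hidden units initialized identically
w2 = np.array([1.0, 1.0])          # output weights, also identical

h = np.tanh(x @ W1)                # hidden activations, shape (4, 2)
y = h @ w2                         # scalar output per example

# Gradient of sum(y) with respect to W1 via the chain rule.
dh = (1 - h ** 2) * w2             # backprop through tanh and the output layer
dW1 = x.T @ dh                     # shape (3, 2): one column per hidden unit

# Both columns are identical, so the two units stay equal after any update.
print(np.allclose(dW1[:, 0], dW1[:, 1]))  # True
```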

## Xavier initialization

Sometimes referred to as Glorot initialization.

The weights are drawn from the following normal distribution:

$$W^{[l]} \sim \mathcal{N}\!\left(0, \frac{1}{n^{[l-1]}}\right)$$

where $W^{[l]}$ are the parameters for layer $l$ of the network and $n^{[l-1]}$ is the size of layer $l-1$ of the network.

Xavier initialization’s derivation assumes linear activations. Despite this, it has been observed to work well in practice for networks whose activations are nonlinear.
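A minimal sketch of the scheme, using the variance $1/n_{\text{in}}$ variant (Glorot and Bengio's paper also proposes a $2/(n_{\text{in}} + n_{\text{out}})$ variant; the function name here is illustrative):

```python
import numpy as np

def xavier_init(n_in, n_out, rng):
    # Weights ~ N(0, 1 / n_in); biases start at zero as usual.
    W = rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))
    b = np.zeros(n_out)
    return W, b

rng = np.random.default_rng(0)
W, b = xavier_init(512, 256, rng)
print(W.std())  # close to sqrt(1/512) ~ 0.044
```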