Convolutional networks


AlexNet

A CNN that performed considerably better than the state of the art at the time. It has 60 million parameters and 650,000 neurons, and includes five convolutional layers.

The two ‘streams’ shown in the paper exist only to allow training on two GPUs.


GoogLeNet

A CNN that won the ILSVRC 2014 classification task. It is composed of 9 Inception modules.


LeNet

A basic convolutional network, historically used for the MNIST dataset.

Residual network (ResNet)

An architecture that uses skip connections to make very deep networks trainable. The original paper trained networks of up to 152 layers, 8 times deeper than VGG nets. It was used for image recognition and won first place in the ILSVRC 2015 classification task. Residual connections can also be used to create deeper RNNs, such as Google’s 16-layer RNN encoder-decoder (Wu et al., 2016).

Uses shortcut connections that perform the identity mapping and are added to the outputs of the stacked layers. Each residual block computes:

y = f(x) + x

where x is the input to the block, y is its output and f is a sequence of layers such as convolutions and nonlinearities.
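The residual block above can be sketched in a few lines of plain Python. This is a minimal illustration, not the original implementation: the block output is written y, a small matrix-vector product stands in for the convolutions, and the weights W and the helper names are hypothetical.

```python
from typing import Callable, List

Vector = List[float]


def relu(v: Vector) -> Vector:
    """Element-wise rectified linear unit."""
    return [max(0.0, a) for a in v]


def linear(weights: List[Vector], v: Vector) -> Vector:
    """Matrix-vector product; stands in for a convolution in this sketch."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]


def residual_block(f: Callable[[Vector], Vector], x: Vector) -> Vector:
    """Return y = f(x) + x. The identity shortcut requires f to keep the shape of x."""
    fx = f(x)
    if len(fx) != len(x):
        raise ValueError("f must preserve the input shape for an identity shortcut")
    return [a + b for a, b in zip(fx, x)]


# Hypothetical weights, chosen only for illustration.
W = [[0.5, -1.0], [0.0, 2.0]]


def f(v: Vector) -> Vector:
    """The stacked layers of the block: one linear map followed by a ReLU."""
    return relu(linear(W, v))


x = [1.0, 2.0]
y = residual_block(f, x)  # f(x) = [0.0, 4.0], so y = [1.0, 6.0]

# If the stacked layers output zero, the block reduces to the identity mapping.
zero_out = residual_block(lambda v: [0.0] * len(v), x)
```

The second call shows why the identity shortcut is convenient: a block whose layers contribute nothing simply passes its input through unchanged.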


Several reasons have been hypothesized for why residual networks are effective:

  • Shorter paths: The skip connections provide short paths between the input and output, which makes it easier for residual networks to avoid the vanishing gradient problem.
  • Increased depth: Because the vanishing gradient problem is reduced, ResNets can be trained with more layers, enabling more sophisticated functions to be learnt.
  • Ensembling effect: Veit et al. (2016) demonstrate that a residual network can be seen as an ensemble of sub-networks of different lengths.
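A rough numerical illustration of the first and third points, under strong simplifying assumptions: every layer is a scalar function and its local derivative is treated as a fixed number. In that setting the gradient through a plain stack is the product of the local derivatives f', while through residual blocks y = f(x) + x it is the product of (f' + 1). Expanding that product also gives one term per path through the network, 2^n in total for n blocks. The function names are hypothetical.

```python
from math import comb


def plain_gradient(local_grads):
    """Gradient through a plain stack of layers: the product of each f'."""
    g = 1.0
    for d in local_grads:
        g *= d
    return g


def residual_gradient(local_grads):
    """Gradient through stacked residual blocks: the product of (f' + 1)."""
    g = 1.0
    for d in local_grads:
        g *= d + 1.0
    return g


def path_length_counts(n_blocks):
    """Number of paths that pass through exactly k residual branches, for k = 0..n."""
    return [comb(n_blocks, k) for k in range(n_blocks + 1)]


# Twenty layers whose local derivative is 0.1 (an arbitrary small value).
grads = [0.1] * 20
plain = plain_gradient(grads)     # shrinks towards zero
resid = residual_gradient(grads)  # stays at a usable magnitude

counts = path_length_counts(3)    # [1, 3, 3, 1]: 8 paths through 3 blocks
```

With these numbers the plain gradient is on the order of 1e-20 while the residual one stays above 1, which is the intuition behind the "shorter paths" argument. Real networks have matrix-valued Jacobians, so this is only a sketch of the effect.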

Comparison with Highway Networks

Highway Networks (Srivastava et al., 2015) also use skip connections to make very deep networks easier to train. In contrast to residual networks, their connections are gated as follows:

y = H(x, W_H) \cdot T(x, W_T) + x \cdot (1 - T(x, W_T))

where H is the transformation applied by the layer and T is the transform gate, whose outputs lie between 0 and 1.

Comparisons of the accuracy of the two approaches suggest that the gating brings no benefit. It is therefore detrimental overall, since it increases the number of parameters and the computational complexity of the network.
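The gated combination used by highway layers can be sketched as follows. This is an illustrative version only: H is assumed to be an affine map followed by tanh, T an affine map followed by a sigmoid, and the bias terms b_H and b_T are assumptions that do not appear in the formula as written in these notes. All weights are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function, used here for the transform gate T."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, x):
    """Compute W x + b for a list-of-lists matrix W and list vectors b, x."""
    return [sum(w * a for w, a in zip(row, x)) + bi for row, bi in zip(W, b)]


def highway_layer(x, W_H, b_H, W_T, b_T):
    """Return y = H(x) * T(x) + x * (1 - T(x)), element-wise."""
    h = [math.tanh(a) for a in affine(W_H, b_H, x)]  # candidate transformation H
    t = [sigmoid(a) for a in affine(W_T, b_T, x)]    # transform gate T in (0, 1)
    return [hi * ti + xi * (1.0 - ti) for hi, ti, xi in zip(h, t, x)]


# Hypothetical parameters for a 2-unit layer.
x = [1.0, 2.0]
W_H = [[0.3, -0.2], [0.1, 0.4]]
b_H = [0.0, 0.0]
W_T = [[0.0, 0.0], [0.0, 0.0]]

# A strongly negative gate bias closes the gate: the layer copies its input.
closed = highway_layer(x, W_H, b_H, W_T, [-50.0, -50.0])

# A strongly positive gate bias opens the gate: the layer outputs H(x).
opened = highway_layer(x, W_H, b_H, W_T, [50.0, 50.0])
h_only = [math.tanh(a) for a in affine(W_H, b_H, x)]
```

Note that the gate needs its own parameters (W_T and b_T here) in addition to those of H, which is the extra parameter and compute cost referred to above; a residual block has no counterpart to them.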


VGG

A CNN that secured first and second place in the 2014 ImageNet localization and classification tracks, respectively. VGG stands for the team that submitted the model, Oxford’s Visual Geometry Group. The VGG model consists of 16–19 weight layers and uses small convolutional filters of size 3×3 and 1×1.
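One commonly cited reason for preferring small filters is that a stack of 3×3 convolutions covers the same receptive field as a single larger filter while using fewer weights. The sketch below checks that arithmetic for stride-1 convolutions; the channel count C and the function names are hypothetical, and biases are ignored.

```python
def stacked_receptive_field(kernel_size, n_layers):
    """Receptive field of n stacked stride-1 convolutions with equal kernel size."""
    return 1 + n_layers * (kernel_size - 1)


def conv_weights(kernel_size, channels, n_layers=1):
    """Weight count (no biases) when every layer has `channels` inputs and outputs."""
    return n_layers * kernel_size * kernel_size * channels * channels


C = 64  # hypothetical number of channels

rf_stacked = stacked_receptive_field(3, 3)  # three 3x3 layers see a 7x7 region
rf_single = stacked_receptive_field(7, 1)   # one 7x7 layer sees a 7x7 region

stacked = conv_weights(3, C, n_layers=3)    # 27 * C * C weights
single = conv_weights(7, C)                 # 49 * C * C weights
```

For the same 7×7 receptive field the stacked version needs 27·C² weights against 49·C² for the single layer, and it also applies a nonlinearity after each of the three layers.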