Affine layer

Synonym for fully-connected layer.


Attention layer

An attention layer takes a query vector and uses it, combined with key vectors, to compute a weighted sum of value vectors. If a key is determined to be highly compatible with the query, the weight for the associated value will be high.

Attention has been used to improve image classification, image captioning, speech recognition, generative models and learning algorithmic tasks, but has probably had the largest impact on neural machine translation.

Computational complexity

Let n be the length of a sequence and d be the embedding size.

A recurrent network’s complexity will be O(nd^2).

A soft attention mechanism must look over every item in the input sequence for every item in the output sequence, resulting in complexity that is quadratic in the sequence length: O(n^2d).
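
As a rough illustration of the two expressions, the sketch below compares them for a fixed embedding size. The function names and the values of n and d are illustrative assumptions, and constant factors are ignored.

```python
def recurrent_cost(n, d):
    # O(n d^2): one d x d matrix-vector product per sequence position
    return n * d ** 2

def attention_cost(n, d):
    # O(n^2 d): every position is compared with every other position
    return n ** 2 * d

d = 512
n_short, n_long = 100, 5000

# Attention is cheaper when n < d and more expensive when n > d.
short_ratio = attention_cost(n_short, d) / recurrent_cost(n_short, d)
long_ratio = attention_cost(n_long, d) / recurrent_cost(n_long, d)
```

The ratio of the two costs is n/d, so the quadratic term only dominates once the sequence is longer than the embedding size.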

Additive attention

Let x = \{x_1,...,x_T\} be the input sequence and y = \{y_1,...,y_U\} be the output sequence.

There is an encoder RNN whose hidden state at index t we refer to as h_t. The decoder RNN’s state at time t is s_t.

Attention is calculated over all the words in the input sequence to form a weighted sum of the encoder hidden states, known as the context vector. This is defined as:

c_t = \sum_{j=1}^{T} \alpha_{tj} h_j

where \alpha_{tj} is the jth element of the softmax of e_t.

The attention given to a particular input word depends on the hidden states of the encoder and decoder RNNs.

e_{tj} = a(s_{t-1}, h_j)

The decoder’s hidden state is computed according to the following expression, where f represents the decoder.

s_t = f(s_{t-1},y_{t-1},c_t)

To predict the output sequence we take the decoder hidden state and the context vector and feed them into a fully connected softmax layer g which gives a distribution over the output vocabulary.

y_t = g(s_t,c_t)
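
A minimal NumPy sketch of one decoder step's context vector follows. The text above does not specify the scoring function a, so the sketch assumes the common form a(s, h) = v^T tanh(W_s s + W_h h). The names W_s, W_h, v and all the sizes are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def additive_context(s_prev, H, W_s, W_h, v):
    """Return (alpha_t, c_t) for one decoder step.

    s_prev: decoder state s_{t-1}, shape (d_s,)
    H:      encoder states h_1..h_T stacked as rows, shape (T, d_h)
    """
    # e_{tj} = v^T tanh(W_s s_{t-1} + W_h h_j)   (assumed form of a)
    e = np.tanh(s_prev @ W_s + H @ W_h) @ v      # shape (T,)
    alpha = softmax(e)                           # weights over input positions
    c = alpha @ H                                # c_t = sum_j alpha_{tj} h_j
    return alpha, c

rng = np.random.default_rng(0)
T, d_h, d_s, d_a = 5, 4, 3, 6                    # illustrative sizes
H = rng.normal(size=(T, d_h))
s_prev = rng.normal(size=d_s)
W_s = rng.normal(size=(d_s, d_a))
W_h = rng.normal(size=(d_h, d_a))
v = rng.normal(size=d_a)
alpha, c = additive_context(s_prev, H, W_s, W_h, v)
```

The weights alpha sum to one, so c is a convex combination of the encoder states.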

Dot-product attention

Returns a weighted average over the values, V.

\text{Attention}(Q,K,V) = \text{softmax}(QK^T)V

Where Q is the query matrix, K is the matrix of keys and V is the matrix of values. \text{softmax}(QK^T) determines the weight of each value in the result, based on the similarity between the query and the value’s corresponding key.

The queries and keys have the same dimension.

The query might be the hidden state of the decoder, the key the hidden state of the encoder and the value the word vector at the corresponding position.
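
The formula can be sketched in NumPy as below. The helper names and the toy matrices are illustrative assumptions. Queries, keys and values are stored as rows.

```python
import numpy as np

def softmax_rows(z):
    z = z - z.max(axis=-1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dot_product_attention(Q, K, V):
    weights = softmax_rows(Q @ K.T)   # (n_queries, n_keys), each row sums to 1
    return weights @ V, weights

Q = np.array([[10.0, 0.0]])                        # one query
K = np.array([[1.0, 0.0], [0.0, 1.0]])             # two keys
V = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])   # their values
out, weights = dot_product_attention(Q, K, V)
```

Here the query matches the first key much more closely than the second, so the output is almost exactly the first row of V.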

Scaled dot-product attention

Divides the dot products by a scaling factor \sqrt{d_k}, where d_k is the dimension of the keys:

\text{Attention}(Q,K,V) = \text{softmax}(\frac{QK^T}{\sqrt{d_k}})V

As d_k grows large the dot products grow in magnitude, pushing the softmax into regions where its gradients are very small. Dividing by \sqrt{d_k} is intended to prevent this.
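
A small sketch, assuming randomly drawn queries and keys, that compares the attention weights with and without the scaling. The sizes and names are illustrative.

```python
import numpy as np

def softmax_rows(z):
    z = z - z.max(axis=-1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]                        # dimension of each key
    weights = softmax_rows(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights

rng = np.random.default_rng(0)
d_k = 256
Q = rng.normal(size=(1, d_k))
K = rng.normal(size=(8, d_k))
V = rng.normal(size=(8, 4))

unscaled = softmax_rows(Q @ K.T)             # tends towards a one-hot vector
out, scaled = scaled_dot_product_attention(Q, K, V)
```

Dividing the scores by a constant greater than one can only lower the largest weight, so the scaled distribution is never more peaked than the unscaled one.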

Hard attention

Form of attention that attends only to one input, unlike soft attention. Trained using the REINFORCE algorithm since, unlike other forms of attention, it is not differentiable.



Soft attention

Forms of attention that attend to every input to some extent, meaning they can be trained through backpropagation. Contrast with hard attention, which attends exclusively to one input.
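
The contrast between the two can be sketched as follows. Selecting the single input by sampling from the attention distribution is an assumption made here for illustration. The values and weights are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
values = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
weights = np.array([0.7, 0.2, 0.1])      # attention distribution over 3 inputs

# Soft attention: a differentiable blend of every input.
soft_out = weights @ values

# Hard attention: exactly one input is selected; the sampling step
# has no gradient, which is why backpropagation alone is not enough.
index = rng.choice(len(values), p=weights)
hard_out = values[index]
```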

Convolutional layer

Transforms an input image into an output image by applying the convolution operation.


Let x be a matrix representing the image and k be another matrix representing the kernel, which is of size NxN with N odd. c(x,k) is the matrix that results from convolving them together. Then, formally, convolution applies the following formula:

c(x,k)_{ij} = \sum_{r=-M}^{M} \sum_{s=-M}^{M} x_{i+r,j+s} k_{r+M,s+M}

Where M = (N - 1)/2.
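
A direct, unoptimized NumPy sketch of this formula, evaluated only at pixels where the whole kernel fits inside the image. The function name and the test image are illustrative.

```python
import numpy as np

def convolve_valid(x, k):
    N = k.shape[0]
    M = (N - 1) // 2
    rows, cols = x.shape
    out = np.zeros((rows - 2 * M, cols - 2 * M))
    for i in range(M, rows - M):
        for j in range(M, cols - M):
            # sum over r, s of x[i+r, j+s] * k[r+M, s+M]
            window = x[i - M:i + M + 1, j - M:j + M + 1]
            out[i - M, j - M] = np.sum(window * k)
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
k = np.ones((3, 3)) / 9.0      # 3x3 mean filter
y = convolve_valid(x, k)       # 2x2 result: [[5, 6], [9, 10]]
```

Because x increases linearly, the mean over each 3x3 window equals the value at its centre pixel.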


Applying the kernel to pixels near or at the edges of the image will result in needing pixel values that do not exist. There are two ways of resolving this:

  • Only apply the kernel to pixels where the operation is valid. For a kernel of size NxN this will reduce the image by (N-1)/2 pixels on each side.
  • Pad the image with zeros to allow the operation to be defined.
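
The two options can be compared with a short sketch. The helper below is an illustrative sliding-window implementation, not a library function.

```python
import numpy as np

def convolve_valid(x, k):
    N = k.shape[0]
    out = np.zeros((x.shape[0] - N + 1, x.shape[1] - N + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + N, j:j + N] * k)
    return out

x = np.ones((5, 5))
k = np.ones((3, 3))
M = (k.shape[0] - 1) // 2

valid = convolve_valid(x, k)                # shape (3, 3): shrinks by M on each side
padded = convolve_valid(np.pad(x, M), k)    # shape (5, 5): zero padding keeps the size
```

With zero padding the corner outputs are smaller (4 instead of 9 here) because part of each corner window covers padded zeros.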


The same convolution operation is applied to every pixel in the image, resulting in a considerable amount of weight sharing. This means convolutional layers are quite efficient in terms of parameters. Additionally, if a fully connected layer were used to represent the functionality of a convolutional layer, most of its parameters would be zero since the convolution is a local operation. This further increases efficiency.

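The amount of computation and the size of the output can be further reduced by setting a stride so the convolution operation is only applied every m pixels. The stride does not change the number of parameters, which depends only on the kernel.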
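
A sketch of a strided convolution, assuming no padding. The kernel has the same number of weights whatever the stride. Only the number of positions it is applied to changes. The function name is illustrative.

```python
import numpy as np

def convolve_strided(x, k, stride):
    N = k.shape[0]
    rows = (x.shape[0] - N) // stride + 1
    cols = (x.shape[1] - N) // stride + 1
    out = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            a, b = i * stride, j * stride    # top-left corner of the window
            out[i, j] = np.sum(x[a:a + N, b:b + N] * k)
    return out

x = np.ones((7, 7))
k = np.ones((3, 3))
dense = convolve_strided(x, k, stride=1)     # shape (5, 5)
strided = convolve_strided(x, k, stride=2)   # shape (3, 3)
```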

1x1 convolution

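These are actually matrix multiplications applied to the channel vector at each pixel, not spatial convolutions. They are a useful way of increasing the depth of the neural network since they are equivalent to f(hW), where f is the activation function, h is the vector of input channels at a pixel and W is the weight matrix.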

If the number of channels decreases from one layer to the next they can also be used for dimensionality reduction.
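
A minimal sketch of a 1x1 convolution as a per-pixel matrix multiplication, assuming a channels-last layout and a ReLU activation. Both are choices made for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def conv1x1(h, W, f=relu):
    # h: (height, width, c_in), W: (c_in, c_out)
    # The same matrix W is applied to the channel vector at every pixel.
    return f(h @ W)

rng = np.random.default_rng(0)
h = rng.normal(size=(8, 8, 64))
W = rng.normal(size=(64, 16))     # 64 channels reduced to 16
out = conv1x1(h, W)               # shape (8, 8, 16)
```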

Dilated convolution

Increases the size of the receptive field of the convolutional layer by spacing out the kernel elements, without increasing the number of parameters.

Used in WaveNet: A Generative Model for Raw Audio, van den Oord et al. (2016).
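
A one-dimensional sketch of the idea, with no padding. The function name and inputs are illustrative, and this is not the WaveNet implementation.

```python
import numpy as np

def dilated_conv1d(x, w, dilation=1):
    span = (len(w) - 1) * dilation + 1      # inputs covered by one output
    out = np.zeros(len(x) - span + 1)
    for t in range(len(out)):
        taps = x[t:t + span:dilation]       # every dilation-th input sample
        out[t] = np.dot(taps, w)
    return out

x = np.arange(10, dtype=float)
w = np.array([1.0, 1.0, 1.0])
y1 = dilated_conv1d(x, w, dilation=1)   # each output covers 3 consecutive inputs
y2 = dilated_conv1d(x, w, dilation=2)   # each output spans 5 inputs, still 3 weights
```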

Separable convolution/filter

A filter or kernel is separable if it (a matrix) can be expressed as the outer product of a column vector and a row vector. This decomposition can reduce the computational cost of the convolution, since an n x n filter can be applied as two one-dimensional passes. Examples include the Sobel edge detection and Gaussian blur filters.

K = uv^T, u, v \in \mathbb{R}^{n \times 1}
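
As a sketch, the horizontal Sobel kernel can be built as an outer product and applied either in one 2D pass or in two 1D passes. The helper and the vector names u and v are illustrative.

```python
import numpy as np

def convolve_valid(x, k):
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

u = np.array([[1.0], [2.0], [1.0]])     # column vector (smoothing)
v = np.array([[1.0, 0.0, -1.0]])        # row vector (differencing)
sobel_x = u @ v                         # 3x3 kernel as an outer product

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 6))
full = convolve_valid(x, sobel_x)                     # 9 multiplications per output
two_pass = convolve_valid(convolve_valid(x, u), v)    # 3 + 3 multiplications per output
```

Both routes give the same result, but the two-pass version needs 2n rather than n^2 multiplications per output pixel.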

Transposed convolutional layer

Sometimes referred to as a deconvolutional layer. Can be used for upsampling.

Pads the input with zeros and then applies a convolution. Has parameters which must be learned, unlike the upsampling layer.
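
A one-dimensional sketch of this description, assuming a stride of 2. Zeros are inserted between the input entries and at both ends, then an ordinary sliding-window convolution is applied. The function name is hypothetical, and the kernel w stands in for the learned parameters.

```python
import numpy as np

def transposed_conv1d(x, w, stride=2):
    n, K = len(x), len(w)
    # Insert stride-1 zeros between entries, then pad K-1 zeros at each end.
    spaced = np.zeros((n - 1) * stride + 1)
    spaced[::stride] = x
    padded = np.pad(spaced, K - 1)
    # Sliding-window product with the flipped kernel.
    out = np.zeros(len(padded) - K + 1)
    for t in range(len(out)):
        out[t] = np.dot(padded[t:t + K], w[::-1])
    return out

x = np.array([1.0, 2.0])
w = np.array([1.0, 2.0, 3.0])
y = transposed_conv1d(x, w)     # length 5: [1, 2, 5, 4, 6]
```

Each input entry contributes a copy of w scaled by its value, placed stride positions apart. The copies overlap and add at index 2.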

Dense layer

Synonym for fully-connected layer.

Fully-connected layer

Applies the following function:

h' = f(hW + b)

f is the activation function. h is the output of the previous hidden layer. W is the weight matrix and b is known as the bias vector.
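
A minimal sketch of this function in NumPy, using ReLU as an example activation. The numbers are made up.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def fully_connected(h, W, b, f=relu):
    # h: (batch, d_in), W: (d_in, d_out), b: (d_out,)
    return f(h @ W + b)

h = np.array([[1.0, 2.0, 3.0]])
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
b = np.array([0.5, -0.5])
out = fully_connected(h, W, b)    # hW = [4, 5], plus b gives [[4.5, 4.5]]
```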

Hierarchical softmax

A layer designed to improve efficiency when the number of output classes is large. Its complexity is logarithmic in the number of classes rather than linear, as for a standard softmax layer.

A tree is constructed where the leaves are the output classes.
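
One way to sketch this, assuming a complete binary tree, a number of classes that is a power of two, and a sigmoid decision at each internal node. None of these details are fixed by the description above, and all names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def leaf_probability(h, node_weights, leaf, n_classes):
    """P(class = leaf | h) for a complete binary tree stored in heap order.

    Internal nodes are numbered 1..n_classes-1 and
    leaves n_classes..2*n_classes-1.
    """
    node = leaf + n_classes
    prob = 1.0
    while node > 1:                 # walk from the leaf up to the root
        parent = node // 2
        p_right = sigmoid(node_weights[parent] @ h)
        prob *= p_right if node % 2 == 1 else 1.0 - p_right
        node = parent
    return prob

rng = np.random.default_rng(0)
n_classes, d = 8, 5
h = rng.normal(size=d)
node_weights = rng.normal(size=(n_classes, d))   # row 0 is unused
probs = [leaf_probability(h, node_weights, c, n_classes) for c in range(n_classes)]
```

Each class probability touches only log2(8) = 3 nodes rather than all 8 outputs, and the leaf probabilities still sum to one.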

Alternative methods include Noise Contrastive Estimation and Negative Sampling.

Classes for Fast Maximum Entropy Training, Goodman (2001)

Inception layer

Using convolutional layers means it is necessary to choose the kernel size (1x1, 3x3, 5x5 etc.). Inception layers avoid this choice by using multiple convolutional layers with different kernel sizes in parallel and concatenating the results.


Padding can ensure the different convolution sizes still have the same size of output. The pooling component can be concatenated by using a stride of length 1 for the pooling.

5x5 convolutions are expensive, so 1x1 convolutions are applied first to make the architecture computationally viable. The 1x1 convolutions perform dimensionality reduction by reducing the number of filters. This is not a property of 1x1 convolutions in general. Rather, the authors chose the number of output filters to be smaller than the number of input filters.
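
The saving can be illustrated by counting weights. The channel counts below are illustrative assumptions, and biases are ignored.

```python
def conv_weights(kernel, c_in, c_out):
    # number of weights in a kernel x kernel convolution, ignoring biases
    return kernel * kernel * c_in * c_out

c_in, c_reduced, c_out = 192, 16, 32

# 5x5 convolution applied directly to all input channels
direct = conv_weights(5, c_in, c_out)

# 1x1 convolution down to c_reduced channels, followed by the 5x5 convolution
reduced = conv_weights(1, c_in, c_reduced) + conv_weights(5, c_reduced, c_out)
```

With these numbers the direct version needs 153600 weights and the reduced version 15872, roughly a tenfold saving.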

Nine inception layers are used in GoogLeNet, a 22-layer deep network that was the state-of-the-art solution for the ILSVRC in 2014.

Pooling layer

Max pooling

Transforms the input by taking the max along a particular dimension. In sequence processing this is usually the length of the sequence.

Mean pooling

Also known as average pooling. Identical to max-pooling except the mean is used instead of the max.
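
A short sketch of both operations applied over the length of a sequence. The input values are made up.

```python
import numpy as np

# A sequence of 4 time steps, each a 3-dimensional feature vector.
x = np.array([[1.0, 5.0, 2.0],
              [4.0, 0.0, 2.0],
              [3.0, 1.0, 8.0],
              [0.0, 2.0, 4.0]])

max_pooled = x.max(axis=0)      # [4, 5, 8]: largest value of each feature
mean_pooled = x.mean(axis=0)    # [2, 2, 4]: average value of each feature
```

Either way the result has a fixed size regardless of the sequence length.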

RoI pooling

Used to solve the problem that the regions of interest (RoI) identified by the bounding boxes can be different shapes in object recognition. The CNN requires all inputs to have the same dimensions.

The RoI is divided into a fixed number of rectangles of approximately equal size (sizes can differ at the edges). If doing 3x3 RoI pooling there will be 9 rectangles in each RoI. We do max-pooling over each rectangle to get 3x3 numbers per RoI, whatever its original shape.
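
A simplified sketch of the procedure, assuming integer RoI coordinates and an RoI at least 3 pixels wide and tall. The function name and the (top, left, bottom, right) convention are hypothetical.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_size=3):
    """roi = (top, left, bottom, right), with bottom and right exclusive."""
    top, left, bottom, right = roi
    region = feature_map[top:bottom, left:right]
    n_rows, n_cols = region.shape
    # Bin edges; bins differ by at most one pixel when sizes do not divide evenly.
    row_edges = [(i * n_rows) // out_size for i in range(out_size + 1)]
    col_edges = [(j * n_cols) // out_size for j in range(out_size + 1)]
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            cell = region[row_edges[i]:row_edges[i + 1],
                          col_edges[j]:col_edges[j + 1]]
            out[i, j] = cell.max()
    return out

feature_map = np.arange(64, dtype=float).reshape(8, 8)
tall = roi_max_pool(feature_map, (1, 2, 8, 7))   # 7x5 region
wide = roi_max_pool(feature_map, (0, 0, 3, 6))   # 3x6 region
```

Both regions produce a 3x3 output even though their shapes differ.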

Softmax layer

A fully-connected layer with a softmax activation function.
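
A minimal sketch, assuming row-vector inputs and randomly drawn weights for illustration.

```python
import numpy as np

def softmax_layer(h, W, b):
    z = h @ W + b                              # fully-connected part
    z = z - z.max(axis=-1, keepdims=True)      # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)   # each row is a distribution

rng = np.random.default_rng(0)
h = rng.normal(size=(2, 4))      # batch of 2 inputs
W = rng.normal(size=(4, 3))      # 3 output classes
b = np.zeros(3)
probs = softmax_layer(h, W, b)
```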

Upsampling layer

Simple layer used to increase the size of its input by repeating its entries. Does not have any parameters.

Example of a 2D upsampling layer:
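
A minimal NumPy sketch, assuming an upsampling factor of 2 in both dimensions. The function name is illustrative.

```python
import numpy as np

def upsample2d(x, factor=2):
    # Repeat every entry `factor` times along rows, then along columns.
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

x = np.array([[1, 2],
              [3, 4]])
y = upsample2d(x)
# y is
# [[1 1 2 2]
#  [1 1 2 2]
#  [3 3 4 4]
#  [3 3 4 4]]
```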