Sequence models

Bidirectional RNN

Combines the outputs of two RNNs, one processing the input sequence from left to right (forwards in time) and the other from right to left (backwards in time). The two RNNs are stacked on top of each other and their states are typically combined by appending the two vectors. Bidirectional RNNs are often used in Natural Language problems, where we want to take the context from both before and after a word into account before making a prediction.

The basic bidirectional RNN can be defined as follows:

h^f_t = \tanh(W^f_h x_t + U^f_h h^f_{t-1})

h^b_t = \tanh(W^b_h x_{T-t} + U^b_h h^b_{t-1})

h_t = \text{concat}(h^f_t,h^b_t)

o_t = V h_t

Where h^f_t and h^b_t are the hidden states in the forwards and backwards directions respectively. T is the length of the sequence. Biases have been omitted for simplicity. x_t and o_t are the input and output states at time t, respectively.

Differentiable Neural Computer (DNC)

The memory is an NxW matrix. There are N locations, which can be selectively read and written to. Read vectors are weighted sums over the memory locations.

The heads use three forms of differentiable attention which:

  • Look up content.
  • Record transitions between consecutively written locations in an NxN temporal link matrix L.
  • Allocate memory for writing.

GRU (Gated Recurrent Unit)

Variation of the LSTM that is simpler to compute and implement, mergeing the cell and the hidden state.

Comparable performance to LSTMs on a translation task. Has two gates, a reset gate r and an update gate z. Not reducible from LSTM as there is only one tanh nonlinearity. Cannot ‘count’ as LSTMs can. Partially negates the vanishing gradient problem, as LSTMs do.

The formulation is:

r_t = \sigma(W_r x_t + U_r h_{t-1})

z_t = \sigma(W_z x_t + U_z h_{t-1})

\tilde h_t = \tanh(Wx + U(h_{t-1}*r))

h_t = z_t*h_{t-1} + (1-z) * \tilde h_t

Where * represents element-wise multiplication and W_r, U_r, W_z, U_z, W and U are parameters to be learnt. Note the lack of bias terms, in contrast with the LSTM.

z is used for constructing the new hidden vector and dictates which information is updated from the new output and which is remembered from the old hidden vector. r is used for constructing the output and decides which parts of the hidden vector will be used and which won’t be. The input for the current time-step is always used.

LSTM (Long Short-Term Memory)

A type of RNN with a memory cell as the hidden state. Uses a gating mechanism to ensure proper propagation of information through many timesteps. Traditional RNNs struggle to train for behaviour requiring long lags due to the exponential loss in error as back propagation proceeds through time (vanishing gradient problem). LSTMs store the error in the memory cell, making long memories possible. However, repeated access to the cell means the issue remains for many problems.

Can have multiple layers. The input gate determines when the input is significant enough to remember. The output gate decides when to output the value. The forget gate determines when the value should be forgotten.

The activations of the input, forget and output gates are i_t, f_t and o_t respectively. The state of the memory cell is C_t.

i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)

f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)

\tilde C_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)

C_t = i_t*\tilde C_t + f_t*C_{t-1}

o_t = \sigma(W_o x_t + U_o h_{t-1} + V_o C_t + b_o)

h_t = o_t * \tanh(C_t)

Where * represents element-wise multiplication.

Each of the input, output and forget gates is surrounded by a sigmoid nonlinearity. This squashes the input so it is between 0 (let nothing through the gate) and 1 (let everything through).

The new cell state is the candidate cell state scaled by the input gate activation, representing how much we want to remember each value and added to the old cell state, scaled by the forget gate activation, how much we want to forget each of those values.

The \tanh functions serve to add nonlinearities.

Using an LSTM does not protect from exploding gradients.

Forget bias initialization

Helpful to initialize the bias of the forget gate to 1 in order to reduce the scale of forgetting at the start of training. This is done by default in TensorFlow.

Weight tying

Tie the input and output embeddings. May only be applicable to generative models. Discriminative ones do not have an output embedding.

Using the Output Embedding to Improve LMs, Press and Wolf (2016)

Cell clipping

Clip the activations of the memory cells to a range such as [-3,3] or [-50,50]. Helps with convergence problems by preventing exploding gradients and saturation in the sigmoid/tanh nonlinearities. Deep Recurrent Neural Networks for Acoustic Modelling, Chan and Lane (2015) LSTM RNN Architectures for Large Scale Acoustic Modeling, Sak et al. (2014)

Peep-hole connections

Allows precise timing to be learned, such as the frequency of a signal and other periodic patterns. Learning Precise Timing with LSTM Recurrent Networks, Ger et al. (2002) LSTM RNN Architectures for Large Scale Acoustic Modeling, Sak et al. (2014)

Neural Turing Machine (NTM)

Can infer simple algorithms like copying, sorting and associative recall.

Has two principal components:

  1. A controller, an LSTM. Takes the inputs and emits the outputs for the NTM as a whole.
  2. A memory matrix.

The controller interacts with the memory via a number of read and write heads. Read and write operations are ‘blurry’. A read is a convex combination of ‘locations’ or rows in the memory matrix, according to a weight vector over locations assigned to the read head. Writing uses an erase vector and an add vector. Both content-based and location-based addressing systems are used.

Similarity between vectors is measured by the cosine similarity.

Location-based addressing is designed for both iteration across locations and random-access jumps.

Content addressing

Compares a key vector to each location in memory, M_t(i) to produce a normalised weighting, w_t^c(i). t>0 is the key strength, used to amplify or attenuate the focus.


Blends the weighting produced at the previous time step and the content weighting. An ‘interpolation gate’ is emitted by each head. If g_t=1 the addressing is entirely content-based. If g_t=0, the addressing is entirely location-based.

Convolutional shift

Provides a rotation to the weight vector w_t^g. All index arithmetic is computed modulo N. The shift weighting s_t is a vector emitted by each head and defines a distribution over the allowed integer shifts.


Combats possible dispersion of weightings over time.

w_t(i) := \frac{w_t(i)^{\gamma_t}}{\sum_j w_t(j)^{\gamma_t}}

RNN (Recurrent Neural Network)

A type of network which processes a sequence and outputs another of the same length. It maintains a hidden state which is updated as new inputs are read in.

The most basic type of RNN has the functional form:

h_t = \tanh(W_h x_t + U_h h_{t-1} + b_h)

o_t = V h_t + b_o

Where x_t, o_t and h_t are the input, output and hidden states at time t, respectively.

RNN Encoder-Decoder

Common architecture for translation.

Consists of two RNNs. One encodes the input sequence into a fixed-length vector representation, the other decodes it into an output sequence. The original, proposed in Cho et al. (2014), uses the GRU to model sequential information using fewer parameters than the LSTM. Can be augmented with sampled softmax, bucketing and padding.

Sequence to sequence

Any machine learning task that takes one sequence and turns it into another.

Examples include:

  • Translation
  • Text-to-speech
  • Speech-to-text
  • Part of speech tagging (POS tagging)


Sequence model notable for not using recurrence or convolutions - only attention.

Attained state of the art accuracy on translation tasks (Vaswani et al., 2017) and has subsequently been used to get new records on a variety of other tasks (see ‘Used in’).

Attention layers use scaled-dot product attention.

Both the encoder and decoder are comprised of multiple blocks each with a multi-ahead attention layer, two fully-connected layers and two layer-normalisation components.

Multi-head attention

Concatenates the output of multiple parallel attention layers. Each layer has the same inputs (Q, K and V) but different weights. Vaswani et al. (2017) use 8 layers in each multi-head attention component but reduce the dimensionality of each from 512 to 64, which keeps the computational cost the same overall.

Positional encoding

Positional encodings are added (summed, not concatenated) to the input embeddings to allow the model to be aware of the sequence order.


Usage in pre-trained language models

Devlin et al. (2018) pre-train a bidirectional transformer and use this model to attain state of the art accuracy on a variety of natural language tasks. The transformer is first pre-trained to predict masked out tokens and predict next sentences and then fine-tuned on the tasks to be evaluated.