Search algorithm to find the most likely output sequence.
In a sequence to sequence problem, the next element in the decoded sequence is highly dependent on the previous one. If the output vocabulary is of size and the sequence is of length the complexity of finding the best sequence is by brute force. Therefore a good heuristic algorithm is needed.
Greedy search runs in time but has poor accuracy. Beam search is a compromise between these two extremes.
In the context sequence-to-sequence beam search is a tree search algorithm. The inner nodes are partial solutions and the leaves are full solutions. At each iteration beam search keeps track of the k best partial solutions. The parameter k is known as the beam width.
1. # frontier maps partial solutions to scores 2. initialise frontier to contain the root node with a score of 0 3. while the end of the sequence has not been reached: 4. select the candidate from frontier with the best score 5. # expand the chosen candidate 6. add all the children of the candidate to frontier 7. compute the scores of all the new nodes in frontier 8. # prune candidates 9. remove all entries not in the top k from frontier
Combines the outputs of two RNNs, one processing the input sequence from left to right (forwards in time) and the other from right to left (backwards in time). The two RNNs are stacked on top of each other and their states are typically combined by appending the two vectors. Bidirectional RNNs are often used in Natural Language problems, where we want to take the context from both before and after a word into account before making a prediction.
The basic bidirectional RNN can be defined as follows:
Where and are the hidden states in the forwards and backwards directions respectively. is the length of the sequence. Biases have been omitted for simplicity. and are the input and output states at time t, respectively.
Differentiable Neural Computer (DNC)¶
The memory is an NxW matrix. There are N locations, which can be selectively read and written to. Read vectors are weighted sums over the memory locations.
The heads use three forms of differentiable attention which:
- Look up content.
- Record transitions between consecutively written locations in an NxN temporal link matrix L.
- Allocate memory for writing.
GRU (Gated Recurrent Unit)¶
Variation of the LSTM that is simpler to compute and implement, mergeing the cell and the hidden state.
Comparable performance to LSTMs on a translation task. Has two gates, a reset gate and an update gate . Not reducible from LSTM as there is only one tanh nonlinearity. Cannot ‘count’ as LSTMs can. Partially negates the vanishing gradient problem, as LSTMs do.
The formulation is:
Where represents element-wise multiplication and , , , , and are parameters to be learnt. Note the lack of bias terms, in contrast with the LSTM.
is used for constructing the new hidden vector and dictates which information is updated from the new output and which is remembered from the old hidden vector. is used for constructing the output and decides which parts of the hidden vector will be used and which won’t be. The input for the current time-step is always used.
LSTM (Long Short-Term Memory)¶
A type of RNN with a memory cell as the hidden state. Uses a gating mechanism to ensure proper propagation of information through many timesteps. Traditional RNNs struggle to train for behaviour requiring long lags due to the exponential loss in error as back propagation proceeds through time (vanishing gradient problem). LSTMs store the error in the memory cell, making long memories possible. However, repeated access to the cell means the issue remains for many problems.
Can have multiple layers. The input gate determines when the input is significant enough to remember. The output gate decides when to output the value. The forget gate determines when the value should be forgotten.
The activations of the input, forget and output gates are , and respectively. The state of the memory cell is .
Where represents element-wise multiplication.
Each of the input, output and forget gates is surrounded by a sigmoid nonlinearity. This squashes the input so it is between 0 (let nothing through the gate) and 1 (let everything through).
The new cell state is the candidate cell state scaled by the input gate activation, representing how much we want to remember each value and added to the old cell state, scaled by the forget gate activation, how much we want to forget each of those values.
The functions serve to add nonlinearities.
Using an LSTM does not protect from exploding gradients.
Forget bias initialization¶
Helpful to initialize the bias of the forget gate to 1 in order to reduce the scale of forgetting at the start of training. This is done by default in TensorFlow.
Tie the input and output embeddings. May only be applicable to generative models. Discriminative ones do not have an output embedding.
Clip the activations of the memory cells to a range such as [-3,3] or [-50,50]. Helps with convergence problems by preventing exploding gradients and saturation in the sigmoid/tanh nonlinearities. Deep Recurrent Neural Networks for Acoustic Modelling, Chan and Lane (2015) LSTM RNN Architectures for Large Scale Acoustic Modeling, Sak et al. (2014)
Allows precise timing to be learned, such as the frequency of a signal and other periodic patterns. Learning Precise Timing with LSTM Recurrent Networks, Ger et al. (2002) LSTM RNN Architectures for Large Scale Acoustic Modeling, Sak et al. (2014)
Neural Turing Machine (NTM)¶
Can infer simple algorithms like copying, sorting and associative recall.
Has two principal components:
- A controller, an LSTM. Takes the inputs and emits the outputs for the NTM as a whole.
- A memory matrix.
The controller interacts with the memory via a number of read and write heads. Read and write operations are ‘blurry’. A read is a convex combination of ‘locations’ or rows in the memory matrix, according to a weight vector over locations assigned to the read head. Writing uses an erase vector and an add vector. Both content-based and location-based addressing systems are used.
Similarity between vectors is measured by the cosine similarity.
Location-based addressing is designed for both iteration across locations and random-access jumps.
Compares a key vector to each location in memory, to produce a normalised weighting, . is the key strength, used to amplify or attenuate the focus.
Blends the weighting produced at the previous time step and the content weighting. An ‘interpolation gate’ is emitted by each head. If the addressing is entirely content-based. If , the addressing is entirely location-based.
Provides a rotation to the weight vector . All index arithmetic is computed modulo N. The shift weighting is a vector emitted by each head and defines a distribution over the allowed integer shifts.
RNN (Recurrent Neural Network)¶
A type of network which processes a sequence and outputs another of the same length. It maintains a hidden state which is updated as new inputs are read in.
The most basic type of RNN has the functional form:
Where , and are the input, output and hidden states at time t, respectively.
Common architecture for translation.
Consists of two RNNs. One encodes the input sequence into a fixed-length vector representation, the other decodes it into an output sequence. The original, proposed in Cho et al. (2014), uses the GRU to model sequential information using fewer parameters than the LSTM. Can be augmented with sampled softmax, bucketing and padding.
Sequence to sequence¶
Any machine learning task that takes one sequence and turns it into another.
- Part of speech tagging (POS tagging)
Sequence model notable for not using recurrence or convolutions - only attention.
Attained state of the art accuracy on translation tasks (Vaswani et al., 2017) and has subsequently been used to get new records on a variety of other tasks (see ‘Used in’).
Attention layers use scaled-dot product attention.
Both the encoder and decoder are comprised of multiple blocks each with a multi-ahead attention layer, two fully-connected layers and two layer-normalisation components.
Concatenates the output of multiple parallel attention layers. Each layer has the same inputs (Q, K and V) but different weights. Vaswani et al. (2017) use 8 layers in each multi-head attention component but reduce the dimensionality of each from 512 to 64, which keeps the computational cost the same overall.
Positional encodings are added (summed, not concatenated) to the input embeddings to allow the model to be aware of the sequence order.
Usage in pre-trained language models¶
Devlin et al. (2018) pre-train a bidirectional transformer and use this model to attain state of the art accuracy on a variety of natural language tasks. The transformer is first pre-trained to predict masked out tokens and predict next sentences and then fine-tuned on the tasks to be evaluated.