Temporal-difference learning

Temporal-difference learning optimizes the model to make predictions of the total return more similar to other, more accurate, predictions. These latter predictions are more accurate because they were made at a later point in time, closer to the end.

The TD error is defined as:

\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)

Q-learning is an example of TD learning.
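As a concrete illustration, here is a minimal tabular TD(0) update for a state-value function (state names, reward, and step sizes are arbitrary toy values):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One TD(0) step: move V(s) toward the bootstrapped target r + gamma * V(s')."""
    td_error = r + gamma * V[s_next] - V[s]  # the TD error, delta_t
    V[s] += alpha * td_error                 # nudge the earlier prediction toward the later one
    return td_error

# Toy example: transition from state A to state B with reward 1.
V = {"A": 0.0, "B": 0.5}
delta = td0_update(V, "A", 1.0, "B")
```

The later prediction r + \gamma V(s') is treated as the more accurate target, and V(s) is moved a fraction \alpha of the way toward it.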

Action-value function

Another term for the Q-function.

Actor-critic method

A type of on-policy temporal-difference method, as well as a policy-gradient algorithm.

The policy is the actor and the value function is the critic, with the ‘criticism’ being the TD error. If the TD error is positive, the value of the action was greater than expected, suggesting the chosen action should be taken more often. If the TD error is negative, the action had a lower value than expected, and so it will be taken less often in similar future states.

Unlike pure policy or value-based methods, actor-critic learns both a policy and a value function.

Q-learning differs from actor-critic in two ways: it is off-policy, and it estimates the value as a function of both the state and the action, not just the state.
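A minimal tabular sketch of these updates, assuming a softmax policy over learned preferences theta[s][a] (the helper name and one-step return are illustrative, not from any particular paper):

```python
import math

def actor_critic_step(theta, V, s, a, r, s_next, alpha_pi=0.1, alpha_v=0.1, gamma=0.99):
    """One-step tabular actor-critic: the critic's TD error drives both updates."""
    delta = r + gamma * V[s_next] - V[s]   # TD error: the 'criticism'
    V[s] += alpha_v * delta                # critic update
    prefs = theta[s]
    z = sum(math.exp(p) for p in prefs.values())
    probs = {act: math.exp(p) / z for act, p in prefs.items()}
    for act in prefs:                      # actor update along grad log pi(a|s)
        prefs[act] += alpha_pi * delta * ((1.0 if act == a else 0.0) - probs[act])
    return delta

# Toy example: action "L" in state "s" yields reward 1 and leads to state "t".
theta = {"s": {"L": 0.0, "R": 0.0}}
V = {"s": 0.0, "t": 0.0}
delta = actor_critic_step(theta, V, "s", "L", 1.0, "t")
```

A positive TD error raises the preference for the chosen action and lowers it for the alternatives, exactly as described above.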

Asynchronous Advantage Actor-Critic (A3C)

An on-policy asynchronous RL algorithm. Can train both feedforward and recurrent agents.

Maintains a policy (the actor) \pi(a_t|s_t;\theta) and a value function (the critic) V(s_t;\theta_v). The policy and value functions are updated after t_{max} steps or when a terminal state is reached. They share all parameters apart from those in their final output layers: the policy output is a softmax over all actions (in the discrete case) and the value output is a single linear unit.

The advantage function for taking action a_t in state s_t is the sum of discounted rewards over the next k steps, plus the difference between the discounted value of the state reached and the value of the current state:

A(s_t,a_t;\theta,\theta_v) = \sum_{i=0}^{k-1}\gamma^i r_{t+i} + \gamma^k V(s_{t+k};\theta_v)-V(s_t;\theta_v), k \leq t_{max}

The loss function is:

L = \log \pi(a_t|s_t;\theta)(R_t-V(s_t;\theta_v)) + \beta H(\pi(s_t;\theta))

where

R_t=\sum_{k=0}^{\infty}\gamma^k r_{t+k}

and H(\pi(s_t;\theta)) is the entropy of the policy’s distribution over actions, included to incentivize exploration. \beta is a hyperparameter controlling the strength of the entropy bonus.

R_t-V(s_t;\theta_v) is the advantage estimate, a temporal-difference term.

It is multiplied by the log probability the policy assigned to the action taken at time t, so actions that led to a higher return than the critic predicted are made more likely, and actions that led to a lower return are made less likely.
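The return and advantage computation and the per-step objective can be sketched as follows (helper names are illustrative, not from the paper; the bootstrap value is V(s_{t+k};\theta_v), or 0 at a terminal state):

```python
import math

def a3c_returns(rewards, values, v_bootstrap, gamma=0.99):
    """n-step returns R_t and advantages R_t - V(s_t) for one rollout segment."""
    R = v_bootstrap
    returns, advantages = [], []
    for r, v in zip(reversed(rewards), reversed(values)):
        R = r + gamma * R           # accumulate discounted return backwards
        returns.append(R)
        advantages.append(R - v)
    return returns[::-1], advantages[::-1]

def a3c_policy_objective(log_prob, advantage, probs, beta=0.01):
    """Per-step objective: log pi(a_t|s_t) * (R_t - V(s_t)) plus entropy bonus."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return log_prob * advantage + beta * entropy

# Toy two-step segment with gamma = 0.5 and a zero critic.
returns, advs = a3c_returns([1.0, 0.0], [0.0, 0.0], 0.0, gamma=0.5)
obj = a3c_policy_objective(math.log(0.5), 1.0, [0.5, 0.5])
```

The advantage estimates would then be held fixed (treated as constants) when differentiating the objective with respect to \theta.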


Q-learning

A model-free iterative algorithm for finding the optimal policy, and a form of temporal-difference learning.

Uses the update rule:

Q(s,a) := Q(s,a) + \alpha(r + \gamma \max_{a'}Q(s',a') - Q(s,a))

where Q(s,a) is the value of performing action a in state s and then acting optimally thereafter. s' is the state that results from performing action a in state s.
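A minimal tabular sketch of this update rule (state and action names are arbitrary; the table is zero-initialized):

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """Off-policy update: bootstrap from the greedy action in the next state s'."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

Q = defaultdict(float)  # tabular Q-function, defaults to 0.0
q_learning_update(Q, "s0", "right", 1.0, "s1", ["left", "right"])
```

Note that the max over next actions is taken regardless of which action the behaviour policy actually chooses next, which is what makes Q-learning off-policy.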

The Q-function

Expresses the expected total reward from taking the action in the given state and following the policy thereafter. Also known as the action-value function.

Q^\pi(s,a) = E[R|s,a,\pi]

Q-learning eventually converges to the optimal policy in any finite MDP, provided every state-action pair continues to be visited. In its simplest form it uses a table to store a value for each state-action pair, although this only works for very small state and action spaces.

Deep Q-learning

A neural network is used to approximate the optimal action-value function Q(s,a), and the actions which maximise Q are chosen.

The action-value function is defined according to the Bellman equation.

The CNN takes an image of the game state as input and outputs a Q-value for each action in that state. This is more computationally efficient than having the action as an input to the network. The action with the largest corresponding Q-value is chosen.

Uses the loss function:

L = \mathbb{E}_{s,a}[(y - Q(s,a;\theta))^2]

where the target, y is defined as:

y = \mathbb{E}_{s'}[r + \gamma \max_{a'} Q(s',a';\theta)|s,a]

This means the target depends on the network weights, unlike in supervised learning. The loss function tries to change the parameters such that the estimate and the true Q-values are as close as possible, making forecasts of action-values more accurate.

Periodically freezing the target Q network helps prevent oscillations or divergence in the learning process.
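A sketch of the loss computation with a frozen target network; the toy lambda Q-functions below stand in for the online and target networks (function names are illustrative):

```python
def dqn_targets(batch, q_target, gamma=0.99):
    """Targets y = r + gamma * max_a' Q(s', a') from the frozen target network.

    At terminal transitions the target is just the reward r.
    """
    return [r if done else r + gamma * max(q_target(s_next))
            for (s, a, r, s_next, done) in batch]

def dqn_loss(batch, q_online, q_target, gamma=0.99):
    """Mean squared error between online estimates Q(s,a) and the frozen targets."""
    ys = dqn_targets(batch, q_target, gamma)
    return sum((y - q_online(s)[a]) ** 2
               for y, (s, a, _, _, _) in zip(ys, batch)) / len(batch)

# Toy stand-ins: each returns a list of Q-values, one per action.
q_online = lambda s: [0.5, 0.0]
q_target = lambda s: [0.0, 1.0]
loss = dqn_loss([("s", 0, 1.0, "s2", False)], q_online, q_target)
```

Because the targets come from the frozen copy, gradients flow only through q_online, which is what stabilizes learning.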

Experience Replay

Sample experiences (s_t, a_t, r_t, s_{t+1}) to update the Q-function from a replay memory which retains the last N experiences. Mnih et al. (2013) set N to 1 million when training over a total of 10 million frames.

Contrast this with on-policy learning algorithms, which learn from events as they experience them. This can cause two problems:

  1. Most gradient descent algorithms rely on the assumption that updates are independently and identically distributed (i.i.d.). Learning on-policy can break that assumption since the update at time t influences the state at the next timestep.
  2. Events are forgotten quickly. This can be particularly harmful in the case of rare but important events.

Both of these problems are solved by using experience replay.

The use of a replay memory means it is necessary to learn off-policy.
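A minimal replay memory along these lines (the class name and tuple layout are illustrative):

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity store of (s, a, r, s_next) transitions; oldest are evicted first."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation between consecutive updates.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

# Toy usage: capacity 3, so the two oldest of 5 pushes are evicted.
memory = ReplayMemory(capacity=3)
for t in range(5):
    memory.push((t, "a", 0.0, t + 1))
```

The deque's maxlen gives the "last N experiences" behaviour described above for free.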

Prioritized Experience Replay

Samples from the replay memory according to a function of the loss. In contrast, the standard approach (e.g. Mnih et al. (2013)) selects past experiences uniformly at random from the replay memory.
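A sketch of the proportional variant, where the sampling probability is a power of each experience's priority (in practice the priority is typically derived from the TD error, often with a small constant added so no experience has zero probability; the function name and values here are illustrative):

```python
import random

def sample_prioritized(priorities, batch_size, alpha=0.6, rng=random):
    """Sample indices with probability proportional to priority_i ** alpha.

    alpha = 0 recovers uniform sampling; alpha = 1 is fully proportional.
    """
    scaled = [p ** alpha for p in priorities]
    return rng.choices(range(len(priorities)), weights=scaled, k=batch_size)

# Toy usage: experience 0 has zero priority and is never sampled.
idx = sample_prioritized([0.0, 2.0, 8.0], batch_size=10)
```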


Distributional Q-learning

Models the distribution of the value function, rather than simply its expectation.

Multi-step bootstrap targets

Replace the expression for the target y in the original deep Q-learning loss function with a sum of discounted rewards and action-values:

y = R^{(n)}_t + \gamma^n \max_{a'} Q(s_{t+n},a')

where

R^{(n)}_t = \sum_{k=0}^{n-1} \gamma^k r_{t+k+1}

The motivation for multi-step bootstrap targets is to speed up learning.

Hessel et al. (2017) set the hyperparameter n equal to 3.
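The target computation is a direct transcription of the two formulas above (the function name is illustrative):

```python
def n_step_target(rewards, q_next_max, n=3, gamma=0.99):
    """y = sum_{k=0}^{n-1} gamma^k * r_{t+k+1}  +  gamma^n * max_a' Q(s_{t+n}, a')."""
    partial_return = sum(gamma ** k * r for k, r in enumerate(rewards[:n]))
    return partial_return + gamma ** n * q_next_max

# Toy check with gamma = 0.5: 1 + 0.5 + 0.25 + 0.125 * 2 = 2.0
y = n_step_target([1.0, 1.0, 1.0], q_next_max=2.0, n=3, gamma=0.5)
```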

Noisy parameters

A method for aiding exploration during training that can be more effective than the traditional epsilon-greedy approach. The linear component y = wx + b of each layer in the network is replaced with:

y = (\mu_w + \sigma_w * \epsilon_w)x + (\mu_b + \sigma_b * \epsilon_b)

where \mu_w and \sigma_w are learned parameter matrices of the same shape as w in the original equation. Similarly, \mu_b and \sigma_b are learned parameter vectors and have the same shape as b. \epsilon_w and \epsilon_b also have the same shape as w and b respectively, but are not learnt - they are random variables.

Since the amount of noise is learnt, no hyperparameter tuning is required, unlike with epsilon-greedy, for example.

The noise parameters can be specified in two ways:

  • Independent Gaussian noise - Learn one noise parameter for each parameter in the main network.
  • Factorised Gaussian noise - The matrix of noise parameters is factorized into two vectors. This means the number of noise parameters needed for each layer is linear in its size rather than quadratic, as it is with independent Gaussian noise.
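A sketch of the factorised variant in plain Python (the noise-scaling function f(x) = sign(x) * sqrt(|x|) follows Fortunato et al. (2017); the helper names and toy shapes are illustrative):

```python
import math
import random

def _f(x):
    """Noise-scaling function: f(x) = sign(x) * sqrt(|x|)."""
    return math.copysign(math.sqrt(abs(x)), x)

def factorised_noise(n_in, n_out, rng=random):
    """Build eps_w (n_out x n_in) and eps_b (n_out) from one noise vector per side.

    Only n_in + n_out noise samples are drawn, rather than n_in * n_out.
    """
    eps_in = [_f(rng.gauss(0, 1)) for _ in range(n_in)]
    eps_out = [_f(rng.gauss(0, 1)) for _ in range(n_out)]
    eps_w = [[eo * ei for ei in eps_in] for eo in eps_out]  # outer product
    return eps_w, eps_out

def noisy_linear(x, mu_w, sigma_w, mu_b, sigma_b, eps_w, eps_b):
    """y = (mu_w + sigma_w * eps_w) x + (mu_b + sigma_b * eps_b), noise elementwise."""
    y = []
    for i in range(len(mu_w)):
        row = [mu_w[i][j] + sigma_w[i][j] * eps_w[i][j] for j in range(len(x))]
        y.append(sum(w * xj for w, xj in zip(row, x)) + mu_b[i] + sigma_b[i] * eps_b[i])
    return y

# Toy usage: with all sigmas zero the layer reduces to an ordinary linear layer.
eps_w, eps_b = factorised_noise(2, 2)
zeros = [[0.0, 0.0], [0.0, 0.0]]
y = noisy_linear([2.0, 3.0], [[1.0, 0.0], [0.0, 1.0]], zeros, [1.0, 1.0], [0.0, 0.0], eps_w, eps_b)
```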


Rainbow

An approach that combines a number of existing improvements to the DQN algorithm to reach state-of-the-art performance on the Atari 2600 benchmark.

It combines:

  • Double Q-learning
  • Prioritized experience replay
  • Dueling networks
  • Multi-step bootstrap targets
  • Distributional Q-learning
  • Noisy parameters

Rainbow: Combining Improvements in Deep Reinforcement Learning, Hessel et al. (2017)


SARSA

An algorithm for learning a policy. The name stands for state-action-reward-state-action.

The update rule for learning the Q-function is:

Q(s_t,a_t) := Q(s_t,a_t) + \alpha (r_{t+1} + \gamma Q(s_{t+1},a_{t+1}) - Q(s_t,a_t))

Where 0 < \alpha < 1 is the learning rate.


1. Randomly initialize Q(s,a)
2. While not converged:
3.   Choose an action a using the policy derived from Q (e.g. epsilon-greedy)
4.   Observe the reward r and the next state s' that result from taking action a in state s
5.   Apply the update rule for the Q-function

Unlike Q-learning, SARSA is an on-policy algorithm and thus learns the Q-values associated with the policy it actually follows. Q-learning, on the other hand, is an off-policy algorithm: it learns the value function of the greedy policy while following a separate exploration/exploitation policy.
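A minimal tabular version of the SARSA update rule (names and values are toy examples):

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    """On-policy update: bootstrap from the action a' the policy actually took in s'."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

# Toy usage: the next action "b" was chosen by the policy, not by a max over actions.
Q = defaultdict(float)
Q[("s1", "b")] = 2.0
sarsa_update(Q, "s0", "a", 1.0, "s1", "b")
```

Compare this with the Q-learning update: SARSA uses Q(s', a') for the action the policy chose, where Q-learning would use max over a' regardless of the action taken.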