Absorbing state

See terminal state.

Action space

The set of all possible actions. It may be discrete, as in Chess, or continuous, as in many robotics tasks.

Behaviour distribution

The probability distribution over sequences of state-action pairs that describes the behaviour of an agent.

Bellman equation

Relates the value of a state (or state-action pair) to the values of its successors. The optimality form below captures the intuition that if the value at the next timestep is known for all possible actions, the optimal strategy is to select the action that maximizes that value plus the immediate reward.

Q(s,a) = \mathbb{E}_{s'}[r(s,a) + \gamma \max_{a'} Q(s',a')]

Where r is the immediate reward, \gamma is the discount factor, s and s' are states, and a and a' are actions. Q(s,a) is the action-value function: the expected return from taking action a in state s and acting optimally thereafter.
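Iterating this backup until it converges (value iteration) recovers the optimal values. A minimal sketch on a made-up two-state deterministic toy problem; all transitions, rewards, and the discount factor are illustrative assumptions:

```python
# Value-iteration sketch on a toy deterministic MDP (all numbers illustrative).
GAMMA = 0.9  # discount factor

# transitions[state][action] = (next_state, reward); state 2 is terminal.
transitions = {
    0: {"left": (0, 0.0), "right": (1, 1.0)},
    1: {"left": (0, 0.0), "right": (2, 10.0)},
}

V = {0: 0.0, 1: 0.0, 2: 0.0}  # the terminal state keeps value 0

for _ in range(100):  # repeat the Bellman backup until convergence
    for s, actions in transitions.items():
        # V(s) = max_a [ r(s, a) + gamma * V(s') ]
        V[s] = max(r + GAMMA * V[s2] for s2, r in actions.values())
```

On this toy problem both non-terminal states converge to value 10: state 1 reaches the reward of 10 directly, and state 0 reaches it one step later (1 + 0.9 × 10).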


Breadth

In the context of games with a discrete action space, like Chess and Go, breadth is the average number of possible moves per position.

Control policy

See policy.

Credit assignment problem

The problem of determining which of the actions taken along the way helped, and which hindered, in reaching a particular reward.


Depth

The average length of a game, in moves.

Discount factor

A number \gamma between 0 and 1 that weights future rewards: a reward received t steps in the future is multiplied by \gamma^t. Values closer to 0 make the agent concentrate on short-term rewards; values closer to 1 make it weigh long-term rewards almost as heavily.
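For illustration, the discounted return G = \sum_t \gamma^t r_t of the same reward stream under a small and a large discount factor (the reward sequence is a made-up example):

```python
# How the discount factor weights a fixed reward stream (made-up example).
rewards = [1.0, 1.0, 1.0, 1.0, 1.0]

def discounted_return(rewards, gamma):
    # G = sum_t gamma^t * r_t
    return sum(gamma ** t * r for t, r in enumerate(rewards))

short_sighted = discounted_return(rewards, 0.1)  # ~1.11: mostly the first reward
far_sighted = discounted_return(rewards, 0.9)    # ~4.10: later rewards still count
```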


Episode

Analogous to a single game or trial. It ends when a terminal state is reached or after a predetermined number of steps.

Markov Decision Process (MDP)

Models the environment using Markov chains, extended with actions and rewards.
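A minimal sketch of an MDP as plain data — a state set, an action set, a stochastic transition function P(s' | s, a), and a reward function r(s, a). All names and probabilities below are illustrative assumptions:

```python
import random

# A two-state MDP as plain data (all names and probabilities illustrative).
states = ["sunny", "rainy"]
actions = ["walk", "drive"]

# P[(s, a)] = list of (next_state, probability) pairs.
P = {
    ("sunny", "walk"):  [("sunny", 0.8), ("rainy", 0.2)],
    ("sunny", "drive"): [("sunny", 0.9), ("rainy", 0.1)],
    ("rainy", "walk"):  [("sunny", 0.3), ("rainy", 0.7)],
    ("rainy", "drive"): [("sunny", 0.5), ("rainy", 0.5)],
}

def reward(s, a):
    return 1.0 if s == "sunny" else -1.0

def step(s, a, rng=random):
    # Markov property: the next state depends only on the current (s, a),
    # not on how the agent arrived at s.
    next_states, probs = zip(*P[(s, a)])
    return rng.choices(next_states, weights=probs)[0], reward(s, a)
```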

Partially Observable Markov Decision Process (POMDP)

Generalization of the MDP. The agent cannot directly observe the underlying state.


Policy

A function \pi that maps states to actions.


Regret

The difference in cumulative reward between performing optimally and executing the given policy.

Reward function

Maps state-action pairs to rewards.


REINFORCE

A simple policy-gradient learning algorithm.

If a policy \pi_\theta executes action a in state s with some corresponding value v (e.g. the observed return), the update rule is:

\Delta \theta = \alpha \nabla_\theta \log \pi_\theta(s,a) \, v

Where \alpha is the learning rate and \nabla_\theta denotes the gradient with respect to the policy parameters \theta.
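A minimal sketch of this update for a softmax policy over two actions in a single state; the learning rate, returns, and loop count are illustrative assumptions. For a softmax policy the score \nabla_\theta \log \pi_\theta(a) has the simple closed form used below:

```python
import math

# REINFORCE sketch: softmax policy over two actions in a single state
# (learning rate, returns, and iteration count are illustrative assumptions).
ALPHA = 0.1
theta = [0.0, 0.0]  # one preference per action

def policy(theta):
    # pi_theta(a) = softmax over the action preferences
    exps = [math.exp(t) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_update(theta, a, v):
    # Delta theta = alpha * grad_theta log pi_theta(a) * v
    # For a softmax policy: d/d theta_i log pi(a) = 1[i == a] - pi_i.
    pi = policy(theta)
    return [t + ALPHA * ((1.0 if i == a else 0.0) - pi[i]) * v
            for i, t in enumerate(theta)]

# Repeatedly reinforcing action 1 with a positive value raises its probability.
for _ in range(50):
    theta = reinforce_update(theta, a=1, v=1.0)
```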

Terminal state

A state which ends the episode when reached. No further actions by the agent are possible.


Trajectory

The sequence of states and actions experienced by the agent.

Transition function

Maps a state and an action to a new state; in a stochastic environment, to a probability distribution over next states.

Value function

The value of a state is the expected return (cumulative reward) when starting from that state and following the policy \pi.

V(s) = \mathbb{E}[R \mid s, \pi]
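Since the value is an expectation, it can be estimated by Monte Carlo: average the returns observed over many episodes started from s. A sketch on a trivial one-step environment, which is a made-up illustration:

```python
import random

# Monte Carlo estimate of V(s): average the return over many episodes from s.
# The one-step coin-flip environment below is a made-up illustration.
def episode_return(rng):
    # Single step: reward 1 with probability 0.7, otherwise 0.
    return 1.0 if rng.random() < 0.7 else 0.0

def estimate_value(n_episodes=10_000, seed=0):
    rng = random.Random(seed)
    return sum(episode_return(rng) for _ in range(n_episodes)) / n_episodes

# estimate_value() should be close to the true value, 0.7
```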