Explore-exploit dilemma

Refers to the trade-off between exploitation, which maximises reward in the short term, and exploration, which sacrifices short-term reward for knowledge that can increase rewards in the long term.

See the multi-armed bandit problem for an example.

Acquisition functions

A function that decides the next point to sample, balancing exploration and exploitation in order to maximise the cumulative reward.
Probability of improvement

Pick the action that maximises the probability of reaching a state whose value is greater than that of the current best state. The reward is 1 if the new state is better and 0 otherwise. This means it will eschew possible large improvements in favour of more certain small ones.

If all nearby states are known to be worse this strategy can lead to getting stuck in local optima.
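Under a Gaussian model of the new state's value, the probability of improvement has the closed form \Phi((\mu - best)/\sigma). A minimal sketch, assuming such a Gaussian model (function and variable names are illustrative):

```python
import math

def probability_of_improvement(mu, sigma, best):
    """P(X > best) for X ~ N(mu, sigma^2), i.e. Phi((mu - best) / sigma)."""
    if sigma == 0.0:
        return float(mu > best)
    z = (mu - best) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# A small, near-certain improvement beats a risky large one under PI:
safe  = probability_of_improvement(mu=1.1, sigma=0.1, best=1.0)  # ~0.84
risky = probability_of_improvement(mu=1.5, sigma=2.0, best=1.0)  # ~0.60
```

The example shows the eschewing of large-but-uncertain gains described above: the risky point has a much higher mean yet a lower probability of improvement.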

Expected improvement

Pick the action which maximises the expected improvement of that new state over the current best state. The reward is the difference between the values if the new state is better than the old one and zero otherwise.

A higher expected improvement can be obtained by increasing either the mean or the variance of the value distribution of the next state.
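With the same Gaussian assumption as for probability of improvement, E[max(X - best, 0)] also has a closed form. A sketch (names illustrative) demonstrating that raising either the mean or the variance raises the expected improvement:

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(mu, sigma, best):
    """E[max(X - best, 0)] for X ~ N(mu, sigma^2)."""
    if sigma == 0.0:
        return max(mu - best, 0.0)
    z = (mu - best) / sigma
    return (mu - best) * normal_cdf(z) + sigma * normal_pdf(z)

base        = expected_improvement(mu=1.0, sigma=0.5, best=1.0)
higher_mean = expected_improvement(mu=1.2, sigma=0.5, best=1.0)
higher_var  = expected_improvement(mu=1.0, sigma=1.0, best=1.0)
```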

Upper confidence bound

Calculate confidence intervals for the rewards from each action in a given state. Pick the action for which the upper bound of the reward is greatest. This will lead to actions with greater uncertainty being chosen since the confidence interval will be larger.
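For bandits this idea is the UCB1 rule: play the arm maximising the empirical mean plus a sqrt(c·ln t / n) confidence term. A sketch, where the constant c and the names are illustrative:

```python
import math

def ucb1_action(counts, means, t, c=2.0):
    """Pick the arm with the highest upper confidence bound.
    Untried arms get an infinite bound and so are tried first."""
    best, best_score = 0, float("-inf")
    for a, (n, mu) in enumerate(zip(counts, means)):
        score = float("inf") if n == 0 else mu + math.sqrt(c * math.log(t) / n)
        if score > best_score:
            best, best_score = a, score
    return best

# Arm 1 has the highest mean, but arm 2 has been pulled far less often,
# so its wider confidence interval gives it the highest upper bound:
a = ucb1_action(counts=[100, 100, 5], means=[0.4, 0.6, 0.5], t=205)
```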

Boltzmann exploration

A method for addressing the explore-exploit dilemma and choosing actions off-policy: sample actions from a softmax over the Q-values with temperature \tau, decreasing the temperature over time. High temperatures incentivize exploration and low temperatures prioritize exploitation.

\pi(a|s) = \frac{e^{Q(s,a)/\tau}}{\sum_{a' \in A} e^{Q(s,a')/\tau}}

As \tau approaches 0, the policy approaches the greedy (argmax) policy. As \tau grows large, it approaches the uniform distribution. The intermediate temperature at which the action probabilities best reflect the differences in the Q-values is not fixed but depends on the scale of the Q-values, so \tau must be tuned per problem.
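A sketch of the softmax policy over Q-values; subtracting the max before exponentiating is a numerical-stability trick, not part of the formula:

```python
import math

def boltzmann_policy(q_values, tau):
    """Softmax over Q-values at temperature tau."""
    logits = [q / tau for q in q_values]
    m = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

q = [1.0, 2.0, 3.0]
cold = boltzmann_policy(q, tau=0.1)   # nearly greedy: mass on the best action
hot  = boltzmann_policy(q, tau=10.0)  # nearly uniform
```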

Count-based exploration

Use the count of the number of times an agent has visited a state to guide exploration. In its basic form this is only useful when the state space is small enough for states to be revisited often.
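A sketch of one common form of count-based bonus, where the extrinsic reward is augmented by beta / sqrt(N(s)); the 1/sqrt form and the weight beta are illustrative choices, not the only ones:

```python
from collections import defaultdict

class CountBonus:
    """Add an exploration bonus beta / sqrt(N(s)) to the extrinsic reward."""
    def __init__(self, beta=1.0):
        self.beta = beta
        self.counts = defaultdict(int)

    def reward(self, state, extrinsic):
        self.counts[state] += 1
        return extrinsic + self.beta / self.counts[state] ** 0.5

bonus = CountBonus(beta=1.0)
r1 = bonus.reward("s0", 0.0)  # first visit: large bonus
r2 = bonus.reward("s0", 0.0)  # bonus shrinks as the state becomes familiar
```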

Entropy of the policy

The entropy of the policy’s distribution over actions H(\pi(s;\theta)) can be added to the loss function to incentivize exploration.

Asynchronous Methods for Deep Reinforcement Learning, Mnih et al. (2016)
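A sketch of the entropy term; the weight beta and the sign convention (subtracting beta * H from a loss that is minimised) are illustrative assumptions:

```python
import math

def policy_entropy(probs):
    """H(pi) = -sum_a pi(a|s) log pi(a|s)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def loss_with_entropy_bonus(policy_loss, probs, beta=0.01):
    """Subtracting beta * H(pi) penalizes low-entropy (overconfident) policies."""
    return policy_loss - beta * policy_entropy(probs)
```

A uniform policy over n actions has the maximum entropy log(n), so for the same policy loss it is penalized less than a deterministic policy.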

Epsilon-greedy policy

A method for inducing exploration during training. Choose the greedy action with probability 1-\epsilon and a uniformly random action with probability \epsilon.

Unlike more sophisticated methods, epsilon-greedy does not automatically decrease the exploration rate as the value estimates become more certain; \epsilon must be annealed by hand.
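A sketch of the selection rule (names illustrative):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon take a uniformly random action,
    otherwise take the greedy (argmax) action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)
```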

Greedy policy

A policy that always selects the best action according to its value or Q function without any regard for exploration.

Intrinsic motivation

Technique in which exploration is motivated by the change in prediction error from learning a new piece of information.

Multi-armed bandit problem

Example of the explore-exploit trade-off. Each machine (or bandit) provides a random reward from a probability distribution specific to that machine. A balance must be struck between picking the machine with the best expected return (exploiting) and picking ones about which little is known in the hope that they may be better (exploring).

The objective is to maximise the sum of the rewards over a fixed number of plays.

A simple approach that works well is to use an epsilon-greedy policy.
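A sketch putting the pieces together: an epsilon-greedy agent on a Gaussian bandit, with arm estimates maintained by the incremental mean update Q <- Q + (r - Q)/n. The arm means, step count and \epsilon are illustrative:

```python
import random

def run_bandit(true_means, steps=10000, epsilon=0.1, seed=0):
    """Epsilon-greedy on a Gaussian bandit with incremental mean updates."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k
    estimates = [0.0] * k
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(k)                               # explore
        else:
            a = max(range(k), key=estimates.__getitem__)       # exploit
        r = rng.gauss(true_means[a], 1.0)
        counts[a] += 1
        estimates[a] += (r - estimates[a]) / counts[a]         # incremental mean
        total += r
    return estimates, total / steps

estimates, avg = run_bandit([0.2, 0.8, 0.5])
```

With enough steps the agent identifies the best arm and its average reward approaches the best mean, less the cost of the epsilon fraction of exploratory pulls.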

Off-policy and on-policy learning

In off-policy learning the behaviour distribution does not follow the policy. This allows a more exploratory behaviour distribution to be chosen, using an epsilon-greedy policy or Boltzmann exploration, for example.

In on-policy learning the policy determines the samples the network is trained on. An example is SARSA.

Thompson sampling

Method for addressing the explore-exploit dilemma. Each action is chosen with probability equal to the probability that it is the action which maximises the expected reward.

This can be done by maintaining a probability distribution over the parameters of the environment (e.g. which actions are associated with which rewards in a given state). Parameters are sampled from this distribution and the action that is optimal under the sampled parameters is chosen.
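A sketch for a Bernoulli bandit, where a Beta(s+1, f+1) posterior over each arm's success probability makes the sample-then-act step trivial; the particular counts used below are illustrative:

```python
import random

def thompson_step(successes, failures, rng=random):
    """Sample p_a ~ Beta(s_a + 1, f_a + 1) for each arm,
    then play the arm whose sampled parameter is largest."""
    samples = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=samples.__getitem__)

# Arm 0 looks good with high confidence (90/100 successes); arm 1 is
# uncertain (5/10). Arm 0 is chosen most of the time, but arm 1's wide
# posterior means it still gets the occasional exploratory pull.
rng = random.Random(0)
picks = [thompson_step([90, 5], [10, 5], rng) for _ in range(1000)]
```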