# Explore-exploit dilemma¶

Refers to the trade-off between exploitation, which maximises reward in the short-term, and exploration which sacrifices short-term reward for knowledge which can increase rewards in the long term.

See the multi-armed bandit problem for an example.

## Acquisition function¶

A function that decides the next point to sample while trying to maximize the cumulative reward, balancing exploration and exploitation.

Read the full explanation under Bayesian Optimization.

Examples of acquisition functions:

## Boltzmann exploration¶

Method for addressing the explore-exploit dilemma and choosing actions off-policy. Decrease the temperature over time. High temperatures incentivize exploration and low temperatures prioritize exploitation.

As approaches 0, it approximates a maximum function. As becomes large (above 3), the uniform distribution is approximated. The place where the distribution of probabilities is proportional to the differences in the Q-values does not appear to be fixed but usually occurs around 0.5.

## Count-based exploration¶

Use the count of the number of times an agent has visited a state to guide exploration.

Count-based exploration is only useful if the number of possible states is very small. If there are a large number of states the count for the vast majority of them will be zero, even if a very similar state has been visited before.

## Entropy of the policy¶

The entropy of the policy’s distribution over actions can be added to the loss function to incentivize exploration.

## Epsilon-greedy policy¶

A method for inducing exploration during training. Choose the greedy policy with probability and a random action with probability .

Unlike more sophisticated methods, epsilon-greedy cannot automatically decrease the exploration rate over time as the model becomes more certain.

## Greedy policy¶

A policy that always selects the best action according to its value or Q function without any regard for exploration.

## Intrinsic motivation¶

Technique in which exploration is motivated by the change in prediction error from learning a new piece of information.

## Multiple Armed Bandit Problem¶

Example of the explore-exploit tradeoff. Each machine (or bandit) provides a random reward from a probability distribution specific to that machine. A balance must be struck between picking the machine which has the best expected return (exploiting) and picking one’s about which not much is known in the hope that they may be better (exploring).

The objective is to maximise the sum of the rewards over a predetermined number of moves.

A simple approach that works reasonably well is to use an epsilon-greedy policy.

## Off-policy and on-policy learning¶

In off-policy learning the behaviour distribution does not follow the policy. This allows a more exploratory behaviour distribution to be chosen, using an epsilon-greedy policy or Boltzmann exploration, for example.

In on-policy learning the policy determines the samples the network is trained on. An example is SARSA.

## Thompson sampling¶

Method for addressing the explore-exploit dilemma. An action is chosen with probability equal to the probability that that action maximises the expected reward.

This can be done by maintaining a probability distribution for the parameters of the environment (eg which actions are associated with which rewards in a given state). Then parameters can be sampled from this distribution and the expected optimal action chosen given the sampled parameters.