Types of policy-learning algorithms

Model-based reinforcement learning

Learns or is given a model of the environment and uses it to predict the distribution over next states that will result from a given state-action pair.
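
As a rough illustration of what "modelling the environment" can mean in the simplest tabular case, the sketch below estimates the next-state distribution from observed transition counts. The class name `TabularModel`, its methods, and the state and action labels are hypothetical names chosen for this example; real model-based methods typically use richer function approximators.

```python
from collections import Counter, defaultdict


class TabularModel:
    """Count-based estimate of P(next_state | state, action). Illustrative only."""

    def __init__(self):
        # counts[(state, action)][next_state] = number of observed transitions
        self.counts = defaultdict(Counter)

    def update(self, state, action, next_state):
        """Record one observed transition."""
        self.counts[(state, action)][next_state] += 1

    def predict(self, state, action):
        """Return a dict mapping next_state -> estimated probability."""
        seen = self.counts.get((state, action))
        if not seen:
            return {}  # no data for this state-action pair yet
        total = sum(seen.values())
        return {s: n / total for s, n in seen.items()}


model = TabularModel()
for next_state in ["s1", "s1", "s2", "s1"]:
    model.update("s0", "right", next_state)

print(model.predict("s0", "right"))  # {'s1': 0.75, 's2': 0.25}
```

A planner could then query `predict` to evaluate actions without interacting with the real environment.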

Model-free reinforcement learning

Algorithms that learn the policy without requiring a model of the environment. Q-learning is an example.

Off-policy learning

The behaviour policy that generates the training data differs from the target policy being learned. Typically a more exploratory behaviour policy is chosen. An example is Q-learning.
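
A minimal tabular sketch of why Q-learning counts as off-policy: actions are selected by an exploratory epsilon-greedy behaviour policy, while the update bootstraps from the greedy (maximising) next action regardless of what the behaviour policy does next. The function names, the state and action labels, and the values of `alpha` and `gamma` are assumptions made for illustration.

```python
import random
from collections import defaultdict


def epsilon_greedy(Q, state, actions, epsilon, rng):
    """Behaviour policy: explore with probability epsilon, otherwise act greedily."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])


def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    """Bootstrap from the best next action, independent of the behaviour policy."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])


actions = ["left", "right"]
Q = defaultdict(float)          # unseen state-action pairs default to 0.0
Q[("s1", "right")] = 2.0        # pretend this value was learned earlier

# One transition (s0, left, reward 1.0, s1); the target uses max over Q(s1, .)
q_learning_update(Q, "s0", "left", r=1.0, s_next="s1", actions=actions)

# With epsilon = 0 the behaviour policy reduces to the greedy target policy
rng = random.Random(0)
chosen = epsilon_greedy(Q, "s0", actions, epsilon=0.0, rng=rng)
```

Here the updated value of `Q[("s0", "left")]` is 0.5 * (1.0 + 0.9 * 2.0), roughly 1.4, even if the behaviour policy would go on to pick "left" in `s1`.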

On-policy learning

The policy being learned also generates the samples the network is trained on. Because the data distribution depends on the current policy, this can introduce bias to the estimator. An example is SARSA.
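
The on-policy character of SARSA shows up in its update target, which uses the next action the policy actually selected rather than the maximising one. The following is a small tabular sketch; the function name, the labels, and the values of `alpha` and `gamma` are illustrative assumptions.

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    """Bootstrap from the action a_next that the current policy really took."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])


Q = defaultdict(float)          # unseen state-action pairs default to 0.0
Q[("s1", "left")] = 0.0
Q[("s1", "right")] = 2.0

# Suppose the policy explored and picked "left" in s1, not the greedy "right".
sarsa_update(Q, "s0", "left", r=1.0, s_next="s1", a_next="left")
```

The target here is 1.0 + 0.9 * 0.0, so `Q[("s0", "left")]` moves to 0.5; the exploratory choice in `s1` directly shapes the value estimate.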

Policy-based method

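Learns the policy explicitly rather than deriving it from a value function, as value-based methods do. A pure policy-based method uses no value function, although actor-critic methods combine the two.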

Policy gradient method

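Policy-learning algorithm that parameterises the policy directly and updates its parameters by gradient ascent on the expected return. Actor-critic variants iteratively alternate between improving the policy given the value function and updating the value function under the current policy.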
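
For reference, the gradient that such methods follow is commonly written in the REINFORCE form below. This is a sketch under stated assumptions: an episodic, undiscounted task with horizon T, a policy \pi_\theta with parameters \theta, step size \alpha, and return-to-go G_t; none of these symbols appear in the notes above.

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[
      \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t
    \right],
\qquad
G_t = \sum_{k=t}^{T-1} r_{k+1}

\theta \leftarrow \theta + \alpha \, \nabla_\theta J(\theta)
```

In practice the expectation is estimated from sampled trajectories, and G_t is often replaced by an advantage estimate from a learned value function to reduce variance.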

Value-based methods

Have an implicit policy: in each state, choose the action that maximises the learned action-value function.
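
A minimal sketch of how a policy can be read off a value table, assuming action values are stored in a dict keyed by (state, action). The function name `greedy_policy` and the example values are hypothetical.

```python
def greedy_policy(Q, state, actions):
    """Pick the action with the highest estimated value in this state."""
    # Missing entries are treated as 0.0, an arbitrary default for this sketch.
    return max(actions, key=lambda a: Q.get((state, a), 0.0))


Q = {("s0", "left"): 0.2, ("s0", "right"): 0.7}
best = greedy_policy(Q, "s0", ["left", "right"])
print(best)  # right
```

No separate policy is stored; changing the values in `Q` changes the behaviour.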