Types of policy-learning algorithms

Model-based reinforcement learning

Learns or is given a model of the environment, which predicts the distribution over next states (and typically the reward) that will result from a given state-action pair. The model can then be used for planning.
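One simple way to obtain such a model in a small, discrete environment is to count observed transitions and normalise the counts. The sketch below assumes tabular states and actions; the function name fit_transition_model and the state and action labels are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict


def fit_transition_model(transitions):
    """Estimate P(next_state | state, action) from (state, action, next_state) triples."""
    counts = defaultdict(Counter)
    for state, action, next_state in transitions:
        counts[(state, action)][next_state] += 1

    model = {}
    for key, next_counts in counts.items():
        total = sum(next_counts.values())
        # Normalise counts into an empirical distribution over next states.
        model[key] = {s2: n / total for s2, n in next_counts.items()}
    return model


# Hypothetical experience: taking "right" in s0 usually, but not always, reaches s1.
observed = [
    ("s0", "right", "s1"),
    ("s0", "right", "s1"),
    ("s0", "right", "s0"),
    ("s1", "right", "s2"),
]
model = fit_transition_model(observed)
print(model[("s0", "right")])
```

A planner could then query model[(state, action)] to simulate outcomes without interacting with the real environment.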

Model-free reinforcement learning

Algorithms that learn the policy without requiring a model of the environment. Q-learning is an example.

Off-policy learning

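The behaviour policy used to collect experience differs from the target policy being learned. Typically a more exploratory behaviour policy is chosen. An example is Q-learning, whose update bootstraps from the greedy action in the next state regardless of which action the agent actually takes there.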
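The off-policy character of Q-learning shows up in its update rule, which takes a maximum over next actions instead of using the action the behaviour policy picks. Below is a minimal tabular sketch; the dictionary-based Q-table, the function name q_learning_update, and the values of alpha and gamma are assumptions made for illustration.

```python
def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the table."""
    # Bootstrap from the best next action, whatever the agent actually does next.
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    td_target = reward + gamma * best_next
    old_value = q.get((state, action), 0.0)
    q[(state, action)] = old_value + alpha * (td_target - old_value)
    return q


actions = ["left", "right"]
q = {("s1", "left"): 1.0, ("s1", "right"): 2.0}
q_learning_update(q, "s0", "right", reward=1.0, next_state="s1", actions=actions)
print(q[("s0", "right")])
```

Because the target depends only on the maximum over next actions, the experience can come from any sufficiently exploratory behaviour policy, for example an epsilon-greedy one.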

On-policy learning

The policy being learned also generates the samples it is trained on. Because the data distribution depends on the current policy, this can introduce bias to the estimator. An example is SARSA.
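In SARSA the update uses the action that the current policy actually selects in the next state, which is what makes it on-policy. Below is a minimal tabular sketch; the dictionary-based Q-table, the function name sarsa_update, and the values of alpha and gamma are assumptions made for illustration.

```python
def sarsa_update(q, state, action, reward, next_state, next_action,
                 alpha=0.5, gamma=0.9):
    """Apply one tabular SARSA update in place and return the table."""
    # Bootstrap from the action the policy really took in next_state.
    td_target = reward + gamma * q.get((next_state, next_action), 0.0)
    old_value = q.get((state, action), 0.0)
    q[(state, action)] = old_value + alpha * (td_target - old_value)
    return q


q = {("s1", "left"): 1.0, ("s1", "right"): 2.0}
# The policy happened to choose "left" in s1, so that value enters the target.
sarsa_update(q, "s0", "right", reward=1.0, next_state="s1", next_action="left")
print(q[("s0", "right")])
```

The name comes from the quintuple the update consumes: state, action, reward, next state, next action.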

Policy-based method

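Learns the policy explicitly, unlike value-based methods, which instead choose the action that maximises the value function. A purely policy-based method does not use a value function, although actor-critic methods combine a learned policy with a learned value function.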

Policy gradient method

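Policy-learning algorithm that parameterises the policy and adjusts the parameters by gradient ascent on the expected return. Not to be confused with policy iteration, which iteratively alternates between estimating the value function under the current policy and improving the policy given that value function.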
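A policy gradient update can be sketched on a two-armed bandit with a softmax policy, using the REINFORCE estimator (reward times the gradient of the log-probability of the chosen arm). This is a toy setting assumed for illustration, not a full episodic implementation; the function names, the deterministic arm rewards, the learning rate, and the seed are all hypothetical choices.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit_policy(arm_rewards, steps=2000, lr=0.1, seed=0):
    """Run REINFORCE on a bandit whose arms pay fixed rewards."""
    rng = random.Random(seed)
    prefs = [0.0] * len(arm_rewards)  # policy parameters, one per arm
    for _ in range(steps):
        probs = softmax(prefs)
        arm = rng.choices(range(len(prefs)), weights=probs)[0]
        reward = arm_rewards[arm]
        for a in range(len(prefs)):
            # Gradient of log pi(arm) with respect to prefs[a] for a softmax policy.
            grad_log_pi = (1.0 if a == arm else 0.0) - probs[a]
            prefs[a] += lr * reward * grad_log_pi
    return softmax(prefs)


final_probs = train_bandit_policy([0.0, 1.0])
print(final_probs)
```

Under these assumptions the probability of the rewarding arm should grow over training, since only pulls of that arm produce a non-zero update.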

Value-based methods

Learn a value function and have an implicit policy based on choosing the action that maximises it.
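The implicit policy can be read directly off a table of action values by taking the argmax in each state. A minimal sketch, assuming a dictionary-based Q-table with made-up values; the function name greedy_action and the state and action labels are hypothetical.

```python
def greedy_action(q, state, actions):
    """Return the action with the highest estimated value in the given state."""
    return max(actions, key=lambda a: q.get((state, a), 0.0))


# Hypothetical action-value estimates for two states and two actions.
q = {
    ("s0", "left"): 0.2,
    ("s0", "right"): 0.7,
    ("s1", "left"): 1.5,
    ("s1", "right"): -0.3,
}
policy = {s: greedy_action(q, s, ["left", "right"]) for s in ["s0", "s1"]}
print(policy)
```

No separate policy is stored here: changing the value estimates changes the behaviour automatically.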