Types of policy-learning algorithms¶
Model-based reinforcement learning¶
Models the environment in order to predict the distribution over states that will result from a given state-action pair.
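As a rough illustration of what "modelling the environment" can mean in the simplest tabular case, the sketch below estimates the next-state distribution for a state-action pair from observed transition counts. The class name `TabularModel`, its methods, and the state and action labels are hypothetical, chosen only for this example; practical model-based methods usually use learned function approximators instead.

```python
from collections import Counter, defaultdict


class TabularModel:
    """Empirical transition model: P(next_state | state, action) from counts."""

    def __init__(self):
        # counts[(state, action)][next_state] = number of times observed
        self.counts = defaultdict(Counter)

    def update(self, state, action, next_state):
        """Record one observed transition."""
        self.counts[(state, action)][next_state] += 1

    def predict(self, state, action):
        """Return a dict mapping next_state -> estimated probability."""
        seen = self.counts.get((state, action))
        if not seen:
            return {}  # no data for this state-action pair yet
        total = sum(seen.values())
        return {s: n / total for s, n in seen.items()}


# Hypothetical experience: taking "right" in "s0" four times.
model = TabularModel()
for nxt in ["s1", "s1", "s2", "s1"]:
    model.update("s0", "right", nxt)

dist = model.predict("s0", "right")
print(dist)
```

A planner could then use `predict` to simulate outcomes without interacting with the real environment.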
Model-free reinforcement learning¶
Algorithms that learn the policy without requiring a model of the environment. Q-learning is an example.
Off-policy learning¶
The behaviour distribution, which generates the training samples, does not follow the policy being learned. Typically a more exploratory behaviour distribution is chosen. An example is Q-learning.
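A minimal tabular sketch of Q-learning, the example named above, may make the separation concrete: actions are selected by an exploratory epsilon-greedy behaviour policy, while the update bootstraps from the greedy (maximising) action in the next state. The function names, the step size `alpha`, the discount `gamma`, and the state and action labels are assumptions made for illustration.

```python
import random
from collections import defaultdict


def epsilon_greedy(q, state, actions, epsilon, rng):
    """Behaviour policy: random action with probability epsilon, else greedy."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])


def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.5, gamma=0.9):
    """Move Q(state, action) towards reward + gamma * max_a Q(next_state, a).

    The target uses the best next action, regardless of which action the
    behaviour policy actually takes next.
    """
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])


q = defaultdict(float)           # unseen state-action pairs default to 0.0
actions = ["left", "right"]
q[("s1", "right")] = 2.0         # pretend this value was learned earlier

# One hypothetical transition: s0 --left--> s1 with reward 1.0.
q_learning_update(q, "s0", "left", 1.0, "s1", actions)
print(q[("s0", "left")])

# With epsilon = 0 the behaviour policy reduces to the greedy choice.
print(epsilon_greedy(q, "s1", actions, 0.0, random.Random(0)))
```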
On-policy learning¶
The policy being learned also determines the samples the network is trained on. This can introduce bias into the estimator. An example is SARSA.
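The SARSA update can be sketched in tabular form as follows. The target uses the value of the action the policy actually takes in the next state, rather than a maximum over actions as in Q-learning. The function name, `alpha`, `gamma`, and the state and action labels are illustrative assumptions.

```python
from collections import defaultdict


def sarsa_update(q, state, action, reward, next_state, next_action,
                 alpha=0.5, gamma=0.9):
    """Move Q(state, action) towards reward + gamma * Q(next_state, next_action).

    next_action is the action the current policy really selected in
    next_state, so the samples and the target both come from that policy.
    """
    td_target = reward + gamma * q[(next_state, next_action)]
    q[(state, action)] += alpha * (td_target - q[(state, action)])


q = defaultdict(float)           # unseen state-action pairs default to 0.0
q[("s1", "left")] = 0.0
q[("s1", "right")] = 2.0

# Hypothetical transition: s0 --left--> s1 with reward 1.0, after which the
# policy (for instance through exploration) picks "left" in s1.
sarsa_update(q, "s0", "left", 1.0, "s1", "left")
print(q[("s0", "left")])
```

Because the policy chose `left` in `s1`, the higher value stored for `right` plays no part in this update; a max-based target would have used it.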
Policy-based method¶
Does not use a value function. Learns the policy explicitly, unlike value-based methods, which instead choose the action that maximises the value function.
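One simple way to represent an explicitly learned policy is a softmax over per-action preferences, with the preferences themselves as the learned parameters. The sketch below assumes that representation and applies a single REINFORCE-style step, which is one common way to train such a policy and is not necessarily the method the source has in mind. The names `softmax`, `reinforce_step`, `prefs`, and the learning rate `lr` are hypothetical.

```python
import math


def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    m = max(prefs.values())  # subtract the max for numerical stability
    exps = {a: math.exp(p - m) for a, p in prefs.items()}
    total = sum(exps.values())
    return {a: e / total for a, e in exps.items()}


def reinforce_step(prefs, action, ret, lr=0.1):
    """One gradient-ascent step on ret * log pi(action).

    For a softmax policy, d log pi(action) / d prefs[b] equals
    1[b == action] - pi(b).
    """
    probs = softmax(prefs)
    return {
        a: p + lr * ret * ((1.0 if a == action else 0.0) - probs[a])
        for a, p in prefs.items()
    }


prefs = {"left": 0.0, "right": 0.0}
before = softmax(prefs)["right"]

# Hypothetical sample: the policy chose "right" and observed a return of 1.0.
prefs = reinforce_step(prefs, "right", ret=1.0)
after = softmax(prefs)["right"]
print(before, after)
```

No value function appears anywhere: the action probabilities are adjusted directly from the sampled return.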
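Policy gradient method¶
A policy-based method that adjusts the parameters of the policy by gradient ascent on the expected return.
Policy iteration¶
Policy-learning algorithm. Iteratively alternates between evaluating the value function under the current policy and improving the policy given that value function.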
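The alternation described above can be sketched on a tiny deterministic problem. Everything about the environment here (two non-terminal states, the actions `stay` and `go`, the rewards, and the discount `GAMMA`) is a made-up toy chosen to keep the example short, and evaluation is approximated with a fixed number of sweeps.

```python
GAMMA = 0.9
TERMINAL = 2
STATES = [0, 1]                  # non-terminal states
ACTIONS = ["stay", "go"]
# Deterministic toy dynamics: (state, action) -> (next_state, reward)
TRANSITIONS = {
    (0, "stay"): (0, 0.0),
    (0, "go"): (1, 0.0),
    (1, "stay"): (1, 0.0),
    (1, "go"): (TERMINAL, 1.0),
}


def action_value(v, state, action):
    """One-step lookahead: reward plus discounted value of the next state."""
    next_state, reward = TRANSITIONS[(state, action)]
    return reward + GAMMA * v[next_state]


def evaluate_policy(policy, sweeps=200):
    """Policy evaluation: repeatedly back up values under the fixed policy."""
    v = {0: 0.0, 1: 0.0, TERMINAL: 0.0}
    for _ in range(sweeps):
        for s in STATES:
            v[s] = action_value(v, s, policy[s])
    return v


def improve_policy(v):
    """Policy improvement: act greedily with respect to v."""
    return {s: max(ACTIONS, key=lambda a: action_value(v, s, a)) for s in STATES}


def policy_iteration():
    policy = {s: "stay" for s in STATES}     # arbitrary starting policy
    while True:
        v = evaluate_policy(policy)
        new_policy = improve_policy(v)
        if new_policy == policy:             # stable policy: stop
            return policy, v
        policy = new_policy


policy, values = policy_iteration()
print(policy, values)
```

The loop stops once an improvement step leaves the policy unchanged.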
Value-based methods¶
Have an implicit policy based on choosing the action that maximises the value function.
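In code, such an implicit policy is little more than an argmax over estimated action values. The sketch below assumes the values for the current state are held in a plain dictionary; the function name and the numbers are made up for illustration.

```python
def greedy_action(q_values):
    """Implicit policy: pick the action with the highest estimated value."""
    return max(q_values, key=q_values.get)


# Hypothetical action-value estimates for a single state.
q_values = {"left": 0.2, "right": 1.3, "stay": -0.5}
action = greedy_action(q_values)
print(action)
```

No separate policy object is stored: changing the value estimates changes the behaviour.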