Action space
The set of all possible actions. It may be discrete, as in Chess, or continuous, as in many robotics tasks.
Behaviour distribution
The probability distribution over sequences of state-action pairs that describes the behaviour of an agent.
Bellman equation
Computes the value of a state given a policy. Represents the intuition that if the value at the next timestep is known for all possible actions, the optimal strategy is to select the action that maximizes that value plus the immediate reward.
$$Q(s,a) = r + \gamma \max_{a'} Q(s',a')$$
Where $r$ is the immediate reward, $\gamma$ is the discount rate, $s$ and $s'$ are the current and next states, and $a$ and $a'$ are the actions taken in them. $Q(s,a)$ is the value function for executing action $a$ in state $s$.
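As a rough illustration, the backup r + γ max Q(s', a') can be applied to a tabular value function. The table `q`, the state and action names, and the numbers below are made up for the example, and terminal-state handling is left out.

```python
# Hypothetical tabular Q-values: q[state][action] -> estimated value.
q = {
    "s1": {"left": 0.0, "right": 0.0},
    "s2": {"left": 1.0, "right": 3.0},
}


def bellman_target(q, reward, next_state, gamma):
    """Return r + gamma * max over a' of Q(s', a') for one observed transition."""
    best_next_value = max(q[next_state].values())
    return reward + gamma * best_next_value


# Assume taking "right" in s1 yields reward 1.0 and leads to s2.
q["s1"]["right"] = bellman_target(q, reward=1.0, next_state="s2", gamma=0.9)
```

The updated entry combines the immediate reward with the discounted value of the best action available in the next state.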
Breadth
In the context of games with a discrete action space like Chess and Go, breadth is the average number of possible moves per position.
Credit assignment problem
The problem of not knowing which actions helped and which hindered in getting to a particular reward.
Depth
In the context of games, depth is the average length of a game.
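Breadth and depth together give a rough sense of how large a game tree is: with breadth b and depth d there are on the order of b^d possible move sequences. The figures below are commonly quoted approximations for Chess and Go, used here as assumptions rather than exact values.

```python
import math


def log10_game_tree_size(breadth, depth):
    """Approximate log10 of breadth ** depth, the number of move sequences."""
    return depth * math.log10(breadth)


# Commonly quoted approximate (breadth, depth) figures; assumptions for illustration.
games = {"Chess": (35, 80), "Go": (250, 150)}

for name, (breadth, depth) in games.items():
    exponent = log10_game_tree_size(breadth, depth)
    print(f"{name}: roughly 10^{exponent:.0f} move sequences")
```

Working in log space avoids printing very long integers and makes the two games easy to compare.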
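Discount factor
A value between 0 and 1 that weights future rewards. Values closer to 0 make the agent concentrate on short-term rewards, while values closer to 1 give long-term rewards more weight.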
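The effect of the discount factor can be seen by computing the discounted sum of rewards for the same episode under two settings. This is a minimal sketch; the reward sequence and the two values of `gamma` are made up for the example.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over the reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


# Hypothetical episode: nothing for four steps, then a reward of 10.
rewards = [0.0, 0.0, 0.0, 0.0, 10.0]

short_sighted = discounted_return(rewards, gamma=0.1)  # late reward almost ignored
far_sighted = discounted_return(rewards, gamma=0.99)   # late reward nearly at full value
```

With the small discount factor the delayed reward contributes almost nothing, so an agent optimizing it would prefer rewards that arrive early.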
Episode
Analogous to a single play of a game. Ends when a terminal state is reached or after a predetermined number of steps.
Markov Decision Process (MDP)
Models the environment using Markov chains, extended with actions and rewards.
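One way to picture an MDP is as a table of transition probabilities and rewards indexed by state-action pairs. The class below is a hypothetical sketch, not a standard API; the state names, action names, and numbers are invented for the example.

```python
import random


class MDP:
    """Minimal tabular MDP (hypothetical helper for illustration)."""

    def __init__(self, transitions, rewards):
        # transitions[(state, action)] -> list of (next_state, probability)
        self.transitions = transitions
        # rewards[(state, action)] -> immediate reward
        self.rewards = rewards

    def step(self, state, action, rng=random):
        """Sample a next state and return it with the immediate reward."""
        next_states, probs = zip(*self.transitions[(state, action)])
        next_state = rng.choices(next_states, weights=probs, k=1)[0]
        return next_state, self.rewards[(state, action)]


mdp = MDP(
    transitions={
        ("cool", "slow"): [("cool", 1.0)],
        ("cool", "fast"): [("cool", 0.5), ("hot", 0.5)],
        ("hot", "slow"): [("cool", 0.5), ("hot", 0.5)],
        ("hot", "fast"): [("hot", 1.0)],
    },
    rewards={
        ("cool", "slow"): 1.0,
        ("cool", "fast"): 2.0,
        ("hot", "slow"): 1.0,
        ("hot", "fast"): -10.0,
    },
)
next_state, reward = mdp.step("cool", "fast")
```

The next state depends only on the current state and action, which is the Markov property the model relies on.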
Partially Observable Markov Decision Process (POMDP)
Generalization of the MDP. The agent cannot directly observe the underlying state.
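Partial observability can be illustrated with an observation function that hides part of the state. The state layout (position, velocity) and the function `observe` are assumptions made up for this sketch.

```python
def observe(state):
    """Hypothetical observation function: the agent sees position but not velocity."""
    position, velocity = state
    return position


# Two different underlying states produce the same observation,
# so the agent cannot tell them apart from a single observation.
moving_right = (3, +1)
moving_left = (3, -1)
indistinguishable = observe(moving_right) == observe(moving_left)
```

In a fully observable MDP the agent would receive the whole tuple instead.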
Policy
A function that maps states to actions.
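In the simplest case a policy is a lookup table from states to actions. The sketch below also shows a stochastic variant, which maps each state to a distribution over actions; that variant goes beyond the definition above and is included as an assumption. All state and action names are hypothetical.

```python
import random

# Deterministic policy: a lookup table from state to action.
policy_table = {"cool": "fast", "hot": "slow"}


def deterministic_policy(state):
    return policy_table[state]


# Stochastic policy: each state maps to a probability distribution over actions.
action_probs = {
    "cool": {"slow": 0.2, "fast": 0.8},
    "hot": {"slow": 0.9, "fast": 0.1},
}


def stochastic_policy(state, rng=random):
    actions, probs = zip(*action_probs[state].items())
    return rng.choices(actions, weights=probs, k=1)[0]
```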
Regret
The difference in cumulative reward between performing optimally and executing the given policy.
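Regret can be computed directly when the rewards of an optimal agent are known, which in practice is usually only the case in toy problems. The reward sequences below are invented for the example.

```python
def regret(optimal_rewards, policy_rewards):
    """Cumulative reward of acting optimally minus that of the given policy."""
    return sum(optimal_rewards) - sum(policy_rewards)


# Hypothetical per-step rewards over the same five steps.
optimal_rewards = [1.0, 1.0, 1.0, 1.0, 1.0]
policy_rewards = [0.0, 1.0, 0.0, 1.0, 1.0]
total_regret = regret(optimal_rewards, policy_rewards)
```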
Reward function
Maps state-action pairs to rewards.
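REINFORCE
A simple policy learning algorithm.
If a policy $\pi_\theta$ executes action $a$ in state $s$ with some corresponding value $v$, the update rule is:
$$\Delta\theta = \alpha \nabla_\theta \log \pi_\theta(a|s) \, v$$
Where $\nabla_\theta$ means the derivative with respect to $\theta$ and $\alpha$ is the learning rate.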
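A minimal sketch of one such policy-gradient step, assuming a softmax policy with one parameter (logit) per action and no dependence on the state. The function names, the learning rate, and the numbers are assumptions for illustration. For a softmax over logits, the gradient of the log-probability of the chosen action is the one-hot vector for that action minus the action probabilities.

```python
import math


def softmax(logits):
    """Convert logits to probabilities, shifting by the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_update(theta, action, value, learning_rate):
    """Return theta + learning_rate * value * grad of log pi_theta(action)."""
    probs = softmax(theta)
    return [
        t + learning_rate * value * ((1.0 if i == action else 0.0) - p)
        for i, (t, p) in enumerate(zip(theta, probs))
    ]


theta = [0.0, 0.0]  # two actions, initially equally likely
theta = reinforce_update(theta, action=1, value=2.0, learning_rate=0.1)
```

A positive value raises the probability of the action that was taken, and a negative value lowers it.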
Terminal state
A state that ends the episode when reached. No further actions by the agent are possible.
Trajectory
The sequence of states and actions experienced by the agent.
Transition function
Maps a state and an action to a new state.
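The sketch below ties several of these terms together: a deterministic transition function on a short line of positions, a terminal state, and a rollout that records the trajectory of one episode. The environment, the names, and the step limit are all hypothetical.

```python
TERMINAL_STATE = 3


def transition(state, action):
    """Hypothetical deterministic transition on positions 0..3."""
    step = 1 if action == "right" else -1
    return min(3, max(0, state + step))


def rollout(start_state, policy, max_steps=10):
    """Run one episode and return its trajectory as (state, action) pairs."""
    trajectory, state = [], start_state
    for _ in range(max_steps):
        if state == TERMINAL_STATE:
            break  # no further actions are possible
        action = policy(state)
        trajectory.append((state, action))
        state = transition(state, action)
    return trajectory, state


trajectory, final_state = rollout(0, policy=lambda state: "right")
```

The episode ends either at the terminal state or after `max_steps` actions, whichever comes first.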
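Value function
The value of a state is the expected cumulative reward obtained from that state onwards when following the policy: $V^\pi(s) = \mathbb{E}[R \mid s, \pi]$, where $R$ is the cumulative (discounted) reward.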
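For a small environment the value of every state under a fixed policy can be computed by iterative policy evaluation, a standard technique that is not described above. The sketch assumes a deterministic environment, so the expectation reduces to a single outcome per state; the chain, rewards, and parameter names are made up.

```python
def evaluate_policy(states, policy, transition, reward, gamma, terminal, sweeps=100):
    """Iterative policy evaluation for a deterministic environment."""
    values = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:
            if s == terminal:
                continue  # no further reward after the terminal state
            a = policy(s)
            values[s] = reward(s, a) + gamma * values[transition(s, a)]
    return values


# Hypothetical chain 0 -> 1 -> 2, where 2 is terminal and reaching it pays 1.
values = evaluate_policy(
    states=[0, 1, 2],
    policy=lambda s: "right",
    transition=lambda s, a: s + 1,
    reward=lambda s, a: 1.0 if s + 1 == 2 else 0.0,
    gamma=0.9,
    terminal=2,
)
```

States further from the rewarding transition end up with lower values because the reward is discounted once per step.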