**Reward**
A positive or negative outcome that the agent receives as a consequence of its actions. In reinforcement learning, rewards define the goal of the agent.

**Agent**
The entity that interacts with the environment by perceiving states and taking actions based on them. The agent's goal is to learn a policy that maximizes the total reward it receives from the environment.

**Environment**
The external system in which the agent operates. The environment is sometimes called the world or the game, and it's the source of the agent's inputs and the receiver of its actions. It is defined by a set of states, actions, and rewards, as well as rules that determine how these interact.

**Policy**
A mapping from the states of the environment to the actions that the agent should take in those states. The policy is the main output of reinforcement learning, and the goal of the agent is to learn a policy that maximizes the total reward it receives from the environment.

**Q-learning**
A model-free reinforcement learning algorithm that estimates the value of each action in each state and chooses the action with the highest estimated value. Q-learning is an off-policy algorithm, which means it learns the value of the optimal policy even while following a different (for example, more exploratory) behavior policy.

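As a rough illustration (not part of the original glossary), a tabular Q-learning loop might look like the following Python sketch; the `env` object with `reset()` and `step()` methods, the state and action counts, and the hyperparameters are all assumed for the example:

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning sketch: learn action values Q[s, a] from interaction."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy behavior policy (see Exploration / Exploitation).
            if np.random.rand() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(action)
            # Off-policy update: bootstrap from the greedy value of the next state,
            # regardless of which action the behavior policy will actually take next.
            target = reward + gamma * np.max(Q[next_state]) * (not done)
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
    return Q
```

The update toward `reward + gamma * max(Q[next_state])` is what makes the algorithm off-policy: the target uses the greedy action even when the agent explored.
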
**Value**
The expected total reward that the agent can accumulate from a given state or action. The value of a state is the expected total reward starting from that state and following the current policy; the value of an action (its Q-value) is the expected total reward obtained by taking that action in the current state and then following the current policy.

**Exploration**
The strategy of choosing actions that are not necessarily the best ones according to the agent's current policy. Exploration is essential in reinforcement learning because it enables the agent to discover new states and actions that could potentially lead to higher reward in the future.

**Exploitation**
The strategy of choosing actions that are the best ones according to the agent's current policy. Exploitation is essential in reinforcement learning because it enables the agent to take advantage of the knowledge it has already acquired about the environment.

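Exploration and exploitation are usually balanced explicitly. A minimal sketch of the common epsilon-greedy rule, assuming a tabular array `Q` of action-value estimates (this example is illustrative, not from the original text):

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon=0.1):
    """With probability epsilon take a random action (exploration);
    otherwise take the action with the highest estimated value (exploitation)."""
    n_actions = Q.shape[1]
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)   # explore
    return int(np.argmax(Q[state]))           # exploit
```
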
**Markov property**
A property of a process in which the future state depends only on the present state and not on any previous states. Markov decision processes are the basis of many reinforcement learning algorithms because the Markov property lets the agent reason about the future, and update its estimate of each state's value from the rewards it receives, without having to remember the full history of past states.

**State**
A configuration of the environment that the agent can perceive. The state encodes all the relevant information about the current situation, including the agent's position, the objects in the environment, the goal of the task, etc. The agent's goal is to learn a policy that maps states to actions in a way that maximizes the total reward it receives.

**Action**
A decision that the agent can make based on the current state. Actions can be discrete or continuous, and they determine how the agent interacts with the environment. The agent's goal is to learn a policy that maps states to actions in a way that maximizes the total reward it receives.

**Bellman equation**
An equation that expresses the value of a state or action recursively, in terms of the values of its successor states. The Bellman equation is the basis of many reinforcement learning algorithms because it enables the agent to update its estimate of each state's value from the rewards it receives and the values of the states that follow. For the state-value function under a deterministic policy π it can be written as V(s) = R(s) + γ * Σ_s' P(s, π(s), s') * V(s'), where V(s) is the value of state s, R(s) is the reward obtained in state s, γ is the discount factor that determines the importance of future rewards, π(s) is the action the policy selects in state s, and P(s, a, s') is the probability of transitioning from state s to state s' when taking action a.

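As a sketch of how the equation is used in practice (assuming the transition probabilities `P` and rewards `R` are known, which is the model-based setting), iterative policy evaluation repeatedly applies the Bellman equation until the values stop changing:

```python
import numpy as np

def evaluate_policy(P, R, policy, gamma=0.9, tol=1e-8):
    """Iterative policy evaluation: repeatedly apply the Bellman equation
    V(s) = R(s) + gamma * sum_{s'} P[s, policy[s], s'] * V(s')
    until the values converge.

    P      : transition probabilities, shape (n_states, n_actions, n_states)
    R      : reward received in each state, shape (n_states,)
    policy : deterministic policy, shape (n_states,) of action indices
    """
    n_states = R.shape[0]
    V = np.zeros(n_states)
    while True:
        V_new = np.array([R[s] + gamma * P[s, policy[s]] @ V
                          for s in range(n_states)])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```
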
**Discount factor**
A factor that determines the importance of future rewards relative to immediate rewards. The discount factor γ is used in the Bellman equation to balance short-term and long-term rewards. A discount factor of 0 means that only the immediate reward counts, while a discount factor of 1 means that all future rewards are just as important as the current one. The choice of the discount factor depends on the nature of the environment and the goals of the agent.

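For illustration (an example added here, not from the original text), the discounted return of a reward sequence can be computed directly; the two calls show the extremes described above:

```python
def discounted_return(rewards, gamma=0.99):
    """Total reward with each step weighted by the discount factor:
    G = r_0 + gamma * r_1 + gamma**2 * r_2 + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# gamma = 0: only the immediate reward counts; gamma = 1: all rewards count equally.
print(discounted_return([1, 1, 1], gamma=0.0))  # 1.0
print(discounted_return([1, 1, 1], gamma=1.0))  # 3.0
```
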
**Model-free**
A type of reinforcement learning algorithm that does not build an explicit model of the environment. Instead, model-free algorithms learn from experience by estimating the value function or the policy based on the observed rewards, without making any assumptions about the underlying dynamics of the environment. Model-free algorithms are more flexible and general than model-based algorithms, but they may require more training samples and be less efficient in exploiting the structure of the environment.

**Model-based**
A type of reinforcement learning algorithm that builds an explicit model of the environment. Model-based algorithms learn the transition probabilities and rewards of the environment by interacting with it and then use these models to plan the optimal policy. Model-based algorithms are more efficient in exploiting the structure of the environment than model-free algorithms, but they may require more effort to construct and be less generalizable to other environments.

**Monte Carlo**
A method of estimating the value function or the policy by averaging the returns obtained over many episodes of the agent's interaction with the environment. Monte Carlo methods are model-free and make no assumptions about the underlying dynamics of the environment, but they require episodic tasks, since a return can only be computed once an episode terminates. They learn more slowly than bootstrapping methods because updates happen only at the end of each episode, but their return estimates are unbiased and they handle non-linear reward functions and stochastic environments well.

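A minimal sketch of first-visit Monte Carlo value estimation (an assumed example; each completed episode is taken to be a list of `(state, reward)` pairs):

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.99):
    """First-visit Monte Carlo value estimation from completed episodes.
    Each episode is a list of (state, reward) pairs."""
    returns = defaultdict(list)
    for episode in episodes:
        G = 0.0
        # Walk backwards so G accumulates the discounted return from each step on.
        for t in reversed(range(len(episode))):
            state, reward = episode[t]
            G = reward + gamma * G
            # Record G only at the first visit of this state within the episode.
            if state not in (s for s, _ in episode[:t]):
                returns[state].append(G)
    # The value estimate is the average return observed after each first visit.
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```
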
**Temporal difference (TD)**
A learning method that estimates the value function by updating the value of each state or action toward the estimated value of its successor (bootstrapping). Temporal difference methods are model-free, do not require complete episodes of interaction with the environment, and can handle both episodic and continuing tasks. TD methods typically learn faster and from fewer samples than Monte Carlo methods, but because they bootstrap from their own estimates they introduce bias and can be sensitive to the quality of the initial value estimates.

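For comparison with the Monte Carlo sketch above, a single TD(0) update needs only one observed transition, not a whole episode (the tabular value array `V` is assumed for this illustration):

```python
def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.99):
    """One TD(0) step: move V[state] toward the bootstrapped target
    reward + gamma * V[next_state] instead of waiting for the full return."""
    td_error = reward + gamma * V[next_state] - V[state]
    V[state] += alpha * td_error
    return td_error
```
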
**Eligibility trace**
A record of the states and actions that the agent has recently visited. Eligibility traces are used in some reinforcement learning algorithms to update the value function or the policy based on more than just the most recent transition: each past state or action carries a weight based on how recently and how frequently it was visited, and each new TD error updates all of them in proportion to those weights. Eligibility traces help the agent learn from delayed rewards and long-term dependencies that one-step TD methods propagate only slowly.

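A sketch of TD(λ) with accumulating eligibility traces, assuming a tabular float value array `V`, an `env` with `reset()`/`step()`, and a `policy` function (all illustrative assumptions); each TD error is spread over every recently visited state:

```python
import numpy as np

def td_lambda_episode(env, policy, V, alpha=0.1, gamma=0.99, lam=0.9):
    """One episode of TD(lambda) with accumulating eligibility traces:
    every state keeps a decaying trace, so each TD error also updates
    states visited earlier in the episode."""
    traces = np.zeros_like(V)
    state = env.reset()
    done = False
    while not done:
        next_state, reward, done = env.step(policy(state))
        td_error = reward + gamma * V[next_state] * (not done) - V[state]
        traces[state] += 1.0             # mark the current state as eligible
        V += alpha * td_error * traces   # credit all recently visited states
        traces *= gamma * lam            # decay every trace by one step
        state = next_state
    return V
```

Setting `lam = 0` recovers the one-step TD(0) update, while `lam = 1` behaves like a Monte Carlo update spread over the episode.
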
**Gradient methods**
Methods that update the parameters (weights) of the value function or the policy by following the gradient of an objective, such as the prediction error or the expected return, with respect to those parameters. Gradient methods are model-free and can handle both discrete and continuous actions. They are more computationally expensive than tabular methods, but they scale to large or continuous state spaces where a table of values is infeasible. Gradient methods are widely used in deep reinforcement learning, where the value function or the policy is represented as a neural network.

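As one concrete example (an assumed illustration, not from the original text), a semi-gradient TD(0) update for a linear value function V(s) = w · x(s) adjusts the weights along the gradient of V, scaled by the TD error:

```python
import numpy as np

def semi_gradient_td0(w, x, reward, x_next, alpha=0.01, gamma=0.99):
    """One semi-gradient TD(0) update for a linear value function V(s) = w . x(s).
    For a linear function the gradient of V with respect to w is just the
    feature vector x, so the weights move along x scaled by the TD error."""
    td_error = reward + gamma * x_next @ w - x @ w
    w += alpha * td_error * x
    return w
```
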
**Policy gradient**
A family of methods that represent the policy directly with adjustable parameters and improve it by following the gradient of the expected return with respect to those parameters. Policy gradient methods are model-free, can handle both discrete and continuous actions, and can learn stochastic policies. They are often less sample-efficient and noisier than value-based methods, but they cope more naturally with high-dimensional or continuous action spaces. Policy gradient methods are widely used in deep reinforcement learning, where the policy is represented as a neural network.

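A minimal policy gradient sketch in the REINFORCE style, assuming a softmax policy over linear action preferences and episodes recorded as `(state_features, action, reward)` tuples (these representations are assumptions for the example):

```python
import numpy as np

def softmax_policy(theta, x):
    """Action probabilities from linear action preferences theta (n_actions x n_features)."""
    prefs = theta @ x
    prefs -= prefs.max()                 # subtract max for numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def reinforce_update(theta, episode, alpha=0.01, gamma=0.99):
    """REINFORCE: push up the log-probability of each taken action in
    proportion to the discounted return that followed it."""
    G = 0.0
    for t in reversed(range(len(episode))):
        x, a, r = episode[t]
        G = r + gamma * G                         # return from step t onward
        probs = softmax_policy(theta, x)
        grad_log = -np.outer(probs, x)            # d log pi(a|s) / d theta ...
        grad_log[a] += x                          # ... for a softmax-linear policy
        theta += alpha * (gamma ** t) * G * grad_log
    return theta
```
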
**Deep reinforcement learning**
A type of reinforcement learning that uses deep neural networks to represent the value function, the policy, or both. Deep reinforcement learning algorithms can handle high-dimensional state spaces and can learn to perform complex tasks that are difficult to specify or solve by hand. They are generally more powerful than traditional reinforcement learning algorithms, but they may require many more training samples and be harder to interpret and debug. Deep reinforcement learning has been successfully applied to a wide range of problems, including game playing, control, optimization, and robotics.