Reinforcement Learning
What Is Reinforcement Learning?
Reinforcement learning is a branch of machine learning in which an agent learns to make decisions by interacting with an environment and receiving scalar feedback in the form of rewards or penalties. Rather than learning from labeled examples as in supervised learning, the agent observes a state, takes an action, transitions to a new state, and receives a reward signal. The objective is to discover a policy, a mapping from states to actions, that maximizes cumulative reward over time. The foundational textbook by Richard Sutton and Andrew Barto, "Reinforcement Learning: An Introduction", defines reinforcement learning as the problem of an agent learning from the consequences of its own actions, framed formally as a Markov decision process.
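The observe, act, and receive-reward cycle described above can be sketched in a few lines of Python. Everything here is an illustrative assumption rather than part of any library: `CoinFlipEnv` is a hypothetical toy environment, the `reset`/`step` method names are a common convention, and the discount factor `gamma` is one usual way of defining cumulative reward.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess the outcome of a biased coin."""

    def __init__(self, p_heads=0.8, horizon=10, seed=0):
        self.p_heads = p_heads
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # the state is simply the time step in this toy example

    def step(self, action):
        # action: 0 = guess tails, 1 = guess heads
        outcome = 1 if self.rng.random() < self.p_heads else 0
        reward = 1.0 if action == outcome else -1.0
        self.t += 1
        done = self.t >= self.horizon
        return self.t, reward, done


def run_episode(env, policy, gamma=1.0):
    """Run one episode and return the (discounted) sum of rewards."""
    state = env.reset()
    total, discount, done = 0.0, 1.0, False
    while not done:
        action = policy(state)                  # policy maps state -> action
        state, reward, done = env.step(action)  # environment responds
        total += discount * reward
        discount *= gamma
    return total


def always_heads(state):
    return 1


def always_tails(state):
    return 0


ret_heads = run_episode(CoinFlipEnv(seed=0), always_heads)
ret_tails = run_episode(CoinFlipEnv(seed=0), always_tails)
```

With the same seed both policies face the same coin flips, so their returns are exact negatives of each other. A learning algorithm would search for the policy with the higher expected return.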
The Markov decision process provides the mathematical structure. It consists of a state space, an action space, a transition function specifying the probability of reaching a new state given a current state and action, and a reward function. The qualifier "Markov" means that the transition probability depends only on the current state and action, not on the history of previous states and actions. Barto, Sutton, and Anderson published early work applying these ideas to a learning control problem in IEEE Transactions on Systems, Man, and Cybernetics in 1983, situating the field within control theory and adaptive behavior.
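As a rough illustration, a small finite Markov decision process can be written down as plain tables. The `MDP` class, its field names, and the two-state "cool"/"hot" example below are hypothetical choices made for this sketch, not a standard API.

```python
import random
from dataclasses import dataclass


@dataclass
class MDP:
    states: list
    actions: list
    # transitions[(s, a)] maps each next state to its probability
    transitions: dict
    # rewards[(s, a, s_next)] is a scalar; missing entries are treated as 0
    rewards: dict

    def validate(self):
        # every transition distribution must sum to one
        for (s, a), dist in self.transitions.items():
            assert abs(sum(dist.values()) - 1.0) < 1e-9, (s, a)

    def sample_next(self, s, a, rng):
        # Markov property: the distribution depends only on (s, a), not on history
        dist = self.transitions[(s, a)]
        next_states = list(dist)
        weights = [dist[n] for n in next_states]
        s_next = rng.choices(next_states, weights=weights)[0]
        return s_next, self.rewards.get((s, a, s_next), 0.0)


mdp = MDP(
    states=["cool", "hot"],
    actions=["slow", "fast"],
    transitions={
        ("cool", "slow"): {"cool": 1.0},
        ("cool", "fast"): {"cool": 0.5, "hot": 0.5},
        ("hot", "slow"): {"cool": 0.5, "hot": 0.5},
        ("hot", "fast"): {"hot": 1.0},
    },
    rewards={
        ("cool", "slow", "cool"): 1.0,
        ("cool", "fast", "cool"): 2.0,
        ("cool", "fast", "hot"): 2.0,
        ("hot", "slow", "cool"): 1.0,
        ("hot", "slow", "hot"): 1.0,
        ("hot", "fast", "hot"): -10.0,
    },
)
mdp.validate()
s_next, r = mdp.sample_next("cool", "slow", random.Random(0))
```

Because `sample_next` takes only the current state and action as input, the representation cannot express history-dependent dynamics, which is exactly the restriction the Markov assumption imposes.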
Core Algorithms
Two main algorithmic families have defined the development of reinforcement learning. Value-based methods learn a value function that estimates how much cumulative reward the agent can expect from a given state or state-action pair. Q-learning, introduced by Christopher Watkins in 1989, is the canonical value-based algorithm: it updates estimates of the Q-function (the value of taking a specific action in a specific state) using a temporal-difference rule that bootstraps from the highest estimated Q-value in the next state. In the tabular setting, provided every state-action pair continues to be visited and the learning rate decays appropriately, Q-learning converges to the optimal Q-function without requiring a model of the environment.
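A minimal sketch of tabular Q-learning follows. The five-state chain environment, the epsilon-greedy exploration scheme, and all hyperparameter values are assumptions chosen to keep the example small; only the update rule itself reflects the algorithm described above.

```python
import random

N_STATES = 5       # states 0..4; state 4 is terminal (hypothetical toy chain)
ACTIONS = (0, 1)   # 0 = move left, 1 = move right


def step(state, action):
    """Deterministic chain: reward 1 only on reaching the rightmost state."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # epsilon-greedy behaviour policy, with ties broken at random
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.choice(ACTIONS)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = step(state, action)
            # bootstrap from the best next-state value (zero at the terminal state)
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q = q_learning()
# greedy action in each non-terminal state under the learned Q-table
greedy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
```

The update never consults the transition or reward tables directly; it uses only sampled transitions, which is what "model-free" means in this context.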
Policy gradient methods take a different approach, directly parameterizing and optimizing the policy rather than deriving it from a table of action values, although many variants also learn a value function to serve as a baseline or critic. The policy gradient theorem, derived by Sutton and colleagues, provides the gradient of expected cumulative reward with respect to the policy parameters, enabling gradient ascent. REINFORCE, actor-critic architectures, and proximal policy optimization (PPO) all belong to this family and differ in how they reduce the variance of the gradient estimate while controlling computational cost.
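To make the idea concrete, here is a sketch of a REINFORCE-style update with a running-average baseline on a two-armed bandit, the simplest setting with single-step episodes. The softmax parameterization, the bandit's reward probabilities, and the learning rate are assumptions made for illustration.

```python
import math
import random


def softmax(prefs):
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(mean_rewards=(0.2, 0.8), steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * len(mean_rewards)  # policy parameters (action preferences)
    baseline = 0.0                     # running average reward, reduces variance
    for t in range(1, steps + 1):
        probs = softmax(theta)
        action = rng.choices(range(len(theta)), weights=probs)[0]
        # Bernoulli reward whose success probability depends on the chosen arm
        reward = 1.0 if rng.random() < mean_rewards[action] else 0.0
        # d/d theta[k] of log pi(action) is 1[k == action] - probs[k]
        for k in range(len(theta)):
            grad_log = (1.0 if k == action else 0.0) - probs[k]
            theta[k] += lr * (reward - baseline) * grad_log  # gradient ascent
        baseline += (reward - baseline) / t
    return softmax(theta)


probs = reinforce_bandit()
```

Subtracting the baseline leaves the expected gradient unchanged while shrinking its variance, which is the same trade-off that actor-critic methods pursue with a learned critic.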
Deep Reinforcement Learning
The combination of reinforcement learning with deep neural networks as function approximators substantially expanded the range of problems the field can address. DeepMind's 2015 Deep Q-Network (DQN) demonstrated that a convolutional neural network could learn to play many Atari games at or above human level directly from pixel inputs, using experience replay and a target network to stabilize training. A widely cited survey on deep reinforcement learning by Arulkumaran et al. (arXiv:1708.05866) traces the subsequent rapid development of deep RL algorithms and their application to continuous control, strategy games, and robotics.
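The two stabilizing ingredients named above, experience replay and a target network, can be sketched without any deep learning library. In this simplified illustration `ReplayBuffer`, `td_targets`, and `frozen_q` are hypothetical names, and `frozen_q` is a plain function standing in for a neural network whose weights are held fixed between periodic synchronizations.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of past transitions, sampled uniformly at random."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng):
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def td_targets(batch, target_q, gamma=0.99):
    """Compute r + gamma * max_a' Q_target(s', a') for each transition.

    target_q is any callable mapping a state to a list of action values;
    here it plays the role of the frozen target network.
    """
    targets = []
    for _, _, reward, next_state, done in batch:
        bootstrap = 0.0 if done else gamma * max(target_q(next_state))
        targets.append(reward + bootstrap)
    return targets


def frozen_q(state):
    # hypothetical stand-in for a target network with two actions
    return [0.0, float(state)]


rng = random.Random(0)
buf = ReplayBuffer(capacity=3)
for i in range(5):
    buf.push(state=i, action=0, reward=1.0, next_state=i + 1, done=(i == 4))

batch = buf.sample(2, rng)
targets = td_targets(batch, frozen_q, gamma=0.5)
```

Sampling uniformly from the buffer breaks the correlation between consecutive transitions, and computing targets from a frozen copy keeps the regression target from shifting on every gradient step. A full implementation would also train the online network on these targets and periodically copy its weights into the target network, which this sketch omits.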
AlphaGo, developed by DeepMind and described in a 2016 Nature paper, combined deep neural networks with Monte Carlo tree search and reinforcement learning to play Go, a game previously considered too complex for computer programs to master at human level. The paper reports a 5-0 victory over European champion Fan Hui, and in March 2016 AlphaGo went on to defeat Lee Sedol, one of the world's strongest professional players. Li's 2017 survey on deep reinforcement learning (arXiv:1701.07274) provides a comprehensive technical overview of DQN, policy gradient methods, and their extensions, covering the principal algorithms and their convergence properties.
Applications
Reinforcement learning has applications in a wide range of disciplines, including:
- Game playing and strategy, from board games like Go and chess to real-time video games
- Robotics, where agents learn manipulation, locomotion, and dexterous grasping, often by training in simulation
- Autonomous vehicle control, including lane keeping, intersection negotiation, and path planning
- Resource allocation and scheduling in data centers and communication networks
- Drug discovery and molecular design, framing chemical synthesis as a sequential decision problem