Iterative Methods
What Are Iterative Methods?
Iterative methods are computational procedures that produce a sequence of successive approximations converging toward the solution of a mathematical problem. Rather than solving a system of equations or optimizing a function in a single closed-form step, an iterative method starts from an initial guess and applies a repeated update rule until the solution estimate satisfies a convergence criterion. They are indispensable in numerical linear algebra, nonlinear optimization, statistical estimation, and machine learning, where problem size or structure makes direct methods computationally infeasible.
The theoretical foundations of iterative methods span fixed-point theory, convex analysis, and numerical analysis. Their practical behavior depends on properties of the problem: the condition number of a linear system, the curvature of an objective function, or the completeness of the observed data all affect how quickly a given method converges and whether convergence is guaranteed at all.
Gradient-Based Methods
Gradient descent is the most widely used iterative optimization method. At each iteration, the current parameter estimate is updated by subtracting a scaled version of the gradient of the objective function, moving the estimate in the direction of steepest decrease. The step size, called the learning rate, controls the trade-off between convergence speed and stability. ScienceDirect's overview of the gradient descent method describes how stochastic gradient descent (SGD), which approximates the full gradient using a randomly selected subset of data at each step, is the standard training algorithm for large-scale neural networks because it reduces the per-iteration computational cost from linear to constant in the dataset size. Variants including Adam, RMSProp, and AdaGrad adapt the learning rate element-wise using running statistics of past gradients.
Newton's Method and Quasi-Newton Variants
Newton's method refines an initial estimate by using both the gradient and the Hessian matrix of the objective function, the matrix of second partial derivatives, to compute an update that accounts for local curvature. The quadratic convergence rate near the solution is much faster than the linear rate of gradient descent, but computing and inverting the Hessian exactly costs operations proportional to the cube of the parameter dimension, which is prohibitive for high-dimensional problems. ScienceDirect's overview of quasi-Newton methods explains how the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm and its limited-memory variant L-BFGS approximate the Hessian using information accumulated across iterations, achieving superlinear convergence at a cost comparable to first-order methods in terms of Hessian storage.
Conjugate Gradient Methods
The conjugate gradient (CG) method was developed to solve large sparse symmetric positive-definite systems of linear equations without forming or factoring the coefficient matrix. It constructs a sequence of search directions that are mutually conjugate with respect to the system matrix, guaranteeing convergence in at most n steps for an n-dimensional problem, and often converging in far fewer steps when the eigenvalues of the matrix are clustered. The CMU introduction to the conjugate gradient method presents the algorithm from first principles and explains how preconditioning, multiplying the system by an approximate inverse of the coefficient matrix, can dramatically reduce the effective condition number and accelerate convergence. Nonlinear conjugate gradient methods extend the idea to unconstrained optimization problems.
Expectation-Maximization Algorithm
The expectation-maximization (EM) algorithm solves maximum-likelihood estimation problems in which the observed data are incomplete, in the sense that the likelihood function is intractable to maximize directly. Each iteration consists of two steps: the E-step computes the expected value of the complete-data log-likelihood given the observed data and the current parameter estimate, and the M-step maximizes that expectation to produce a new parameter estimate. The Cornell Optimization Wiki's treatment of conjugate gradient methods provides context on the broader family of iterative methods within which EM sits. EM is guaranteed to increase the observed-data likelihood at every iteration and converges to a local maximum, making it the standard algorithm for fitting Gaussian mixture models, hidden Markov models, and latent Dirichlet allocation.
Applications
Iterative methods have applications across a wide range of computational and engineering disciplines, including:
- Training deep neural networks and large language models using variants of stochastic gradient descent
- Solving large sparse linear systems arising from finite-element discretizations in structural and fluid mechanics
- Statistical model fitting with missing or latent data using the EM algorithm
- Image reconstruction in computed tomography and MRI using iterative algebraic methods
- Solving PageRank and other eigenvalue problems on large-scale web and social graphs