Optimal Transport

What Is Optimal Transport?

Optimal transport is a branch of mathematics and applied statistics concerned with finding the most efficient way to move mass from one distribution to another, where efficiency is measured by a specified cost function. Originally formulated by Gaspard Monge in 1781 as the problem of moving soil from one location to another at minimum total effort, the theory was placed on rigorous mathematical ground by Leonid Kantorovich in 1942 through a linear programming relaxation. The field now belongs to the intersection of probability theory, functional analysis, and computational mathematics, with a growing role in machine learning, signal processing, and scientific computing.

The central object of the theory is the Wasserstein distance (also called the Earth Mover's Distance), which measures the minimum cost required to transport one probability distribution into another. Unlike the Kullback-Leibler divergence or total variation distance, the Wasserstein distance respects the geometry of the underlying space, making it sensitive to how far mass must travel rather than only whether probability mass is present at a given location. This geometric sensitivity gives optimal transport favorable properties in problems where distributions have non-overlapping support, as described in Carnegie Mellon's lecture notes on optimal transport and Wasserstein distance.

Mathematical Foundations

The Kantorovich formulation seeks a joint distribution (transport plan) over pairs of locations, whose marginals match the source and target distributions, that minimizes the expected transport cost. When the cost function is the squared Euclidean distance, the solution is the 2-Wasserstein distance, and under mild regularity conditions, the optimal transport plan is induced by a deterministic transport map, recovering Monge's original problem. Duality theory converts the linear programming problem into a maximizing dual problem over Lipschitz functions, a result that directly motivates Wasserstein generative adversarial networks (WGANs) in deep learning. The Brenier theorem establishes that, for squared-distance cost, the optimal transport map is the gradient of a convex function, linking optimal transport to convex analysis and Riemannian geometry.

Computational Methods

The Kantorovich problem with discrete distributions reduces to a linear program with O(n^2) variables, solvable in O(n^3 log n) time, which is computationally expensive for large-scale problems. The Sinkhorn algorithm, a regularized version that adds an entropy term to the transport cost, reduces computation to O(n^2 / epsilon^2) and is highly parallelizable on modern hardware, making it the dominant approach for machine learning applications. Sliced Wasserstein distance, which projects distributions onto one-dimensional subspaces and averages the resulting one-dimensional transport costs, further reduces complexity to O(n log n) per projection, at the cost of some geometric fidelity. The Python Optimal Transport (POT) library implements these and related algorithms and serves as a standard reference implementation for practitioners.

Applications in Signal Processing and Machine Learning

Optimal transport has become a tool for comparing and interpolating between probability distributions arising in diverse engineering contexts. In generative modeling, the Wasserstein distance provides a training objective for generative adversarial networks that is more stable than the Jensen-Shannon divergence. In domain adaptation, a transport map aligns a source dataset with a target dataset, enabling model transfer across measurement conditions. In natural language processing, the Word Mover's Distance uses optimal transport to measure semantic distance between documents. In acoustic measurements, transport-based metrics compare spectral distributions from microphone arrays or sonar systems, offering alternatives to point-wise spectral distance measures. A comprehensive treatment of these applications appears in Peyré and Cuturi's Computational Optimal Transport.

Applications

Optimal transport has applications in a wide range of fields, including:

Generative modeling in deep learning, as a training objective for GANs
Domain adaptation and transfer learning across different data distributions
Image processing and shape interpolation in computer vision
Acoustic measurements and spectral comparison in signal analysis
Computational biology, comparing cell-type distributions in single-cell data