Stereo Vision

What Is Stereo Vision?

Stereo vision is a computational technique that recovers the three-dimensional structure of a scene from two or more images taken from slightly different viewpoints. The approach is inspired by human binocular vision: the small horizontal offset between the left and right eyes causes nearby objects to appear displaced relative to distant ones, and the visual system uses this disparity to perceive depth. Machine stereo vision systems replicate this principle with calibrated cameras, computing per-pixel or region-level depth estimates from the geometric relationship between corresponding image features in the two views.

Stereo vision occupies a central position in computer vision because it provides metric depth information from passive optical sensors without requiring active illumination. This distinguishes it from time-of-flight and structured light sensors, which emit their own signals and are affected by ambient conditions and target reflectance. A calibrated stereo camera pair can recover absolute distances, making stereo vision useful wherever both rich visual texture and accurate depth are needed.

Stereo Image Processing and Calibration

Before depth can be computed, a stereo rig must be geometrically calibrated to establish the intrinsic parameters of each camera (focal length, principal point, lens distortion) and the extrinsic transformation (rotation and translation) between them. Calibration typically uses a known planar target such as a checkerboard imaged in multiple poses. After calibration, both images are rectified: they are undistorted and warped so that corresponding points lie on the same horizontal scan line, which reduces the correspondence search to a one-dimensional problem. OpenCV provides widely used open-source implementations of Zhang's calibration method and stereo rectification, described in its camera calibration documentation.
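The effect of rectification can be illustrated with a minimal sketch of an idealized rectified pair. It assumes two distortion-free pinhole cameras with identical intrinsics and parallel optical axes, offset only along the X axis. The function name project and all numeric values are hypothetical and chosen for illustration; this is not OpenCV's API.

```python
def project(point, focal_px, cx, cy, camera_x=0.0):
    """Project a 3D point (X, Y, Z), in metres, through an ideal pinhole camera.

    camera_x is the camera centre's offset along the X axis. In an ideal
    rectified pair the two cameras differ only by this offset (the baseline).
    """
    X, Y, Z = point
    if Z <= 0:
        raise ValueError("point must lie in front of the camera")
    u = focal_px * (X - camera_x) / Z + cx  # column (horizontal pixel coordinate)
    v = focal_px * Y / Z + cy               # row (vertical pixel coordinate)
    return u, v


# Hypothetical rig parameters, for illustration only.
FOCAL_PX = 700.0      # focal length in pixels
CX, CY = 320.0, 240.0  # principal point in pixels
BASELINE_M = 0.12     # separation between the two camera centres in metres

point = (0.5, -0.2, 4.0)  # a scene point 4 m in front of the left camera

left = project(point, FOCAL_PX, CX, CY, camera_x=0.0)
right = project(point, FOCAL_PX, CX, CY, camera_x=BASELINE_M)

# Same row in both images; only the column differs.
disparity = left[0] - right[0]
print("left:", left, "right:", right, "disparity:", disparity)
```

Both projections share the same row, so a matcher only has to search along that row, and the column difference equals focal length times baseline divided by depth (about 21 pixels for these values).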

Disparity Maps and Correspondence Estimation

The disparity map is the primary output of a stereo matching algorithm: a dense image where each pixel's value encodes how far, in pixels, the corresponding scene point shifts horizontally between the left and right images. Large disparities correspond to nearby objects; small disparities correspond to distant objects. Computing accurate disparity maps is the central computational challenge of stereo vision. Block matching algorithms compare small image patches around each pixel and select the shift that minimizes a photometric cost function, such as the sum of absolute differences. Semi-global matching, introduced by Hirschmüller, improves on local methods by aggregating matching costs along multiple one-dimensional paths through the image, with penalties for disparity changes between neighboring pixels. Stereo matching papers published in IEEE Transactions on Pattern Analysis and Machine Intelligence report benchmark evaluations that track progress on challenging datasets.
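Block matching can be sketched on a single rectified scan line. The example below is a simplified illustration, not a production matcher: it assumes integer disparities, a sum-of-absolute-differences cost, and a search that moves leftward in the right image. The names sad and match_scanline and the sample intensities are hypothetical.

```python
def sad(a, b):
    """Sum of absolute differences between two equal-length patches."""
    return sum(abs(x - y) for x, y in zip(a, b))


def match_scanline(left_row, right_row, half_window=1, max_disparity=8):
    """Return one integer disparity per pixel for a rectified scan line.

    For each left-image pixel, candidate patches in the right image are
    tested at shifts 0..max_disparity and the lowest-cost shift is kept.
    Border pixels without a full window keep disparity 0.
    """
    width = len(left_row)
    disparities = [0] * width
    for x in range(half_window, width - half_window):
        patch = left_row[x - half_window : x + half_window + 1]
        best_cost, best_d = None, 0
        for d in range(max_disparity + 1):
            xr = x - d  # a point at column x on the left appears at x - d on the right
            if xr - half_window < 0:
                break  # candidate window would fall off the image
            candidate = right_row[xr - half_window : xr + half_window + 1]
            cost = sad(patch, candidate)
            if best_cost is None or cost < best_cost:
                best_cost, best_d = cost, d
        disparities[x] = best_d
    return disparities


# Synthetic scan line: the left row is the right row shifted by 3 pixels.
right_row = [10, 50, 20, 90, 30, 70, 40, 80, 15, 60, 25, 95]
left_row = [0, 0, 0] + right_row[:-3]

disparities = match_scanline(left_row, right_row)
print(disparities[4:11])  # [3, 3, 3, 3, 3, 3, 3]
```

The synthetic row has distinct intensities, so the minimum is unambiguous. On real images, repeated or textureless patches produce ties and false minima, which is the weakness that path-based aggregation in semi-global matching is designed to reduce.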

Depth Estimation and 3D Reconstruction

Converting a disparity map to metric depth requires knowledge of the camera baseline (the physical separation between the two camera centers) and the focal length expressed in pixels. For a rectified pair, depth Z is related to disparity d by Z = f * B / d, where f is the focal length and B is the baseline. Depth is therefore inversely proportional to disparity: at fixed baseline and focal length, a pixel with twice the disparity is at half the distance. Depth estimates are assembled into point clouds or dense surface meshes by back-projecting each disparity measurement into 3D space. Multiple overlapping stereo views can be fused to produce complete 3D reconstructions of objects and scenes.
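The conversion from disparity to a 3D point can be sketched as follows, assuming a rectified pair with the focal length in pixels, the baseline in metres, and a known principal point (cx, cy). The function names and the numeric values are hypothetical; pixels with zero or negative disparity are treated as invalid and skipped.

```python
def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Depth in metres for a rectified pair: Z = f * B / d.

    Returns None for non-positive disparities, which carry no depth.
    """
    if disparity_px <= 0:
        return None
    return focal_px * baseline_m / disparity_px


def back_project(u, v, disparity_px, focal_px, baseline_m, cx, cy):
    """Back-project pixel (u, v) with its disparity to (X, Y, Z) in metres."""
    z = disparity_to_depth(disparity_px, focal_px, baseline_m)
    if z is None:
        return None
    x = (u - cx) * z / focal_px
    y = (v - cy) * z / focal_px
    return (x, y, z)


def disparity_map_to_points(disparity_map, focal_px, baseline_m, cx, cy):
    """Turn a 2D list of disparities (rows of columns) into a point list."""
    points = []
    for v, row in enumerate(disparity_map):
        for u, d in enumerate(row):
            p = back_project(u, v, d, focal_px, baseline_m, cx, cy)
            if p is not None:
                points.append(p)
    return points


# Hypothetical rig: 700 px focal length, 0.12 m baseline.
near = disparity_to_depth(42.0, 700.0, 0.12)  # about 2 m
far = disparity_to_depth(21.0, 700.0, 0.12)   # about 4 m: half the disparity, twice the depth
print(near, far)

cloud = disparity_map_to_points([[0.0, 21.0], [42.0, 0.0]], 700.0, 0.12, 320.0, 240.0)
print(len(cloud))  # 2 valid points; the zero-disparity pixels are skipped
```

Because depth varies as 1/d, a fixed disparity error of one pixel translates into a much larger depth error for distant points than for nearby ones, which is why a longer baseline or longer focal length helps at range.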

Deep Learning Approaches

Convolutional neural networks have substantially improved stereo disparity estimation over classical methods on benchmark datasets such as KITTI and Middlebury. End-to-end networks learn to extract features, compute matching costs, and regularize disparity maps jointly, avoiding the hand-crafted cost functions and aggregation heuristics of classical pipelines. Research on learned stereo matching published on arXiv has demonstrated sub-pixel accuracy on structured datasets, though robustness to lighting changes and textureless regions remains an active area of research.

Applications

Autonomous vehicles use stereo vision to estimate the distance to pedestrians, cyclists, and other vehicles ahead of the host car.
Industrial robots use stereo cameras for bin picking, where accurate 3D localization of randomly arranged parts is required for grasp planning.
Augmented reality headsets use stereo depth to composite virtual objects onto physical surfaces at correct apparent distances.
Aerial photogrammetry systems use stereo imagery from overlapping drone passes to produce high-resolution terrain maps.
Medical endoscopy systems use stereo imaging to provide surgeons with depth perception during minimally invasive procedures.
