Fault tolerance
What Is Fault Tolerance?
Fault tolerance is the ability of a system to continue operating correctly in the presence of one or more faults affecting its components. Rather than assuming that hardware and software will always perform as designed, fault-tolerant engineering anticipates failure and builds in the structural and algorithmic mechanisms needed to detect errors, contain their effects, and maintain acceptable service. Fault tolerance is indispensable wherever failures carry high safety, financial, or mission consequences, including aviation, spacecraft, medical devices, and data center infrastructure.
Architectural Redundancy
The fundamental technique of fault tolerance is redundancy: providing backup components that can substitute for failed ones. Spatial redundancy replicates hardware so that correct outputs can be produced even when some replicas fail. Triple modular redundancy (TMR) is the canonical form: three identical modules compute the same function in parallel and a majority voter selects the result agreed upon by at least two of the three. A single module failure therefore has no effect on the output because the two healthy modules outvote the faulty one.
NASA has relied on TMR and related schemes in spacecraft flight computers since the Apollo era, and the technique remains standard in avionics and nuclear plant control systems. N-modular redundancy generalizes TMR to higher levels, tolerating more simultaneous failures at the cost of proportionally greater hardware overhead.
Time redundancy re-executes computations and compares successive results, trading latency for reduced hardware cost. Information redundancy adds error-detecting and error-correcting codes to data stored or transmitted across the system.
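A majority voter with fault injection, added here as an illustration (not code from the page): the voter masks one disagreeing replica and reports which one it was, refuses to answer when no strict majority exists, and, as a caveat, shows that two replicas failing the same way (a common-mode fault) are voted through unnoticed. Function names and the XOR-mask fault model are assumptions of this example.

```python
from collections import Counter
from typing import Any, Callable, Dict, List, Optional, Sequence, Tuple

def nmr_vote(outputs: Sequence[Any]) -> Tuple[Any, List[int]]:
    """Majority-vote over N replica outputs (N-modular redundancy, N=3 is TMR).

    Returns (value, dissenters): the value backed by a strict majority and the
    indices of replicas that disagreed with it, so a fault can be logged even
    though it was masked. Raises when no strict majority exists, i.e. more
    faults are present than the voter can mask.
    """
    n = len(outputs)
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= n:
        raise RuntimeError(f"no majority among {n} replicas: {list(outputs)}")
    dissenters = [i for i, v in enumerate(outputs) if v != value]
    return value, dissenters

def run_tmr(func: Callable[[int], int], x: int,
            faults: Optional[Dict[int, int]] = None) -> Tuple[int, List[int]]:
    """Run three replicas of func on x and vote.

    `faults` maps replica index -> bit mask XORed into that replica's result,
    a constructed stand-in for a stuck or flipped output line.
    """
    faults = faults or {}
    outputs = [func(x) ^ faults.get(i, 0) for i in range(3)]
    return nmr_vote(outputs)

if __name__ == "__main__":
    square = lambda v: v * v
    print(run_tmr(square, 12))                 # healthy: (144, [])
    print(run_tmr(square, 12, {1: 0x8}))       # one fault masked: (144, [1])
    print(run_tmr(square, 12, {0: 0x8, 1: 0x8}))  # common-mode: wrong value wins
```

The last call is the failure mode that motivates dissimilar software on each channel: identical faults in two of three replicas outvote the healthy one.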
Byzantine Fault Tolerance
Classical majority voting assumes that faulty modules either fail silently or produce the same wrong answer. Byzantine faults are more general: a faulty component may produce arbitrary, inconsistent, or adversarially crafted outputs, potentially sending different values to different receivers. The Byzantine generals problem and its solutions established that a system with f Byzantine faulty nodes can reach consensus among correct nodes only if the total number of nodes is at least 3f+1. Byzantine fault-tolerant (BFT) protocols such as Practical BFT are used in distributed databases, blockchain networks, and safety-critical control systems that must withstand both accidental failures and deliberate attacks.
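The 3f+1 bound can be checked mechanically through quorum intersection: a correct node cannot wait for more than n-f replies (the f faulty nodes may never answer), and any two such quorums must share at least f+1 nodes so that at least one correct node is in both. The brute-force checker below is an added example, not part of the page, and enumerates all quorums for small n.

```python
from itertools import combinations

def quorum_size(n: int, f: int) -> int:
    """Largest reply set a node may safely wait for: f nodes may stay silent."""
    return n - f

def quorums_overlap_in_correct_node(n: int, f: int) -> bool:
    """True iff every pair of quorums of size n-f shares at least f+1 nodes.

    f+1 common members guarantee one correct node in the overlap whatever
    the f faulty nodes are, which is what lets two quorums agree on one value.
    Analytically two q-subsets of n share at least 2q-n nodes, so the condition
    is 2(n-f)-n >= f+1, i.e. n >= 3f+1; the loop verifies that exhaustively.
    """
    q = quorum_size(n, f)
    if q <= 0:
        return False
    quorums = [set(c) for c in combinations(range(n), q)]
    return all(len(a & b) >= f + 1 for a in quorums for b in quorums)

def min_nodes(f: int) -> int:
    """Smallest n satisfying the intersection condition for f Byzantine nodes."""
    n = f + 1
    while not quorums_overlap_in_correct_node(n, f):
        n += 1
    return n

if __name__ == "__main__":
    for f in range(3):
        print(f"f={f}: minimum nodes {min_nodes(f)} (3f+1 = {3 * f + 1})")
```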
Fault-Tolerant Control
Fault-tolerant control (FTC) keeps a dynamic system stable and meeting performance objectives after actuator or sensor failures. Active FTC systems include an online fault detection and identification module that triggers controller reconfiguration when a fault is detected. Passive FTC designs a single robust controller that maintains acceptable performance across an entire family of anticipated fault conditions without requiring explicit detection.
Applications range from aircraft flight control, where an actuator jam must not cause loss of control, to autonomous vehicles, where sensor failures must be compensated to maintain safe path following. IEEE Transactions on Control Systems Technology regularly publishes advances in FTC design and validation methodology.
Radiation Hardening
In space and high-radiation environments, ionizing particles can flip bits in memory, corrupt processor registers, or permanently damage transistor junctions. Radiation hardening addresses this through both design and process techniques. Radiation-hardened-by-design (RHBD) circuits add layout features such as guard rings and enlarged cell spacing to reduce charge collection. Radiation-hardened process technologies use insulating substrates (silicon-on-insulator) that limit the volume from which charge can be collected.
Software-level mitigation using error-correcting codes for memory and scrubbing routines that periodically rewrite registers complements hardware hardening. The combination allows spacecraft computers to maintain data integrity over mission lifetimes measured in years against accumulated radiation doses that would quickly corrupt commercial electronics.
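The error-correcting codes and scrubbing routine described above, as an added example that is not from the page: a Hamming(7,4) code with an overall parity bit (SECDED) corrects any single flipped bit, flags a double flip as uncorrectable, and a scrubbing pass rewrites corrected words before a second hit in the same word can accumulate. The 8-bit word layout and the list-of-words memory model are constructed for this example.

```python
from typing import List, Optional, Tuple

def hamming_encode(nibble: int) -> int:
    """Encode 4 data bits into an 8-bit SECDED word.

    Bit positions 1..7 form Hamming(7,4): parity at 1, 2, 4 and data at
    3, 5, 6, 7. Bit 0 is an overall parity bit over positions 1..7.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    bits = [0] * 8
    bits[3], bits[5], bits[6], bits[7] = d
    bits[1] = bits[3] ^ bits[5] ^ bits[7]
    bits[2] = bits[3] ^ bits[6] ^ bits[7]
    bits[4] = bits[5] ^ bits[6] ^ bits[7]
    bits[0] = sum(bits[1:]) & 1
    return sum(b << i for i, b in enumerate(bits))

def hamming_decode(word: int) -> Tuple[Optional[int], str]:
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    bits = [(word >> i) & 1 for i in range(8)]
    syndrome = 0                      # position of a single flipped bit in 1..7
    for p in (1, 2, 4):
        if sum(bits[i] for i in range(1, 8) if i & p) & 1:
            syndrome |= p
    overall_odd = sum(bits) & 1       # 1 when an odd number of bits flipped
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:                 # assume a single flip and correct it;
        bits[syndrome] ^= 1           # syndrome 0 means the parity bit itself
        status = "corrected"
    else:                             # even flips: detected, not correctable
        return None, "uncorrectable"
    nibble = bits[3] | bits[5] << 1 | bits[6] << 2 | bits[7] << 3
    return nibble, status

def scrub(memory: List[int]) -> dict:
    """One scrubbing pass: decode every word, rewrite corrected ones, tally results."""
    report = {"ok": 0, "corrected": 0, "uncorrectable": 0}
    for addr, word in enumerate(memory):
        nibble, status = hamming_decode(word)
        report[status] += 1
        if status == "corrected":
            memory[addr] = hamming_encode(nibble)   # clear the latent error
    return report
```

An uncorrectable word is the case a scrubber must escalate rather than silently rewrite, since the data it holds can no longer be trusted.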
Applications
- Commercial aircraft flight control computers use TMR with dissimilar software on each channel to tolerate both hardware failures and common-mode software bugs.
- Data centers implement RAID storage, redundant power supplies, and hot-standby servers to achieve high availability service-level agreements.
- Implantable medical devices such as pacemakers apply watchdog timers and ECC memory to maintain function despite soft errors.
- Blockchain networks use Byzantine fault-tolerant consensus protocols to agree on transaction ordering despite malicious or faulty participants.
- Autonomous underwater vehicles reconfigure thruster allocation in real time to maintain controllability after propulsion failures.
- High-energy physics detector readout electronics are radiation hardened to survive millions of rads of accumulated dose in collider environments.