When Code Goes Catastrophically Wrong


A tiny bug can sometimes lead to colossal consequences. After listening to too many technical podcasts in my spare time, I wanted to dive into three infamous cases where a few lines of code caused chaos, destruction, and in one case, an interplanetary facepalm.

We'll also see how they could have been solved, in a very simplified way!

1. The Rocket That Went Boom: Ariane 5 Flight 501

ESA - Ariane 501 explosion
Source: European Space Agency

Imagine spending a decade and $7 billion to build a rocket, only to watch it self-destruct 37 seconds after launch. That's exactly what happened to the European Space Agency's Ariane 5 rocket in 1996. The culprit? A little thing called integer overflow.

The rocket's inertial reference system tried to stuff a 64-bit floating-point number into a 16-bit signed integer. The value, which was related to the rocket's horizontal velocity, was too large to fit, and nothing handled the resulting conversion error. Oops! This caused the guidance system to go haywire, resulting in an un-commanded change of trajectory. In simpler terms, the rocket thought it was drunk and decided to break dance instead of flying straight.
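To see what "does not fit" means in practice, here is a minimal sketch. Python's own integers never overflow, so it borrows `struct.pack` with the `h` (signed 16-bit) format as a stand-in for a fixed-width conversion. The function name and the sample values are made up for illustration and have nothing to do with the real flight software.

```python
import struct


def to_int16(value: float) -> int:
    """Truncate a float and require the result to fit in a signed 16-bit integer."""
    as_int = int(value)  # truncates toward zero
    # struct.pack raises struct.error when the number is outside -32768..32767
    struct.pack("<h", as_int)
    return as_int


print(to_int16(1234.9))  # small value: converts fine

try:
    to_int16(40000.0)  # too large for 16 bits
except struct.error as exc:
    print(f"Conversion failed: {exc}")
```

If nothing catches that exception, the whole program stops, which is roughly the situation the guidance software found itself in.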

How could this have been prevented? By using proper exception handling and range checking. Here's a simplified example of what they should have done:

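import logging
import math

INT16_MIN, INT16_MAX = -32768, 32767  # Range of a 16-bit signed integer
FALLBACK_VELOCITY = 0  # Hypothetical safe default, for illustration only

def convert_velocity(velocity_64bit):
    try:
        # NaN and infinity can never be converted to an integer
        if not math.isfinite(velocity_64bit):
            raise ValueError("Velocity is not a finite number")
        velocity_16bit = int(velocity_64bit)
        # Check both ends of the range, not just the maximum
        if not INT16_MIN <= velocity_16bit <= INT16_MAX:
            raise ValueError("Velocity out of range for 16-bit conversion")
        return velocity_16bit
    except ValueError as e:
        logging.error("Conversion error: %s", e)
        return FALLBACK_VELOCITY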

Simplified code in Python for easier understanding


The Ariane 5 used Ada, a language designed for embedded and real-time systems. Here's an Ada-like representation of the problem and a potential fix:

-- Original problematic code (simplified)
procedure Convert_Velocity is
   Velocity_64 : Long_Float;
   Velocity_16 : Integer_16;  -- 16-bit integer type
begin
   Velocity_64 := Get_Velocity;  -- This function returns a 64-bit float
   Velocity_16 := Integer_16(Velocity_64);  -- This conversion could cause overflow
end Convert_Velocity;

-- Improved version with error handling
procedure Convert_Velocity (Success : out Boolean) is
   Velocity_64 : Long_Float;
   Velocity_16 : Integer_16;
   Max_16 : constant := 32767;  -- Maximum value for 16-bit signed integer
begin
   Success := False;
   Velocity_64 := Get_Velocity;
   
   if Velocity_64 > Long_Float(Max_16) or Velocity_64 < Long_Float(-Max_16) then
      -- Log error and use fallback value
      Log_Error("Velocity out of range for 16-bit conversion");
      Velocity_16 := Fallback_Safe_Velocity;
   else
      Velocity_16 := Integer_16(Velocity_64);
      Success := True;
   end if;
exception
   when others =>
      Log_Error("Unexpected error in velocity conversion");
      Velocity_16 := Fallback_Safe_Velocity;
end Convert_Velocity;

Code in Ada

Lesson learned: Always validate your inputs and handle potential overflows!

2. The $125 Million Typo: NASA's Mars Climate Orbiter

Artist's rendering of the Mars Climate Orbiter. Source: NASA/JPL/Corby Waste

In 1999, NASA's Mars Climate Orbiter got a little too close and personal with the Red Planet, disintegrating in its atmosphere. The reason? A simple unit conversion error. One team's software reported thruster data in imperial units (pound-force seconds) while another team's software expected metric units (newton-seconds). It's like trying to bake a cake with a recipe that switches between cups and milliliters without telling you.

This metric mixup caused the orbiter to approach Mars far lower than planned, turning a scientific mission into a very expensive shooting star.

How could this planetary faux pas have been prevented? By using a clear unit standard and implementing rigorous code reviews. Here's a simple example of how they could have handled unit conversions:

class Force:
    def __init__(self, value, unit):
        self.value = value
        self.unit = unit.lower()

    def to_newtons(self):
        if self.unit == 'n' or self.unit == 'newtons':
            return self.value
        elif self.unit == 'lbf' or self.unit == 'pound-force':
            return self.value * 4.448222  # Conversion factor
        else:
            raise ValueError(f"Unsupported unit: {self.unit}")

# Usage
thrust = Force(100, 'lbf')
thrust_in_newtons = thrust.to_newtons()
print(f"Thrust: {thrust_in_newtons} N")

Simplified code in Python for easier understanding
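Another way to attack the same problem is to give each unit its own type, so a value in the wrong unit is rejected before it reaches the calculation. The sketch below is only an illustration of that idea: the class names and the `plan_burn` function are hypothetical, and it uses force units to stay consistent with the example above.

```python
from dataclasses import dataclass

LBF_TO_NEWTON = 4.448222  # Conversion factor, same as in the example above


@dataclass(frozen=True)
class Newtons:
    value: float


@dataclass(frozen=True)
class PoundsForce:
    value: float

    def to_newtons(self) -> Newtons:
        return Newtons(self.value * LBF_TO_NEWTON)


def plan_burn(thrust: Newtons) -> float:
    """Hypothetical consumer that only accepts SI values."""
    if not isinstance(thrust, Newtons):
        raise TypeError(f"Expected Newtons, got {type(thrust).__name__}")
    return thrust.value


imperial_thrust = PoundsForce(100.0)
print(plan_burn(imperial_thrust.to_newtons()))  # converted explicitly: accepted

try:
    plan_burn(imperial_thrust)  # forgot to convert: rejected loudly
except TypeError as exc:
    print(f"Error: {exc}")
```

The point of the design is that forgetting the conversion becomes an immediate, visible error instead of a silently wrong number.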

The Mars Climate Orbiter software was written in C. Here's a C representation of the unit conversion issue and a potential solution:

#include <math.h>
#include <stdio.h>

#define LBF_TO_NEWTON 4.448222

// Placeholder so the example compiles; stands in for the real trajectory model
double calculate_trajectory(double force_in_newtons) {
    return force_in_newtons;
}

// Original problematic code (simplified)
double apply_thrust_unchecked(double force) {
    // The caller passes pound-force, but calculate_trajectory expects newtons
    return calculate_trajectory(force);
}

// Improved version with explicit unit conversion
typedef enum {
    NEWTON,
    POUND_FORCE
} ForceUnit;

typedef struct {
    double value;
    ForceUnit unit;
} Force;

double force_to_newtons(Force force) {
    switch (force.unit) {
    case NEWTON:
        return force.value;
    case POUND_FORCE:
        return force.value * LBF_TO_NEWTON;
    default:
        fprintf(stderr, "Error: Unknown force unit\n");
        return NAN;  // Not-a-number makes the failure visible downstream
    }
}

double apply_thrust(Force force) {
    double force_in_newtons = force_to_newtons(force);
    return calculate_trajectory(force_in_newtons);
}

int main(void) {
    Force thrust = {100.0, POUND_FORCE};
    double trajectory = apply_thrust(thrust);
    printf("Thrust: %f N, trajectory result: %f\n",
           force_to_newtons(thrust), trajectory);
    return 0;
}

Code in C

Lesson learned: Standardize your units and always double-check your conversions!

3. The Bug That Fried Patients: Therac-25 Radiation Therapy Machine

Simulated Therac-25 computer interface

In the mid-1980s, a radiation therapy machine called the Therac-25 began to malfunction in the worst way possible. Due to a race condition in the code, the machine would occasionally give patients radiation doses that were hundreds of times higher than intended. This tragic bug led to six known overdose incidents between 1985 and 1987, several of them fatal.

The problem stemmed from a perfect storm of software issues: poor synchronization between concurrent processes, inadequate error checking, and a false sense of security from previous, safer models.

How could this have been prevented? By implementing thorough error checking, fail-safe mechanisms, and extensive testing. Previous models also had hardware interlocks to prevent such faults, but the Therac-25 had removed them, depending instead on software checks for safety.

Below is a neatly compiled account of the incidents and the patients who suffered because of this machine, courtesy of Wikipedia's writers.

The first incident occurred on June 3, 1985 at the Kennestone Regional Oncology Center in Marietta, Georgia. The patient was prescribed a 10-MeV electron treatment to her clavicle area, but when the machine turned on she felt a "tremendous force of heat... [a] red-hot sensation." There was no visible sign of tissue damage immediately following treatment, but after going home, the area began to swell and became extremely painful. The patient developed burns all the way through to her back; eventually she needed to have a breast removed and lost the use of her shoulder and arm. Following this incident, the hospital physicist inquired with AECL if an electron beam could be administered without the beam spreader plate in place. AECL incorrectly responded that it was not possible.

The second incident occurred in Hamilton, Ontario, Canada on July 26, 1985. The patient was receiving radiation treatment to a region near the hip. Six times the operator tried to administer treatment, but the machine shut down with an "H-tilt" error message. The operator did not know what this meant but reported it to the hospital technician. The machine's display read "no dose" after each treatment attempt, but when the patient died from cancer a few months later, an autopsy revealed that the patient had such intense radiation burns that he would have required a hip replacement. The incident was reported to AECL.

The third incident occurred in Yakima, Washington during December 1985. Similar to the first case, the patient received a radiation overdose but no cause could be found. The technicians at the hospital contacted AECL about the incident but AECL responded saying that an overdose was not possible and no other incidents had been reported.

The fourth and fifth incidents occurred at the East Texas Cancer Center in March and April 1986. These two incidents were similar because both patients were prescribed electron beam radiation, but during the setup for both treatments, the operator accidentally pressed X for X-ray, then quickly changed it to E for electron beam before turning on the beam. Both times, the display read MALFUNCTION 54 and displayed a gross under dose. The operator's manual made no mention of MALFUNCTION 54. In the first instance, the operator quickly restarted the treatment, but received the same error message. At this time the patient was pounding furiously at the door of the treatment room and was complaining about receiving an electric shock. The hospital shut down the machine for a day, during which engineers and technicians from the hospital and from AECL tested the machine but were unable to replicate the error. After the second incident that resulted in another apparent overdose, the hospital physician ran his own tests and was finally able to replicate the error, determining that it was caused by the speed at which the change from X-ray mode to electron mode occurred. Unfortunately, both patients involved in these incidents died from their radiation exposure.

The sixth and final incident involving the Therac-25 occurred in January 1987, again in Yakima, Washington. Similar to the first three incidents, the operator tried to administer a treatment but an ambiguous error message was displayed. Believing that little or no radiation had been delivered, the operator tried again. Again, the patient complained of a burning sensation and visible reddening of the skin occurred as in the last incident in Yakima. It was determined that the patient had received an overdose, but it was still unclear how it had occurred. AECL began investigating the incident and did find more software errors. Unfortunately, this patient died from complications related to the overdose in April that year.

Source: Wikipedia

Here's a simplified example of how they could have added some basic safety checks:

class RadiationMachine:
    MAX_SAFE_DOSE = 5  # Grays

    def __init__(self):
        self.current_dose = 0
        self.is_beam_on = False

    def set_dose(self, dose):
        if dose > self.MAX_SAFE_DOSE:
            raise ValueError(f"Dose exceeds safe limit of {self.MAX_SAFE_DOSE} Gy")
        self.current_dose = dose

    def activate_beam(self):
        if not self.is_beam_on and self.current_dose <= self.MAX_SAFE_DOSE:
            self.is_beam_on = True
            print("Beam activated")
        else:
            print("ERROR: Cannot activate beam")

    def emergency_shutdown(self):
        self.is_beam_on = False
        self.current_dose = 0
        print("EMERGENCY SHUTDOWN ACTIVATED")

# Usage
machine = RadiationMachine()
try:
    machine.set_dose(3)
    machine.activate_beam()
except ValueError as e:
    print(f"Error: {e}")
    machine.emergency_shutdown()
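The dose check above does not touch the timing problem described in the East Texas incidents, where a quick edit from X-ray to electron mode left the machine in an inconsistent state. A common way to guard against that kind of race is to protect the shared state with a lock and to refuse to fire unless the hardware setup matches the latest request. The following is a minimal sketch of that idea only. The class, method names, and mode strings are hypothetical and are not taken from the real machine.

```python
import threading


class BeamController:
    """Illustrative controller: the beam fires only if setup matches the request."""

    VALID_MODES = ("xray", "electron")

    def __init__(self):
        self._lock = threading.Lock()
        self.requested_mode = None  # what the operator last typed
        self.hardware_mode = None   # what the hardware has finished setting up
        self.beam_on = False

    def request_mode(self, mode):
        if mode not in self.VALID_MODES:
            raise ValueError(f"Unsupported mode: {mode}")
        with self._lock:
            self.requested_mode = mode
            self.hardware_mode = None  # any earlier setup is now stale

    def finish_setup(self, mode):
        """Called when the slow hardware positioning for `mode` completes."""
        with self._lock:
            # A completion for an outdated request is ignored
            if mode == self.requested_mode:
                self.hardware_mode = mode

    def activate_beam(self):
        with self._lock:
            if self.hardware_mode is None:
                return False  # setup missing or stale: refuse to fire
            self.beam_on = True
            return True


controller = BeamController()
controller.request_mode("xray")      # operator presses X...
controller.request_mode("electron")  # ...then quickly corrects to E
controller.finish_setup("xray")      # stale X-ray setup finishes late
print(controller.activate_beam())    # refused: hardware does not match request
controller.finish_setup("electron")
print(controller.activate_beam())    # allowed: hardware matches request
```

Software checks like this are still only one layer. They complement hardware interlocks rather than replace them.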

The Therac-25 software was written in PDP-11 assembly language. Here's a simplified pseudocode representation in a mix of assembly-like syntax and higher-level constructs to illustrate the issues and a safer approach:

; Simplified pseudocode for Therac-25 (not actual assembly)

; Original problematic code (conceptual)
SetDose:
    MOV dose, A         ; Set dose from input
    JMP ActivateBeam    ; Immediately activate beam

ActivateBeam:
    MOV 1, beam_status  ; Turn beam on
    RET

; Improved version with safety checks
SetDose:
    MOV dose, A
    CMP A, MAX_SAFE_DOSE
    JG Error            ; Jump if dose > MAX_SAFE_DOSE
    MOV A, current_dose
    RET

ActivateBeam:
    CMP beam_status, 1
    JE Error            ; Jump if beam already on
    CMP current_dose, MAX_SAFE_DOSE
    JG Error            ; Jump if dose too high
    CALL InitiateSafetyChecks  ; Run the checks before the beam is turned on
    MOV 1, beam_status
    RET

InitiateSafetyChecks:
    ; (Additional safety check procedures)
    RET

Error:
    CALL EmergencyShutdown
    RET

EmergencyShutdown:
    MOV 0, beam_status
    MOV 0, current_dose
    ; (Additional shutdown procedures)
    RET

; Main control loop
Main:
    CALL GetUserInput
    CALL SetDose
    CALL ActivateBeam
    JMP Main

Code in assembly-like syntax

Lesson learned: When lives are at stake, implement multiple layers of safety checks and fail-safe mechanisms!

Conclusion

These cautionary tales remind us that in the world of coding, a small oversight can lead to massive consequences. Always validate your inputs, standardize your units, implement thorough error checking, and never, ever skip those code reviews.

Nowadays it's standard, even expected, to write tests, so make sure to create meaningful ones. No excuses!
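As a rough idea of what "meaningful" can look like, the sketch below tests a range check exactly at its boundaries, where bugs like the Ariane 5 one tend to hide. The helper `fits_in_int16` is a hypothetical function written for this example. The `test_` naming follows the convention that pytest uses to discover tests, and the functions can also be called directly.

```python
def fits_in_int16(value: float) -> bool:
    """Return True if the truncated value fits in a signed 16-bit integer."""
    return -32768 <= int(value) <= 32767


def test_accepts_values_at_the_limits():
    assert fits_in_int16(32767.0)
    assert fits_in_int16(-32768.0)


def test_rejects_values_just_past_the_limits():
    assert not fits_in_int16(32768.0)
    assert not fits_in_int16(-32769.0)


def test_accepts_ordinary_values():
    assert fits_in_int16(0.0)
    assert fits_in_int16(1234.9)
```

A test that only tries a comfortable middle value such as 100 would pass even if one of the limits were wrong. The boundary cases are the ones that earn their keep.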
