Of Aviation Crashes and Software Bugs

I just found out that Stephen Colbert's father and two brothers died in a plane crash on September 11, 1974. Maybe everybody knows this - I'm not sure because I haven't watched TV in years, so I live in a sort of alternate reality. My only exposure to TV is YouTube clips of Jon Stewart, Colbert, and lots of Dora The Explorer (Jon Stewart is my favorite but Swiper The Fox is a close second, don't tell my kids though). Now, I may not have TV to keep me informed, but I do read aircraft accident reports and transcripts from cockpit voice recorders. That doesn't help in small talk with the neighbors, but you read some amazing stuff.

For example, in the accident that killed Colbert's father, the pilots were chatting about politics and used cars during the landing approach. They ignored their altitude and eventually ran the plane into the ground about 3 miles away from the destination airport. The report by the National Transportation Safety Board (NTSB) states that "both crew members [first officer and captain] expressed strong views and mild aggravation concerning the subjects discussed." Since the full cockpit voice recorder (CVR) transcript is not available, we're free to imagine a Democrat and a Republican arguing amid altitude alerts.

Aviation accidents are both tragic and fascinating; few accidents can be attributed to a single factor and there is usually, well, a series of unfortunate events leading to a crash. The most interesting CVR transcript I've read is Aeroperu 603. It covers an entire flight from the moment the airplane took off with its static ports taped over - causing airspeed, altitude, and vertical speed indicators to behave erratically and provide false data - until the airplane inverted into the Pacific Ocean after its left wing touched the sea, concluding a mad, slow descent in which crew members were bombarded with multiple, false, and often conflicting flight alerts. The transcript captures the increasing levels of desperation, the various alerts, and the plentiful cussing throughout the flight (there's also audio with subtitles). As you read it your brain hammers the question: how do we build stuff so things like this can't happen?

[Image: static ports covered by duct tape in Aeroperu 603]

The immediate cause of the Aeroperu problem was a mistake by a ground maintenance worker who left duct tape over the airplane's static ports. But there were a number of failures along the way in maintenance procedures, pilot actions, air traffic control, and arguably aircraft design. This is where agencies like the NTSB and their counterparts abroad do their brilliant and noble work. They analyze the ultimate reason behind each error and failure and then issue recommendations to eradicate whole classes of problems. It's like the five whys of the Toyota Production System, only on steroids and coupled with actual fixes. Fixes are deep and broad, never one-off band-aids.

Take the Colbert plane crash. You could define the problem as "chatter during landing" and prohibit that. But the NTSB went further: they saw the problem as "lack of professionalism" and issued two recommendations to the FAA with a series of concrete steps towards boosting professionalism in all aspects of flight. Further NTSB analysis and recommendations culminated a few years later in the Sterile Cockpit Rule, which lays down precise rules for critical phases of flight including takeoff, landing, and operations under 10,000 feet. Each aviation accident, error, and causal factor spurs recommendations to prevent it, and anything like it, from ever happening again. Because the solutions are deep, broad, and smart, we have achieved remarkable safety in flight.

In other words, it's the opposite of what we do in software development and computer security. We programmers like our fixes quick and dirty, yes sirree, "patches" we call them. It doesn't matter how critical the software is. Until 1997 Sendmail powered 70% of the Internet's reachable SMTP servers, qualifying it as critical by a reasonable measure (its market share has since decreased). What was the security track record? We had bug after bug after bug, many with disastrous security implications, and all of them fixed with a patch as specific as possible, thereby guaranteeing years of continued new bugs and exploits. Of course this is not as serious as human life, but for software it was pretty damn serious: these were bugs allowing black hats to own thousands of servers remotely.

And what have we learned? If you fast forward a few years, replace "Sendmail" with "WordPress" and "buffer overflow" with "SQL injection/XSS", cynics might say "nothing." We have different technologies but the same patch-and-run mindset. I upgraded my blog to WordPress 2.5.1 the other day and boy I feel safe already! Security problems are just one type of bug; the same story plays out for other problems. It's a habit we programmers have of not fixing things deeply enough, of blocking the sun with a sieve.

We should instead be fixing whole classes of problems so that certain bugs are hard or impossible to implement. This is easier than it sounds. Dan Bernstein wrote a replacement for Sendmail called qmail and in 1997 offered a $500 reward for anyone who found a security vulnerability in his software. The prize went unclaimed and after 10 years he wrote a paper reviewing his approaches, what worked, and what could be better. He identifies only three ways for us to make true progress:

1. Reduce the bug rate per line of code
2. Reduce the amount of code
3. Reduce trusted code (which is different from least privilege)

This post deals only with the first item above; I hope to write about the other two later on. Reducing the bug rate is a holy grail in programming and qmail was very successful in this area. I'm sure it didn't hurt that Bernstein is a genius, but his techniques are down to earth:

For many years I have been systematically identifying error-prone programming habits—by reviewing the literature, analyzing other people’s mistakes, and analyzing my own mistakes—and redesigning my programming environment to eliminate those habits. (...)

Most programming environments are meta-engineered to make typical software easier to write. They should instead be meta-engineered to make incorrect software harder to write.

In the 1993 book Writing Solid Code, Steve Maguire gives similar advice:

The most critical requirement for writing bug-free code is to become attuned to what causes bugs. All of the techniques in this book are the result of programmers asking themselves two questions over and over again, year after year, for every bug found in their code:

How could I have automatically detected this bug?

How could I have prevented this bug?

For a concrete example, look at SQL injection. How do you prevent it? If you prevent it by remembering to sanitize each bit of input that goes to the database, then you have not solved the problem; you are using a band-aid with a failure rate - it's Russian Roulette. But you can truly solve the problem by using an architecture or tools such that SQL injections are impossible to cause. ActiveRecord in Ruby on Rails does this to some degree. In C# 3.0, a great language in many regards, SQL injections are literally impossible to express in the language's built-in query mechanism (LINQ). This is the kind of all-encompassing, solve-it-once-and-for-all solution we must seek.
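To make the contrast concrete, here is a minimal sketch in Python using the standard sqlite3 module. The table layout and the function names (make_demo_db, find_user_unsafe, find_user_safe) are made up for illustration. The first lookup splices input into the SQL text; the second keeps the query text fixed and passes the value as a bound parameter, so the input can never be parsed as SQL.

```python
import sqlite3


def make_demo_db():
    # Hypothetical schema, in memory, just for this illustration.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users (name) VALUES (?)",
                     [("alice",), ("bob",)])
    return conn


def find_user_unsafe(conn, name):
    # The input becomes part of the SQL text itself, so the database
    # cannot tell data from code. Do not do this.
    query = "SELECT id, name FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, name):
    # The query text is constant; the value travels separately as a
    # bound parameter and is only ever treated as data.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()
```

With the demo database, passing the string nobody' OR '1'='1 to find_user_unsafe returns both rows, while find_user_safe returns none. Note that a bound parameter is still something you have to remember to use, so on its own it is closer to a good habit than to the C# guarantee; a stricter design would hide the raw connection and expose only functions like find_user_safe.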

It's important to take a broad look at our programming environments to come up with solutions for preventing bugs. This mindset matters more than specific techniques; we've got to be in the habit of going well beyond the first "why". Why have we wasted hundreds of thousands of man hours looking for memory leaks, buffer overflows, and dangling pointers in C/C++ code? It wasn't just because you forgot to free() or you kept a pointer improperly, no. That was a symptom. The reality is that for most projects using C/C++ was the bug; it didn't just facilitate bugs. We can't tolerate environments that breed defects instead of preventing them.

Multi-threaded programming is another example of a perverse environment where things are the opposite of what they should be: writing correct threading code is hard (really hard), but writing threading bugs is natural and takes no effort. Any design that expects widespread mastery of concurrency, ordering, and memory barriers as a condition for correctness is doomed from the start. It needs to be fixed so that bug-free code is automatic rather than miraculous.
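One way to get closer to "automatic", assuming the work can be split into independent items, is to stop threads from sharing mutable state at all and let them talk only through thread-safe queues. The sketch below uses Python's standard queue and threading modules; worker and square_all are hypothetical names, and squaring numbers stands in for real work. There are no locks in user code, so there is no lock ordering to get wrong.

```python
import queue
import threading


def worker(inbox, outbox):
    # A worker touches nothing but its own locals and the two queues.
    while True:
        item = inbox.get()
        if item is None:  # sentinel: no more work for this worker
            break
        outbox.put(item * item)


def square_all(numbers, n_workers=4):
    # Hypothetical helper: fan work out to isolated workers, collect results.
    inbox, outbox = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(inbox, outbox))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for n in numbers:
        inbox.put(n)
    for _ in threads:
        inbox.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    # All workers have finished, so draining the queue here is race-free.
    results = []
    while not outbox.empty():
        results.append(outbox.get())
    return sorted(results)
```

Calling square_all(range(5)) returns [0, 1, 4, 9, 16]. The results are sorted because completion order across workers is not deterministic.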

There are a number of layers that can prevent a bug from infecting your code: software process, tools, programming language, libraries, architecture, unit tests, your own habits, etc. Troubleshooting this whole programming stack, not just code, is how we can add depth and breadth to our fixes and make progress. The particulars depend on what kind of programming you do, but here are some questions that might be worth asking, in the spirit of the questions above, when you find a bug:

Are you using the right programming language? Does it handle memory for you? Does it help minimize lines of code and duplication? (Here's a good overall comparison and an interesting empirical study)
Could a better library or framework have prevented the bug (as in the SQL Injection example above)?
Can architecture changes prevent that class of bug or mitigate their impact?
Why did your unit tests fail to catch the bug?
Could compiler warnings, static analysis, or other tools have found this bug?
Is it at all possible to avoid explicit threading? If so, shun threads because they're a bad idea. Otherwise, can you eliminate bugs by isolating the threads (reduce shared state aggressively, use read-only data structures, use as few locks as possible)?
Is your error-handling strategy simple and consistent? Can you centralize and minimize catch blocks for exceptions?
Are your class interfaces bug prone? Can you change them to make correct usage obvious, or better yet, incorrect usage impossible?
Could argument validation have prevented this bug? Assertions?
Would you have caught this bug if you regularly stepped through newly written code in a debugger while thinking of ways to make the code fail?
Could software process tools have prevented this bug? Continuous integration, code reviews, programming conventions and so on can help a lot. Can you modify your processes to reduce bug rate?
Have you read Code Complete and The Pragmatic Programmer?
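To make two of the questions above a bit more tangible (bug-prone interfaces and argument validation), here is a small Python sketch. Percent and apply_discount are invented for illustration, not taken from any library. The idea is that an out-of-range percentage cannot exist as a value, and the function cannot be called with its arguments silently swapped, because they must be passed by name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Percent:
    # Hypothetical value type: validated once at construction and immutable
    # afterwards, so every Percent in the program is known to be in range.
    value: float

    def __post_init__(self):
        if not 0 <= self.value <= 100:
            raise ValueError(f"percent out of range: {self.value}")


def apply_discount(*, price_cents: int, discount: Percent) -> int:
    # Keyword-only arguments: a caller cannot mix up price and discount
    # by position, which is a TypeError rather than a silent bug.
    return round(price_cents * (100 - discount.value) / 100)
```

Here apply_discount(price_cents=2000, discount=Percent(25)) returns 1500, Percent(150) raises ValueError, and apply_discount(2000, Percent(25)) raises TypeError. In a statically typed language some of these checks could move to compile time; in Python they are runtime checks, but they still fail loudly at the point of the mistake.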

Just as airplanes still crash, we'll always have our bugs, but we could do a lot better by improving our programming ecosystem and habits rather than just fixing the problem of the hour. The outstanding work of the NTSB is great inspiration. I'm still scared of flying though - think of all the software in those planes!


Many But Finite
