Of Aviation Crashes and Software Bugs

I just found out that Stephen Colbert's father and two brothers died in a plane crash on September 11, 1974. Maybe everybody knows this - I'm not sure because I haven't watched TV in years, so I live in a sort of alternate reality. My only exposure to TV is YouTube clips of Jon Stewart, Colbert, and lots of Dora the Explorer (Jon Stewart is my favorite, but Swiper the Fox is a close second - don't tell my kids, though). Now, I may not have TV to keep me informed, but I do read aircraft accident reports and transcripts from cockpit voice recorders. That doesn't help in small talk with the neighbors, but you read some amazing stuff.

For example, in the accident that killed Colbert's father the pilots were chatting about politics and used cars during the landing approach. They ignored their altitude and eventually ran the plane into the ground about 3 miles from the destination airport. The report by the National Transportation Safety Board (NTSB) states that "both crew members [first officer and captain] expressed strong views and mild aggravation concerning the subjects discussed." Since the full CVR transcript is not available, we're free to imagine a Democrat and a Republican arguing amid altitude alerts.

Aviation accidents are both tragic and fascinating; few accidents can be attributed to a single factor and there is usually, well, a series of unfortunate events leading to a crash. The most interesting CVR transcript I've read is Aeroperu 603. It covers an entire flight from the moment the airplane took off with its static ports taped over - causing airspeed, altitude, and vertical speed indicators to behave erratically and provide false data - until the airplane inverted into the Pacific Ocean after its left wing touched the sea, concluding a mad, slow descent in which crew members were bombarded with multiple, false, and often conflicting flight alerts. The transcript captures the increasing levels of desperation, the various alerts, and the plentiful cussing throughout the flight (there's also audio with subtitles). As you read it your brain hammers the question: how do we build stuff so things like this can't happen?

Static ports covered by duct tape in Aeroperu 603

The immediate cause of the Aeroperu problem was a mistake by a ground maintenance worker who left duct tape over the airplane's static ports. But there were a number of failures along the way in maintenance procedures, pilot actions, air traffic control, and arguably aircraft design. This is where agencies like the NTSB and their counterparts abroad do their brilliant and noble work. They analyze the ultimate reason behind each error and failure and then issue recommendations to eradicate whole classes of problems. It's like the five whys of the Toyota Production System, coupled with fixes, on steroids. The fixes are deep and broad, never one-off band-aids.

Take the Colbert plane crash. You could define the problem as "chatter during landing" and prohibit that. But the NTSB went further: they saw the problem as "lack of professionalism" and issued two recommendations to the FAA with a series of concrete steps towards boosting professionalism in all aspects of flight. Further NTSB analysis and recommendations culminated a few years later in the Sterile Cockpit Rule, which lays down precise rules for critical phases of flight, including takeoff, landing, and operations under 10,000 feet. Each aviation accident, error, and causal factor spurs recommendations to prevent it, and anything like it, from ever happening again. Because the solutions are deep, broad, and smart, we have achieved remarkable safety in flight.

In other words, it's the opposite of what we do in software development and computer security. We programmers like our fixes quick and dirty, yes sirree, "patches" we call them. It doesn't matter how critical the software is. Until 1997 Sendmail powered 70% of the Internet's reachable SMTP servers, qualifying it as critical by a reasonable measure (its market share has since decreased). What was the security track record? We had bug after bug after bug, many with disastrous security implications, and all of them fixed with a patch as specific as possible, thereby guaranteeing years of continued new bugs and exploits. Of course this is not as serious as human life, but for software it was pretty damn serious: these were bugs allowing black hats to own thousands of servers remotely.

And what have we learned? If you fast forward a few years, replace "Sendmail" with "WordPress" and "buffer overflow" with "SQL injection/XSS", cynics might say "nothing." We have different technologies but the same patch-and-run mindset. I upgraded my blog to WordPress 2.5.1 the other day and boy, I feel safe already! Security problems are one type of bug; the same story plays out for other kinds. It's a habit we programmers have of not fixing things deeply enough, of blocking the sun with a sieve.

We should instead be fixing whole classes of problems so that certain bugs are hard or impossible to introduce. This is easier than it sounds. Dan Bernstein wrote a replacement for Sendmail called qmail and in 1997 offered a $500 reward for anyone who found a security vulnerability in his software. The prize went unclaimed, and after 10 years he wrote a paper reviewing his approaches, what worked, and what could be better. He identifies only three ways for us to make true progress:

  1. Reduce the bug rate per line of code
  2. Reduce the amount of code
  3. Reduce trusted code (which is different from least privilege)

This post deals only with 1 above; I hope to write about the other two later on. Reducing the bug rate is a holy grail in programming, and qmail was very successful in this area. I'm sure it didn't hurt that Bernstein is a genius, but his techniques are down to earth:

For many years I have been systematically identifying error-prone programming habits—by reviewing the literature, analyzing other people’s mistakes, and analyzing my own mistakes—and redesigning my programming environment to eliminate those habits. (...)

Most programming environments are meta-engineered to make typical software easier to write. They should instead be meta-engineered to make incorrect software harder to write.

In the 1993 book Writing Solid Code Steve Maguire gives similar advice:

The most critical requirement for writing bug-free code is to become attuned to what causes bugs. All of the techniques in this book are the result of programmers asking themselves two questions over and over again, year after year, for every bug found in their code:

  • How could I have automatically detected this bug?
  • How could I have prevented this bug?

For a concrete example, look at SQL injection. How do you prevent it? If you prevent it by remembering to sanitize each bit of input that goes to the database, then you have not solved the problem; you are using a band-aid with a failure rate - it's Russian roulette. But you can truly solve the problem by using an architecture or tools such that SQL injections are impossible to cause. The Ruby on Rails ActiveRecord does this to some degree. In C# 3.0, a great language in many regards, SQL injections are literally impossible to express in the language's built-in query mechanism (LINQ). This is the kind of all-encompassing, solve-it-once-and-for-all solution we must seek.
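The underlying principle - keep user data out of the SQL grammar entirely - is available in almost every environment via parameterized queries. Here is a minimal sketch in Python with sqlite3 (the table and the attack string are made up for illustration):

```python
import sqlite3

# In-memory database for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

# Vulnerable: attacker-controlled input is spliced into the SQL text,
# so the input can rewrite the query's structure.
evil = "x' OR '1'='1"
unsafe_sql = "SELECT name FROM users WHERE name = '%s'" % evil
leaked = conn.execute(unsafe_sql).fetchall()  # injection matches every row

# Safe: the ? placeholder passes the value out of band; no input string
# can change the shape of the query.
safe = conn.execute("SELECT name FROM users WHERE name = ?", (evil,)).fetchall()

print(leaked)  # [('alice',)]
print(safe)    # []
```

When every query in the codebase goes through placeholders (or an ORM that uses them), "forgot to sanitize" stops being a possible bug rather than a rare one.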

It's important to take a broad look at our programming environments to come up with solutions for preventing bugs. This mindset matters more than specific techniques; we've got to be in the habit of going well beyond the first "why". Why have we wasted hundreds of thousands of man-hours looking for memory leaks, buffer overflows, and dangling pointers in C/C++ code? It wasn't just because you forgot to free() or you kept a pointer improperly, no. That was a symptom. The reality is that for most projects, using C/C++ was the bug; it didn't just facilitate bugs. We can't tolerate environments that breed defects instead of preventing them.

Multi-threaded programming is another example of a perverse environment where things are opposite of what they should be: writing correct threading code is hard (really hard), but writing threading bugs is natural and takes no effort. Any design that expects widespread mastery of concurrency, ordering, and memory barriers as a condition for correctness is doomed from the start. It needs to be fixed so that bug-free code is automatic rather than miraculous.
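One way to make correct threading the path of least resistance is to confine all mutable state to a single thread and communicate only through queues, so there is nothing shared for two threads to corrupt. A minimal Python sketch of that idea (the worker and names are illustrative, not a general framework):

```python
import threading
import queue

def counter_worker(inbox: queue.Queue, results: queue.Queue) -> None:
    """All mutable state lives here; no other thread can touch it."""
    total = 0                      # owned exclusively by this thread
    while True:
        item = inbox.get()
        if item is None:           # sentinel: shut down and report
            results.put(total)
            return
        total += item

inbox: queue.Queue = queue.Queue()
results: queue.Queue = queue.Queue()
worker = threading.Thread(target=counter_worker, args=(inbox, results))
worker.start()

# Producers never increment `total` directly - they send messages.
for _ in range(100):
    inbox.put(1)
inbox.put(None)
worker.join()

final = results.get()
print(final)  # 100
```

There are no locks to forget here: the design removes the shared state that locks exist to protect, which is exactly the "bug-free by construction" property the paragraph above asks for.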

There are a number of layers that can prevent a bug from infecting your code: software process, tools, programming language, libraries, architecture, unit tests, your own habits, etc. Troubleshooting this whole programming stack, not just code, is how we can add depth and breadth to our fixes and make progress. The particulars depend on what kind of programming you do, but here are some questions that might be worth asking, in the spirit of the questions above, when you find a bug:

  • Are you using the right programming language? Does it handle memory for you? Does it help minimize lines of code and duplication? (Here's a good overall comparison and an interesting empirical study)
  • Could a better library or framework have prevented the bug (as in the SQL Injection example above)?
  • Can architecture changes prevent that class of bug or mitigate their impact?
  • Why did your unit tests fail to catch the bug?
  • Could compiler warnings, static analysis, or other tools have found this bug?
  • Is it at all possible to avoid explicit threading? If so, shun threads because they're a bad idea. Otherwise, can you eliminate bugs by isolating the threads (reduce shared state aggressively, use read-only data structures, use as few locks as possible)?
  • Is your error-handling strategy simple and consistent? Can you centralize and minimize catch blocks for exceptions?
  • Are your class interfaces bug prone? Can you change them to make correct usage obvious, or better yet, incorrect usage impossible?
  • Could argument validation have prevented this bug? Assertions?
  • Would you have caught this bug if you regularly stepped through newly written code in a debugger while thinking of ways to make the code fail?
  • Could software process tools have prevented this bug? Continuous integration, code reviews, programming conventions and so on can help a lot. Can you modify your processes to reduce bug rate?
  • Have you read Code Complete and the Pragmatic Programmer?
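On the question of bug-prone class interfaces, here is a small Python sketch of turning "correct usage is the caller's job" into "incorrect usage is inexpressible" (the Resource class is hypothetical):

```python
from contextlib import contextmanager

class Resource:
    """Hypothetical resource with a bug-prone acquire/release protocol."""
    def __init__(self) -> None:
        self.open = False
    def acquire(self) -> None:
        self.open = True
    def release(self) -> None:
        self.open = False

# Bug-prone usage: callers must remember to call release() on every
# path, including exceptions. Safer interface: a with-block makes
# "forgot to release" impossible to write, because cleanup always runs.
@contextmanager
def acquired(resource: Resource):
    resource.acquire()
    try:
        yield resource
    finally:
        resource.release()

r = Resource()
try:
    with acquired(r) as res:
        assert res.open
        raise RuntimeError("something failed mid-use")
except RuntimeError:
    pass

print(r.open)  # False - released despite the exception
```

The same move appears in many guises: RAII in C++, try-with-resources in Java, `using` in C#. Each replaces a convention with a construct.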

As airplanes still crash we'll always have our bugs, but we could do a lot better by improving our programming ecosystem and habits rather than just fixing the problem of the hour. The outstanding work of the NTSB is great inspiration. I'm still scared of flying though - think of all the software in those planes!


Richard Feynman, the Challenger Disaster, and Software Engineering

Challenger Crew

On January 28th, 1986, Space Shuttle Challenger was launched at 11:38 a.m. on the 6-day STS-51-L mission. During the first 3 seconds of liftoff the O-rings (ring-shaped seals between the joined segments of the booster casing) in the shuttle's right-hand solid rocket booster (SRB) failed. As a result, hot gases with temperatures above 5,000 °F leaked out of the booster, vaporized the O-rings, and damaged the SRB's joints. The shuttle started its ascent, but seventy-three seconds into the flight the compromised SRB pulled away from the Challenger, leading to sudden lateral acceleration. Pilot Michael J. Smith uttered "Uh oh" just before the shuttle broke up. Torn apart by excessive force, it disintegrated rapidly. Within seconds the severed but nearly intact crew cabin began to free fall, and seven astronauts plunged to their deaths. I was a child then and remember watching in horror as Brazilian TV showed the footage.

Challenger Explosion

At the time I didn't know that SRB engineers had previously warned about problems in the O-rings, but had been dismissed by NASA management. I also didn't know who Richard Feynman or Ronald Reagan were. It turns out that President Reagan created the Rogers Commission to investigate the disaster. Physicist Feynman was invited as a member, but his independent intellect and direct methods were at odds with the commission's formal approach. Chairman Rogers, a politician, remarked that Feynman was "becoming a real pain." In the end the commission produced a report, but Feynman's rebellious opinions were kept out of it. When he threatened to take his name off the report altogether, they agreed to include his thoughts as Appendix F - Personal Observations on Reliability of Shuttle.

It is a good thing it was included, because the 10-page document is a work of brilliance. It has deep insights into the nature of engineering and into how reliable systems are built. And you see, I didn't put 'software' in the title just to trick you. Feynman's conclusions are general and very much relevant for software development. After all, as Steve McConnell tirelessly points out, there is much in common between software and other engineering disciplines. But don't take my word for it. Take Feynman's:

The Space Shuttle Main Engine was handled in a different manner, top down, we might say. The engine was designed and put together all at once with relatively little detailed preliminary study of the material and components. Then when troubles are found in the bearings, turbine blades, coolant pipes, etc., it is more expensive and difficult to discover the causes and make changes.

So software is not the only discipline where the longer a defect stays in the process, the more expensive it is to fix. It's also not the only discipline where a "top down" design, made in ignorance of detailed bottom-up knowledge, leads to problems. There is, however, a difference here between design and requirements. The requirements for the engine were clear and well defined. You know, go to space and back, preferably without blowing up. Feynman is arguing not so much against Joel's functional specs, but rather against top down design such as that advocated by the UML-as-blueprint crowd. On goes Feynman:

The Space Shuttle Main Engine is a very remarkable machine. It has a greater ratio of thrust to weight than any previous engine. It is built at the edge of, or outside of, previous engineering experience. Therefore, as expected, many different kinds of flaws and difficulties have turned up. Because, unfortunately, it was built in the top-down manner, they are difficult to find and fix. The design aim of a lifetime of 55 missions equivalent firings (27,000 seconds of operation, either in a mission of 500 seconds, or on a test stand) has not been obtained. The engine now requires very frequent maintenance and replacement of important parts, such as turbopumps, bearings, sheet metal housings, etc.

Richard Feynman

Unfortunate top down manner, difficult to find and fix, failure to meet design requirements, frequent maintenance. Sound familiar? Is software engineering really a world apart, removed from its sister disciplines? Feynman elaborates on the difficulty in achieving correctness due to the 'top down' approach:

Many of these solved problems are the early difficulties of a new design. Naturally, one can never be sure that all the bugs are out, and, for some, the fix may not have addressed the true cause.

Whether it's the Linux kernel or shuttle engines, there are fundamental cross-discipline issues in design. One of them is the folly of a top-down approach, which ignores the reality that detailed knowledge about the bottom parts is a necessity, not something that can be abstracted away. He then talks about the avionics system, which was done by a different group at NASA:

The software is checked very carefully in a bottom-up fashion. First, each new line of code is checked, then sections of code or modules with special functions are verified. The scope is increased step by step until the new changes are incorporated into a complete system and checked. This complete output is considered the final product, newly released. But completely independently there is an independent verification group, that takes an adversary attitude to the software development group, and tests and verifies the software as if it were a customer of the delivered product.

Yes, go ahead and pinch yourself: this is unit testing described in 1986 by the Feynman we know and love. Not only unit testing, but 'step by step increase' in scope and 'adversarial testing attitude'. It's common to hear we suck at software because it's a "young discipline", as if the knowledge to do right has not yet been attained. Bollocks! We suck because we constantly ignore well-established, well-known, empirically proven practices. In this regard management is also to blame, especially when it comes to dysfunctional schedules, wrong incentives, poor hiring, and demoralizing policies. Management/engineering tensions and the effects of bad management are keenly discussed by Feynman in his report. Here is one short example:

To summarize then, the computer software checking system and attitude is of the highest quality. There appears to be no process of gradually fooling oneself while degrading standards so characteristic of the Solid Rocket Booster or Space Shuttle Main Engine safety systems. To be sure, there have been recent suggestions by management to curtail such elaborate and expensive tests as being unnecessary at this late date in Shuttle history.

This is one of many passages. I picked it because it touches on other points, such as the 'attitude of highest quality' and the 'process of gradually fooling oneself'. I encourage you to read the whole report, unblemished by yours truly. With respect to software, I take away four main points:

  • Engineering can only be as good as its relationship with management
  • Big design up front is foolish
  • Software has much in common with other engineering disciplines
  • Reliable systems are built by rigorously tested, incremental bottom-up engineering with an 'attitude of highest quality'

There are other interesting themes in there, and Feynman's insight can't be captured in a few bullet points, much less by me. What do you get out of it?

Feynman's last board at Caltech