3/10/2005

On Perfect Software: It's the Process, Stupid!

Recently an old article I'd seen several times before surfaced on a couple of blogs, and I went back and re-read it. It's about the on-board shuttle group, and how they write perfect code. And I mean PERFECT. This is the software that makes the Space Shuttle go. What makes it remarkable is how well the software works. This software never crashes. It never needs to be re-booted. This software is bug-free. It is perfect, as perfect as human beings have ever achieved. The last three versions of the program -- each 420,000 lines long -- had just one error each. I've lifted out and edited a few lines to save you some searching and reading time; what follows is really the essence of this whole concept:

This is all the work of 260 women and men based in an anonymous office building across the street from the Johnson Space Center in Clear Lake, Texas, southeast of Houston. Their prowess is world-renowned: the shuttle software group is one of just four outfits in the world to win the Level 5 ranking of the government's Software Engineering Institute.

But HOW do they write the right stuff?

The answer is, it's their process. The group's most important creation is not the perfect software they write -- it's the process they invented that writes the perfect software, and the process can be reduced to these four simple propositions:

1. The product is only as good as the plan for the product.

At the on-board shuttle group, about one-third of the process of writing software happens before anyone writes a line of code. NASA and the Lockheed Martin group agree in the most minute detail about everything the new code is supposed to do -- and they commit that understanding to paper, with the kind of specificity and precision usually found in blueprints. The specs for the current program fill 30 volumes and run 40,000 pages.

Most organizations launch into even big projects without planning what the software must do in blueprint-like detail. So after coders have already started writing a program, the customer is busily changing its design. The result is chaotic, costly programming where code is constantly being changed and infected with errors, even as it is being designed.


2. The best teamwork is a healthy rivalry.

Within the software group, there are subgroups and subcultures.

The central group breaks down into two key teams: the coders -- the people who sit and write code -- and the verifiers -- the people who try to find flaws in the code. The two outfits report to separate bosses and function under opposing marching orders. The development group is supposed to deliver completely error-free code, so perfect that the testers find no flaws at all. The testing group is supposed to pummel away at the code with flight scenarios and simulations that reveal as many flaws as possible.

The result of this friendly rivalry: the shuttle group now finds 85% of its errors before formal testing begins, and 99.9% before the program is delivered to NASA.
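To put those two percentages in concrete terms, here is a small Python sketch of the arithmetic. The 1,000-error starting figure is purely a round number I picked for illustration; it is not from the article, and the function name is made up.

```python
from fractions import Fraction

def detection_breakdown(total_errors,
                        before_testing=Fraction(85, 100),
                        before_delivery=Fraction(999, 1000)):
    """Split a hypothetical error count across the stages named in the article.

    The two rates are the article's 85% and 99.9%; total_errors is an
    assumption chosen only to make the arithmetic easy to follow.
    """
    found_early = total_errors * before_testing
    found_by_delivery = total_errors * before_delivery
    return {
        "before_formal_testing": found_early,
        "during_formal_testing": found_by_delivery - found_early,
        "after_delivery": total_errors - found_by_delivery,
    }

# Hypothetical: 1,000 errors made over the life of a release.
breakdown = detection_breakdown(1000)
for stage, count in breakdown.items():
    print(f"{stage}: {float(count):g}")
```

Under that assumption, 850 errors are caught before formal testing starts, another 149 during formal testing, and 1 slips through to NASA.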

3. The database is the software base.

There is the software. And then there are the databases beneath the software, two enormous databases, encyclopedic in their comprehensiveness.

One is the history of the code itself -- with every line annotated, showing every time it was changed, why it was changed, when it was changed, what the purpose of the change was, and which specification documents detail the change. Everything that happens to the program is recorded in its master history. The genealogy of every line of code -- the reason it is the way it is -- is instantly available to everyone.

The other database -- the error database -- stands as a kind of monument to the way the on-board shuttle group goes about its work. Here is recorded every single error ever made while writing or working on the software, going back almost 20 years. For every one of those errors, the database records when the error was discovered; what set of commands revealed the error; who discovered it; what activity was going on when it was discovered -- testing, training, or flight.
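For a feel of what one entry in an error database like that might hold, here is a minimal Python sketch. It is only my guess at a record layout built from the four fields listed above; the class names, field names, and sample entries are all invented for illustration and say nothing about how the shuttle group actually stores its data.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from enum import Enum

class Activity(Enum):
    """What was going on when the error was discovered."""
    TESTING = "testing"
    TRAINING = "training"
    FLIGHT = "flight"

@dataclass(frozen=True)
class ErrorRecord:
    """One hypothetical error-database entry (field names are assumptions)."""
    discovered_on: date          # when the error was discovered
    revealing_commands: tuple    # what set of commands revealed it
    discovered_by: str           # who discovered it
    activity: Activity           # testing, training, or flight

def count_by_activity(records):
    """Tally errors by the activity during which they were found."""
    return Counter(r.activity for r in records)

# Made-up sample entries, not real shuttle data.
records = [
    ErrorRecord(date(1995, 3, 1), ("CMD_A", "CMD_B"), "verifier-1", Activity.TESTING),
    ErrorRecord(date(1995, 6, 12), ("CMD_C",), "verifier-2", Activity.TESTING),
    ErrorRecord(date(1996, 1, 20), ("CMD_D",), "trainer-1", Activity.TRAINING),
]

for activity, n in count_by_activity(records).items():
    print(activity.value, n)
```

The point of keeping every error in a structured form like this is that you can ask questions of the whole history, such as where errors tend to get caught, instead of treating each bug as a one-off.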

4. Don't just fix the mistakes -- fix whatever permitted the mistake in the first place.


The process is so pervasive, it gets the blame for any error -- if there is a flaw in the software, there must be something wrong with the way it's being written, something that can be corrected. Any error not found at the planning stage has slipped through at least some checks. Why? Is there something wrong with the inspection process? Does a question need to be added to a checklist?

The way the process works, it not only finds errors in the software. The process finds errors in the process.

In the history of human technology, nothing has become as essential as fast as software. Virtually everything -- from the international monetary system and major power plants to blenders and microwave ovens -- runs on software. In office buildings, the elevators, the lights, the water, and the air conditioning are all controlled by software. In cars, the transmission, the ignition timing, the air bag, even the door locks are controlled by software. In most cities, so are the traffic lights. Almost every written communication that's more complicated than a postcard depends on software; every phone conversation and every overnight package delivery requires it. And yet, 80% of major organizations basically write software that sucks, because they don't have a process.

Think about it.