When Failure is Not an Option

Some organizations seem to have purged “human error,” operating highly complex and hazardous technological systems essentially without mistakes. How do they do it?

Robert Pool

July 1, 1997

Success is much harder to analyze than failure. When things go wrong in a chemical plant or space program, it’s usually possible to figure out the causes and resolve to avoid those things in the future. But when things go right, it’s difficult to know why. Which factors were important to the success, and which weren’t? Was the success due to skill, or just luck? If we are to learn to deal with hazardous technologies, our best bet is to look for organizations that manage risk successfully and see how they do it.

This is the goal of the high-reliability organization project at the University of California, Berkeley. For more than a decade, Todd La Porte, Karlene Roberts, and Gene Rochlin have been studying groups that seem to do the impossible: operate highly complex and hazardous technological systems essentially without mistakes. The U.S. air traffic control system, for instance, handles tens of thousands of flights a day around the country. Air traffic controllers are responsible not only for choreographing the takeoffs and landings of dozens or hundreds of flights per hour at airports but also for directing the flight paths of the planes so that each keeps a safe distance from the others. The success is unequivocal: for more than a decade none of the aircraft monitored on the controllers’ radar screens has collided with another. Yet the intricate dance of planes approaching and leaving airports, crisscrossing one another’s paths at several hundred miles an hour, creates plenty of opportunity for error. This record of safety is not due to extremely good luck, the three Berkeley researchers conclude, but to the fact that the institution has learned how to deal effectively with a complex, hazardous technology.

Perhaps the most impressive organizations they have studied are the nuclear aircraft carriers of the U.S. Navy. While it’s impossible for anyone who hasn’t worked on such a ship to truly understand the complexity, stress, and hazards of its operations, this description by a carrier officer to the Berkeley researchers offers a taste:

So you want to understand an aircraft carrier? Well, just imagine that it’s a busy day, and you shrink San Francisco Airport to only one short runway and one ramp and gate. Make planes take off and land at the same time, at half the present time interval, rock the runway from side to side, and require that everyone who leaves in the morning returns that same day. Then turn off the radar to avoid detection, impose strict controls on radios, fuel the aircraft in place with their engines running, put an enemy in the air, and scatter live bombs and rockets around. Now wet the whole thing down with salt water and oil, and man it with 20-year-olds, half of whom have never seen an airplane close up. Oh, and by the way, try not to kill anyone.

A Nimitz-class carrier flies ninety aircraft of seven different types. These aircraft have only several hundred feet in which to take off and land instead of the mile or more available at commercial airports, so they need help. At takeoff, the planes are catapulted by steam-powered slingshots that accelerate them from standstill to 140 knots (160 miles per hour) in just over two seconds. As each plane is moved into place on the steam catapult, crewmen check it one last time to make sure that the control surfaces are functioning and that no fuel leaks or other problems are visible. The catapult officer sets the steam pressure for each launch depending on the weight of the plane and wind conditions. The spacing of the launches, about every 50 seconds, leaves no time for errors.

But it is the recovery of the planes that is truly impressive. They approach the flight deck at 120 to 130 knots with a tail hook hanging down to catch one of four arresting wires stretched across the deck. As a plane approaches, the pilot radios his or her fuel level. With this information, the people in charge of the arresting gear calculate the weight of the plane and figure the proper setting for the arresting-gear braking machines. If the pressure is set too low, the plane may not stop soon enough and so topple off the end of the deck into the sea. If the wire is too taut, it could pull the tail hook off or else snap and lash out across the deck, injuring or killing anyone in its path. The pressure for each of the four wires is set individually by a single seaman.

Meanwhile, landing signal officers are watching the approach of the plane, advising the pilot and then, if everything appears right, okaying the landing. Just as the plane touches down, the pilot gives it full throttle so that if the hook does not catch, the plane will be going fast enough to take off and come around again. If the hook does catch a wire, the plane is slammed to a halt within about two seconds and 300 feet. As soon as the plane is down and stopped, “yellow shirts” rush to it to check the hook and to get the plane out of the way of the next one. As the arresting wires are pulled back, other crewmen check them for frays. Then it all begins again. The cycle has lasted about 60 seconds.

The launching and recovery are only part of a much larger process including maintenance, fueling and arming, and maneuvering and parking the planes on a crowded deck. What makes the whole process so astonishing is that it is done not with people who have been working together for years but with a crew that turns over regularly. As writer John Pfeiffer observed, “The captain will be aboard for only three years, the 20 senior officers for about two and a half; most of the more than 5,000 enlisted men and women will leave the Navy or be transferred after their three-year carrier stints. Furthermore, they are predominantly teenagers, so that the average age aboard a carrier comes to a callow 20.”

What sort of organization can operate so reliably under such handicaps? La Porte, Roberts, and Rochlin spent a great deal of time on several carriers both in port and at sea, during training and on active duty, and they believe they understand at least part of the answer.

On the surface, an aircraft carrier appears to be organized along traditional hierarchical lines, with authority running from the captain down through the ranks in a clearly defined pattern. And indeed, much of the day-to-day operation of the ship does proceed this way, with discipline rather strictly enforced. Thick manuals of standard operating procedures govern this process, and much of the navy training is devoted to making them second nature. These procedures codify lessons learned from years of experience. But, as the Berkeley researchers discovered, the carrier’s inner life is much more complicated.

When things heat up, as during the launching and recovery of planes, the organizational structure shifts into another gear. Now the crew members interact much more as colleagues and less as superiors and subordinates. Cooperation and communication become more important than orders passed down the chain of command and information passed back up. With a plane taking off or landing once a minute, events can happen too quickly for instructions or authorizations from above. The crew members act as a team, each watching what others are doing and all of them communicating constantly through telephones, radios, hand signals, and written details. This constant flow of information helps flag mistakes before they’ve caused any damage. Seasoned personnel continuously monitor the action, listening for anything that doesn’t fit and correcting a mistake before it causes trouble.

A third level of organizational structure is reserved for emergencies, such as a fire on the flight deck. The ship’s crew has carefully rehearsed procedures to follow in such cases, with each member assuming a preassigned role. If an emergency occurs, the crew can react immediately and effectively without direction.

This multi-layered organizational structure asks much more from the crew than a traditional hierarchy, where following orders is the safest path and underlings are not encouraged to think for themselves. Here, the welfare of the ship and crew is everyone’s responsibility. As the Berkeley researchers note, “Even the lowest rating on the deck has not only the authority, but the obligation to suspend flight operations immediately, under the proper circumstances and without first clearing it with superiors. Although his judgment may later be reviewed or even criticized, he will not be penalized for being wrong and will often be publicly congratulated if he is right.”

The involvement of everyone, combined with the steady turnover among the officers and crew, also helps the Navy prevent such operations from becoming routine and boring. Because of the regular coming and going of personnel, people on the ship are constantly learning new skills and teaching what they’ve learned to others. And although some of the learning is simply rote memorization of standard operating procedures, the Berkeley researchers found a constant search for better ways of doing things. Young officers come on board with new ideas and find themselves debating with the senior noncommissioned officers who have been with the ship for years and know what works. The collision of fresh, sometimes naive approaches with a conservative institutional memory produces a creative tension that keeps safety and reliability from degenerating into a mechanical following of the rules.

The Navy has managed to balance the lessons of the past with an openness to change and create an organization that has the stability and predictability of a tightly run hierarchy but that can be flexible when necessary. The result is an ability to operate near the edge, pushing both people and machines to their limits but remaining remarkably safe.

No Failure to Communicate

Of course, an aircraft carrier is a unique situation, and there is no reason to think that what works there would be effective in a commercial setting with civilian employees. But when the Berkeley project examined a completely different sort of high-reliability organization, the researchers tracked its success to a similar set of principles.

The Diablo Canyon nuclear power plant, operated by Pacific Gas & Electric, lies just west of San Luis Obispo, Calif., on the Pacific coast. Although its construction was dogged by controversy and ended up taking 17 years and costing $5.8 billion, the plant has by all accounts proved to be one of the country’s best run and safest since it opened in 1985.

Like the aircraft carriers, Diablo Canyon appears at first to be a rigidly run hierarchy, with a formal chain of command leading up to a plant manager who is also a vice president of Pacific Gas & Electric. And it has a thick stack (a tower, really) of regulations telling employees how to do their jobs. This is how the regulators want it. Since Three Mile Island, the Nuclear Regulatory Commission has tried to ensure safety by insisting that nuclear plants follow an even more detailed set of rules. Plants are rated according to how many times they violate the regulations, and a pattern of violations will lead to closer supervision by the NRC and fines that, in serious cases, can run into hundreds of thousands of dollars.

But Paul Schulman, a political scientist at Mills College in Oakland who has collaborated with La Porte, Roberts, and Rochlin, has found that Diablo Canyon has another side: a more active, probing, learning side. Despite the hierarchy and the regulations, the organization is constantly changing, questioning accepted practice and looking for ways to do things better. It is not the same sort of change found on aircraft carriers, where the steady turnover of personnel creates a cycle of learning the same things over and over again plus a gradual improvement of technique. Diablo Canyon maintains a relatively stable group of employees who know their jobs well. Nonetheless, the nuclear plant is as dynamic as the carrier.

The reason, Schulman says, is that the plant has cultivated an institutional culture rooted in the conviction that nuclear plants will always surprise you. The result is two sets of decision-making procedures at the plant. The first, and more visible, consists of well-established rules for what to do in a particular situation. Some are carried out by computer, others by people. In general, Schulman says, this set of rules is designed to guard against errors of omission: people not doing something that they should.

But Diablo Canyon employees also work hard to avoid errors of commission: actions that have unexpected consequences. Because a nuclear plant is so complex, employees must constantly think about what they’re doing to avoid causing the system to do something unexpected and possibly dangerous.

This means that although the plant is constantly adding to its standard procedures as people learn more about the right approaches and spot new ways that things might go wrong, no one believes the organization will ever be able to write everything down in a book. Thus the plant management chooses employees partly on the basis of how well they will fit into such a flexible, learning-oriented culture. The least desirable employee, Schulman reports, is one who is too confident or stubborn.

This sort of continuous learning and improvement would not be possible if the Diablo Canyon organization were strictly hierarchical. Hierarchies may work for systems that are “decomposable” (that is, systems that can be broken into autonomous units), but a nuclear plant is, by its nature, tightly coupled. A modification to the steam generators can have implications for the reactor, or a change in maintenance procedures may affect how the system responds to the human operators. Because of this interdependence, the various departments in the plant must communicate and cooperate with one another directly, not through bureaucratic channels.

Constant Learning: The Blessings of Ambiguity

Members of the Berkeley project have studied not just aircraft carriers and nuclear power plants but also air traffic control systems and the operation of large electric power grids, and they detect a pattern.

A layered organizational structure, for instance, seems to be basic to the effectiveness of these institutions. Depending on the demands of the situation, people will organize themselves into different patterns. This is quite surprising to organizational theorists, who have generally believed that organizations assume only one structure. Some groups are bureaucratic and hierarchical, others professional and collegial, still others are emergency-response, but management theory has no place for an organization that switches among them according to the situation.

The realization that such organizations exist opens a whole new set of questions: How are such multi-layered organizations set up in the first place? And how do the members know when it’s time to switch from one mode of behavior to another? But the discovery of these organizations may also have practical implications. Although La Porte cautions that his group’s work is “descriptive, not prescriptive,” the research may still offer some insights into avoiding accidents with other complex and hazardous technologies.

In particular, high-reliability organizations seem to provide a counterexample to Yale sociologist Charles Perrow’s argument that some technologies, by their very nature, pose inherent contradictions for the organizations running them. Concerning technologies such as nuclear power and chemical plants, Perrow writes: “Because of the complexity, they are best decentralized; because of the tight coupling, they are best centralized. While some mix might be possible, and is sometimes tried (handle small duties on your own, but execute orders from on high for serious matters), this appears to be difficult for systems that are reasonably complex and tightly coupled, and perhaps impossible for those that are highly complex and tightly coupled.” But if Diablo Canyon and the aircraft carriers are to be believed, such a feat is not impossible at all. Those organizations show that operations can be both centralized and decentralized, hierarchical and collegial, rule-bound and learning-centered.

Besides the layered structure, high-reliability organizations emphasize constant communication far in excess of what would be thought useful in normal organizations. The purpose is simple: to avoid mistakes. On a flight deck, everyone announces what is going on as it happens to increase the likelihood that someone will notice, and react, if things start to go wrong. In an air traffic control center, although one operator is responsible for controlling and communicating with certain aircraft, he or she receives help from an assistant and, in times of peak load, one or two other controllers. The controllers constantly watch out for one another, looking for signs of trouble, trading advice, and offering suggestions for the best way to route traffic.

Poor communication and misunderstanding, often in the context of a strict chain of command, have played a prominent role in many technological disasters. The Challenger accident was one, with the levels of the space shuttle organization communicating mostly through formal channels, so that the concerns of engineers never reached top management. The 1982 crash of a Boeing 737 during takeoff from Washington National Airport, which killed 78 people, was another. The copilot had warned the captain of possible trouble several times (icy conditions were causing false readings on an engine-thrust gauge), but the copilot had not spoken forcefully enough, and the pilot ignored him. The plane crashed into a bridge on the Potomac River.

When a 747 flown by the Dutch airline KLM collided with a Pan Am 747 on a runway at Tenerife airport in the Canary Islands in 1977, killing 583 people, a post-crash investigation found that the young copilot thought that the senior pilot misunderstood the plane’s position but assumed the pilot knew what he was doing and so clammed up. And the Bhopal accident, in which thousands of people died when an explosion at an insecticide plant released a cloud of deadly methyl isocyanate gas, would never have happened had there been communication between the plant operators, who began flushing out pipes with water, and the maintenance staff, which had not inserted a metal disk into the valve to keep water from coming into contact with the methyl isocyanate in another part of the plant.

Besides communication, high-reliability organizations also emphasize active learning: employees not only know why the procedures are written as they are but can challenge them and look for ways to make them better. The purpose behind this learning is not so much to improve safety, although this often happens, but to keep the organization from regressing. Once people begin doing everything by the book, operations quickly go downhill. Workers lose interest and become bored; they forget or never learn why the organization does things certain ways; and they begin to feel more like cogs in a machine than integral parts of a vibrant institution. Effective organizations need to find ways to keep their members fresh and focused on the job at hand.

Any organization that emphasizes constant learning will have to tolerate a certain amount of ambiguity, Schulman notes. There will always be times when people are unsure of the best approach or disagree even on what the important questions are. This may be healthy, Schulman says, but it can also be unsettling to managers and employees who think a well-functioning organization should always know what to do. He tells of a meeting with Diablo Canyon managers at which he described some of his findings. “What’s wrong with us that we have so much ambiguity?” one manager asked. The manager had completely missed the point of Schulman’s research. A little ambiguity was nothing to worry about. Instead, the plant’s managers should be concerned if they ever thought they had all the answers.

Schulman offers one more observation about high-reliability organizations: they do not punish employees for making mistakes when trying to do the right thing. Punishment may work, or at least not be too damaging, in a bureaucratic organization where everyone goes by the book, but it discourages workers from learning any more than they absolutely have to, and it kills communication.

If an organization succeeds in managing a technology so that there are no accidents or threats to the public safety, it may face an insidious threat: call it the price of success. The natural response from the outside (whether upper management, regulators, or the public) is to begin to take that performance for granted. And as the possibility of an accident seems less and less real, the cost of eternal vigilance seems harder and harder to justify.

But organizational reliability, though expensive, is just as crucial to the safety of a technology as is the reliability of the equipment. If we are to keep our technological progress from backfiring, we must be as clever with our organizations as we are with our machines.
