Software disasters often are people problems

Monday, October 18, 2004

The consequences of software failures are rarely trivial.

By Matthew Fordahl ~ The Associated Press

SAN JOSE, Calif. -- New software at Hewlett-Packard Co. was supposed to get orders in and out the door faster at the computer giant. Instead, a botched deployment cut into earnings in a big way in August and executives got fired.

Last month, a system that controls communications between commercial jets and air traffic controllers in southern California shut off because some maintenance had not been performed. A backup also failed, triggering potential peril.

Such disasters are often blamed on bad software, but the cause is rarely bad programming. As systems grow more complicated, failures instead have far less technical explanations: bad management, communication or training.

"In 90 percent of the cases, it's because the implementer did a bad job, training was bad, the whole project was poorly done," said Joshua Greenbaum, principal analyst at Enterprise Applications Consulting in Berkeley. "At which point, you have a real garbage-in, garbage-out problem."

As governments, businesses and other organizations become more reliant on technology, the consequences of software failures are rarely trivial. Entire businesses -- and even lives -- are at stake.

"The limit we're hitting is the human limit, not the limit of software," Greenbaum said. "Technology has gotten ahead of our organizational and command capabilities in many cases."

Big software projects -- whether to manage supply chains, handle payroll or track inventory -- tend to begin with high expectations and the best intentions.

Often, however, the first step toward total disaster is taken before the first line of code is written. Organizations must map out exactly how they do business, and all of this must be clearly explained to a project's technical team.

"Mistakes hurt, but misunderstandings kill," said John Michelsen, chief executive of iTKO Inc., which makes software that helps companies manage big software projects and test them automatically as they're being developed.

Too often, he said, programmers are handed a lengthy document explaining the business requirements for a software project and left to interpret it.

"Developers are least qualified to validate a business requirement. They're either nerds and don't get it, or they're people in another culture altogether," said Michelsen, referring to cases where development takes place offshore.

The Dallas-based company's LISA software attempts to reduce the complexity of testing, so nontechnical executives in charge of major software projects can ensure the actual code adheres to their vision.

The lack of robust testing during and after such a project likely contributed to the Sept. 14 radio system outage over the skies of parts of California, Nevada and Arizona.

Though there were a handful of close calls, all 403 planes in the air during the incident managed to land safely, said FAA spokesman Donn Walker. A handful violated rules that dictate how close they are allowed to fly to each other -- but the FAA maintains there were no "near misses."

The genesis of the problem was the transition in 2001 by Harris Corp. of the Federal Aviation Administration's Voice Switching and Control System from Unix-based servers to Microsoft Corp.'s off-the-shelf Windows 2000 Advanced Server.

By most accounts, the move went well, except that the new system required regular maintenance to prevent data overload. When that wasn't done, it turned itself off as it was designed to do. But the backup also failed. In all, the southern California system was down for three hours, though other FAA centers restored communications within seconds, Walker said.

The FAA's investigation is continuing, and Harris Corp. did not return a call seeking comment.

Michelsen said the failure lay in inadequate testing.

"On a regular basis, the FAA should have been downing that primary system and watching that backup system come up," he said. "If it doesn't go up and stay up, they would have known they had a problem to fix long before they needed to rely on it."

Another common theme in failures lies in the ranks of employees who actually must use the systems.

Often they're not given proper training. There's also a chance that they don't want the project to succeed, especially if they see it as a threat to employment.

"It becomes a major role of (management) to kind of herd the cats in and make them all line up in a reasonable way," said Barry Wilderman, an analyst at the Meta Group. "That's why this stuff is so hard."
