Rebooting on Mars

Monday, February 16, 2004

ALAMEDA, Calif. -- It's a PC user's nightmare: You're almost done with a lengthy e-mail, or about to finish a report at the office, and the computer crashes for no apparent reason. It tries to restart but never quite finishes booting. Then it crashes again. And again.

Getting caught in such a loop is frustrating enough on Earth. But imagine what it's like when the computer is more than 100 million miles away on Mars. That's what mission controllers faced when the Mars rover Spirit stopped communicating last month.

Ultimately, the fix that saved Spirit wasn't that different from how a PC would be repaired on Earth. It's just that the folks who have their hardware on Mars -- and the eyes of the world on them -- are better prepared for disaster.

Tech support for an $820 million mission is a cautious affair. Tools to recover from and fix any problem must be built into the system before launch. The systems' behaviors need to be completely understood and predictable.

"Luckily, during the design period, we anticipated that we might get into a situation like this," said Glenn Reeves, who oversees the software aboard the Mars rovers Spirit and Opportunity at NASA's Jet Propulsion Laboratory.

For stability, reliability and predictability, mission designers did not bust the budget and design the hardware or software from scratch. Instead, they turned to hardware and software that's been used in space before and has a proven track record on Earth as well.

"The advantage of using commercial software is it's well-known," said Mike Deliman, an engineer at Alameda-based Wind River Systems Inc., which made the rovers' operating system. "It has been used throughout the world in hundreds of thousands of applications."

The operating system, VxWorks, has its roots in software developed to help Francis Ford Coppola gain more control over a film editing system. But the developers, David Wilner and Jerry Fiddler, saw a greater potential and eventually formed Wind River, named for the mountains in Wyoming. VxWorks became a formal product in 1987.

Items that can't afford failure

The operating system is embedded in systems that control jetliners and atomic colliders, anti-lock braking systems in cars and even heart pacemakers. It's also been used successfully in the Mars Pathfinder lander, Mars Odyssey orbiter and Stardust comet probe.

"These are all things that can't afford to fail," Deliman said.

A key advantage VxWorks has over Microsoft Corp.'s Windows or the Unix operating system is that it is nimble enough to react quickly to any scenario that might crop up.

"If your heartbeat goes irregular, you don't want it to take five minutes to figure out that your heartbeat has gone irregular," Deliman said in his office filled with computers, an empty fish tank and a few dog toys. "You want to be able to catch it right off the bat."

That kind of guaranteed quick response is simply not available yet in Windows or Unix.

"I'm sure you've done things with Windows and perhaps gone off to go get a drink in the fridge, made a sandwich and come back and it's still waiting," Deliman said. "It's similar to Unix. Unix can take its sweet time about getting back to you what you want it to do."

VxWorks operates within only 32 megabytes of random access memory, and parts of it can be modified remotely without having to restart the entire system. (Windows users also can have fixes automatically sent, but restarts are very often required.)

VxWorks also can be tweaked to accommodate different hardware, said Deliman, who started working with JPL while Pathfinder was under development in 1994.

In the rovers, the hardware is a single-board computer called the RAD6000. It was originally developed in the early 1990s by a division of IBM Corp., Air Force Research Labs and NASA's JPL. It's now owned by BAE Systems Inc., of Manassas, Va.

The RAD6000, except for its protection from radiation, is similar to IBM's RS6000 server, which was popular among businesses in the 1990s. Its processor is a predecessor of the PowerPC, used in Apple Computer Inc.'s Macintosh computers since 1994.

Today, there are 145 RAD6000s running on 77 satellites in space, said Vic Scuderi, manager of space programs at BAE Systems. It's so reliable, there's only one running on each rover. Like VxWorks, it was used aboard Mars Pathfinder and Stardust.

The computer, which costs up to $300,000, runs at a fraction of the speed of today's desktop computers. It also has other limits, such as just 128 megabytes of random access memory.

But Spirit and Opportunity carry more flash memory -- the same type used in digital cameras to store pictures -- than any other spacecraft.

That turned out to be part of the problem that temporarily halted Spirit in its tracks.

Gobbled up memory

All computers, through the operating system, need to keep track of their files, whether they're on a hard disk or, as in the case of the rovers, in flash memory. And each file requires a little bit of memory.

After seven months of cruising between Earth and Mars as well as a couple of weeks on the ground, thousands of files had accumulated in flash memory, gobbling up the 32 megabytes allocated for the operating system.

After more than two weeks on the ground, Spirit's computer reset itself. Over and over again. From the perspective of controllers on Earth, the device just stopped communicating.

Each time it tried to load its software, it maxed out the available memory, triggering an alarm and another reset. Eventually, the batteries drained, a scenario that activated a setting similar to "Safe Mode" on Windows PCs, where only essential files are loaded at startup.
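The cycle described above can be sketched as a toy simulation. This is only an illustration, not the rovers' flight software: the function names, the per-file memory cost, the battery percentages and the drain per reset are all hypothetical values chosen to make the loop visible. Only the 32-megabyte budget comes from the account above.

```python
RAM_BUDGET_BYTES = 32 * 1024 * 1024   # the 32 megabytes cited in the article
BYTES_PER_FILE = 4 * 1024             # hypothetical bookkeeping cost per file


def boot(file_count, diagnostic=False):
    """Return True if startup fits within the memory budget."""
    if diagnostic:
        # Diagnostic mode (assumed behavior): load essentials only and
        # skip the per-file bookkeeping, so startup always fits.
        return True
    return file_count * BYTES_PER_FILE <= RAM_BUDGET_BYTES


def run_until_stable(file_count, battery=100, drain_per_reset=10, low_battery=30):
    """Simulate repeated resets; return (mode, number_of_resets)."""
    resets = 0
    while True:
        # A drained battery is what switches the system into diagnostic mode.
        diagnostic = battery <= low_battery
        if boot(file_count, diagnostic):
            return ("diagnostic" if diagnostic else "normal", resets)
        # Startup ran out of memory: alarm, reset, and a little less battery.
        resets += 1
        battery -= drain_per_reset
```

With these made-up numbers, a system tracking 1,000 files starts normally, while one tracking 10,000 files keeps resetting until the battery falls to the low threshold and the diagnostic startup takes over.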

"When it came up in this diagnostic mode, we started bringing back data, and that's when we figured out what really happened," Deliman said.

Engineers acknowledged that the problem could have been caught in preflight testing, though that would have slowed development of a program already on a tight schedule.

"Consuming all the memory in this vehicle is what we consider to be a very severe error," JPL's Reeves acknowledged. "The software actually behaved exactly as we expected it to."
