
Super saver

The race is on to build a fast, global file system for supercomputers

By Joab Jackson, GCN Staff

Save often, especially when you run a supercomputer.

For Gary Grider, group leader of Los Alamos National Laboratory’s High Performance Computing Systems Integration Group, saving data is of the highest importance. He is part of a team that is developing what may well be the world’s fastest supercomputer, a petascale machine called Roadrunner with more than 32,000 processors. IBM Corp. is leading the effort.

Jobs simulating nuclear-weapon degradation could take months to run. If a single processor failed during a run (statistically likely, given the sheer number of CPUs used), the work would be corrupted. So, naturally, the lab wants to save often, just as you might do with your PC. But in this case, the procedure involves frequently saving terabytes of data as quickly as possible, which is no small feat.
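To see why a failure somewhere in the machine becomes likely over a long job, here is a rough sketch in Python. It assumes processors fail independently at a constant rate. The one-million-hour mean time between failures is a hypothetical figure chosen for illustration, not a number from Los Alamos or IBM, and the function name is likewise made up for this example.

```python
import math

def prob_any_failure(num_cpus, job_hours, mtbf_hours):
    """Chance that at least one CPU fails during the job.

    Assumes independent failures at a constant rate (exponential model),
    which is a simplification for illustration only.
    """
    expected_failures = num_cpus * job_hours / mtbf_hours
    return 1.0 - math.exp(-expected_failures)

# Hypothetical inputs: 32,000 processors, a 30-day job,
# and an assumed per-processor MTBF of 1,000,000 hours.
p_cluster = prob_any_failure(32_000, 24 * 30, 1_000_000)
p_single = prob_any_failure(1, 24 * 30, 1_000_000)
print(f"single CPU: {p_single:.4%}, whole machine: {p_cluster:.4%}")
```

Under these assumed numbers, a single processor almost never fails within the month, while the machine as a whole is all but certain to lose at least one. That gap is the case for saving often.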

That’s why Los Alamos specified that data must be able to flow back from the processors to the storage arrays at an unprecedented 50 Gbps, far beyond the capability of any single storage cluster. Running multiple storage arrays in parallel would do the trick, but that approach requires advanced techniques for coordinating the storage and management of data.
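The arithmetic behind that requirement can be sketched in a few lines of Python. Only the 50 Gbps target comes from the lab's specification as reported here. The 10-terabyte checkpoint size and the 4 Gbps per-array throughput are hypothetical values picked for illustration, and both function names are invented for this example.

```python
import math

def checkpoint_seconds(size_terabytes, link_gbps):
    """Seconds to move a checkpoint at a sustained rate (decimal units)."""
    bits = size_terabytes * 1e12 * 8
    return bits / (link_gbps * 1e9)

def arrays_needed(target_gbps, per_array_gbps):
    """Minimum number of arrays, written in parallel, to reach the target rate."""
    return math.ceil(target_gbps / per_array_gbps)

# 50 Gbps is the reported target; the other inputs are assumptions.
seconds = checkpoint_seconds(10, 50)
arrays = arrays_needed(50, 4)
print(f"{seconds:.0f} s per checkpoint, at least {arrays} arrays in parallel")
```

The sketch ignores coordination overhead, which is exactly the cost a parallel file system has to keep low when it spreads one save across many arrays.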

Roadrunner isn’t alone in facing this challenge. “You can easily put a lot of CPU power in the room, but to do useful work, you also need very good I/O,” said Mike Gigante, SGI’s engineering director of file-serving technologies. “Unfortunately, many people don’t think about the I/O until the CPU is set up, and they realize that the overall utilization efficiency of their computer is very low.”

File here
Managing a computer’s data is the job of the file system, and agencies, volunteer bodies and industry are working on a new generation of file systems, often called global or parallel file systems, that can support machines such as Roadrunner. The challenge is picking the right one for the job.

In many ways, Energy Department laboratories have been a driving force behind the development of global file systems. In 1994, Energy labs banded together to develop Lustre, a file system designed specifically for the upcoming supercomputer deployments. “We didn’t see anyone out there who had what we wanted,” Grider said.


