When a lone disk dies, the system pulls data from other drives and writes it to the disk’s replacement slowly, so the supercomputer can continue working. If more failures occur among nearby drives, the rebuilding process speeds up to avoid the possibility that yet another failure occurs and wipes out some data permanently. Hillsberg says that the result is a system that should not lose any data for a million years, and that achieves this without any compromises on performance.
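The rebuild policy described above can be pictured with a small sketch. This is only an illustration of the idea, not IBM's implementation: the function name `rebuild_rate`, the 200 MB/s ceiling, and the 10 percent "slow" fraction are all assumptions made up for the example.

```python
def rebuild_rate(failed_nearby, max_rate_mb_s=200.0, slow_fraction=0.1):
    """Pick a rebuild bandwidth (MB/s) from the number of failed drives
    in the same neighborhood. All numbers here are hypothetical."""
    if failed_nearby <= 0:
        # Nothing to rebuild, so leave all bandwidth to the running jobs.
        return 0.0
    if failed_nearby == 1:
        # A lone failure: rebuild slowly so the supercomputer keeps working.
        return max_rate_mb_s * slow_fraction
    # Several nearby failures: rebuild at full speed before another
    # failure can destroy data permanently.
    return max_rate_mb_s


if __name__ == "__main__":
    for failures in range(4):
        print(failures, "failed ->", rebuild_rate(failures), "MB/s")
```

The point of the design is the trade-off: a slow rebuild costs little performance while the risk is low, and the system only spends full bandwidth on recovery once additional failures raise the stakes.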
The new system also benefits from a file system known as GPFS, which was developed at IBM Almaden to give supercomputers faster data access. It spreads individual files across multiple disks so that many parts of a file can be read or written at the same time. GPFS also enables a large system to keep track of its many files without laboriously scanning through every one. Last month a team from IBM used GPFS to index 10 billion files in 43 minutes, effortlessly breaking the previous record of one billion files scanned in three hours.
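Spreading a file across disks, often called striping, can be sketched in a few lines. The round-robin layout below is an assumption chosen for simplicity and is not a description of how GPFS actually places blocks; the names `stripe_location` and `disks_for_range` are hypothetical.

```python
def stripe_location(offset, block_size, num_disks):
    """Map a byte offset in a file to (disk index, block index on that disk),
    assuming blocks are dealt out to the disks in round-robin order."""
    block = offset // block_size
    return block % num_disks, block // num_disks


def disks_for_range(start, length, block_size, num_disks):
    """Return the set of disks touched by reading `length` bytes from `start`."""
    if length <= 0:
        return set()
    first_block = start // block_size
    last_block = (start + length - 1) // block_size
    return {b % num_disks for b in range(first_block, last_block + 1)}


if __name__ == "__main__":
    # A 12-byte read over 4-byte blocks on 3 disks touches every disk,
    # so the three blocks could be fetched at the same time.
    print(disks_for_range(0, 12, block_size=4, num_disks=3))
```

Because consecutive blocks land on different disks, a large read or write is served by many drives in parallel rather than queuing on one.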
Software improvements like those being developed for GPFS and disk recovery are crucial to enabling such giant data drives, says Hillsberg, because in order to be practical, they must become not only bigger, but also faster. Hard disks are not becoming faster or more reliable in proportion to the demands for more storage, so software must make up the difference.
IDC’s Conway agrees that faster access to larger data storage systems is becoming crucial to supercomputing—even though supercomputers are most often publicly compared on their processor speeds, as is the case with the global TOP500 list used to determine international bragging rights. Big drives are becoming important because simulations are getting larger and many problems are tackled using so-called iterative methods, where a simulation is run thousands of times and the results compared, says Conway. “Checkpointing,” a technique in which a supercomputer saves snapshots of its work in case the job doesn’t complete successfully, is also common. “These trends have produced a data explosion in the HPC community,” says Conway.
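Checkpointing is simple to illustrate, though the sketch below is a toy and says nothing about how any particular supercomputer does it. It assumes a job whose entire state fits in a small dictionary; the function names and the `stop_after` parameter, which simulates a job dying partway through, are hypothetical.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temporary file, then atomically rename it over
    `path`, so a crash mid-write never leaves a half-written checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path, default):
    """Return the last saved state, or `default` if no checkpoint exists."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, stop_after=None):
    """Sum the integers below `total_steps`, saving a snapshot after each step.
    If `stop_after` is set, the job stops early to simulate a failure."""
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < total_steps:
        if stop_after is not None and state["step"] >= stop_after:
            return None  # the job died; its progress survives on disk
        state["total"] += state["step"]
        state["step"] += 1
        save_checkpoint(path, state)
    return state["total"]
```

Running `run(path, 10, stop_after=4)` and then `run(path, 10)` with the same path resumes from the saved snapshot instead of starting over. In a real simulation each snapshot can be very large, which is part of why checkpointing adds so much to storage demand.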