BERKELEY, CA—The National Energy Research Scientific
Computing Center (NERSC) today announced a milestone in high-performance
computing: successfully stopping and restarting a number of scientific
computing jobs on a CRAY T3E supercomputer without any data processing
loss or discontinuity.
Called "checkpointing," the stop/restart
procedure was achieved twice in one week at NERSC and is believed to be
the first of its kind accomplished on a massively
parallel processing (MPP) supercomputer. Checkpointing has been
a major goal in the MPP community since the first parallel machine
was plugged in 10 years ago. C. William McCurdy, head of the Computing
Sciences organization at Ernest Orlando Lawrence Berkeley National
Laboratory, called NERSC’s checkpointing milestone "a remarkable
achievement."
Checkpointing maximizes system availability for
users and minimizes wasted compute cycles because no recomputation
is necessary after restarting. The process brings all of the programs
running on the computer to the same stage (or checkpoint) and stops
them, records all the information, transfers that information out
of the machine, then puts the information back in and gets it all running
again with no loss of processing time or data. The unfinished
applications resume from the point of interruption.
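The stop, record, restore and resume sequence described above can be illustrated with a minimal sketch. This is a toy, application-level example in Python and not Cray's software: the T3E feature works system-wide without changes to applications, and its internals are not described here. The names `run_job`, `checkpoint`, `restart` and `demo`, and the sum-of-squares workload, are hypothetical and chosen only for illustration.

```python
import os
import pickle
import tempfile


def run_job(state, stop_at):
    """Advance a toy job (a running sum of squares) until step `stop_at`."""
    while state["step"] < stop_at:
        state["total"] += state["step"] ** 2
        state["step"] += 1
    return state


def checkpoint(state, path):
    """Record the job state and move it out of the 'machine' (here, to disk)."""
    with open(path, "wb") as f:
        pickle.dump(state, f)


def restart(path):
    """Load the saved state back in so the job can pick up where it stopped."""
    with open(path, "rb") as f:
        return pickle.load(f)


def demo(total_steps=1000, interrupt_at=400):
    # Reference: one uninterrupted run of the job.
    uninterrupted = run_job({"step": 0, "total": 0}, total_steps)

    # Interrupted run: stop at the checkpoint, save, restore, then resume.
    path = os.path.join(tempfile.mkdtemp(), "job.ckpt")
    partial = run_job({"step": 0, "total": 0}, interrupt_at)
    checkpoint(partial, path)
    resumed = run_job(restart(path), total_steps)
    return uninterrupted, resumed
```

Calling `demo()` returns two states that should be identical, which is the property the release describes: the restarted job repeats no work and loses no data.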
While this feature has been available on Cray’s
vector systems for more than 12 years, the company made it available
on the T3E system in 1997. Checkpointing on MPP systems is significantly
more difficult because of the complexity of synchronizing up to
2,000 processors.
Although the ability to stop and restart a computer
system without data loss is important for any system, its value
grows with the size of the system. For example,
without checkpointing, when a single-processor system runs 12 hours
of computing work, is interrupted and cannot be restarted, it loses
12 hours of work. By contrast, when a 2,000-processor system
runs 12 hours and cannot be restarted, it loses 24,000 processor-hours of computing
work.
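The arithmetic behind that comparison is simply processors multiplied by hours of run time. A short sketch, assuming every processor is busy for the whole run (the function name `lost_processor_hours` is hypothetical):

```python
def lost_processor_hours(processors, hours_run):
    """Work lost when a run cannot be restarted, in processor-hours.

    Assumes all processors were busy for the entire run.
    """
    return processors * hours_run


single = lost_processor_hours(1, 12)      # single-processor system
large = lost_processor_hours(2000, 12)    # 2,000-processor system
```

Here `single` is 12 and `large` is 24,000, matching the figures in the example above.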
"As far as I know, no other MPP system vendor
is planning to have system-wide checkpoint/restart features without
having to reprogram applications," said Bill Kramer, deputy
director of NERSC. "Therefore, this is really a momentous step
for those of us in the high-performance computing community."
Successful checkpointing will allow the NERSC
staff to suspend system work with minimal disruption and downtime
for the hundreds of T3E users around the country, making NERSC an
even more valuable computational science resource.
"This signifies a major milestone in Cray’s
and NERSC’s commitment to provide robust, reliable MPP computing
cycles to DOE’s unclassified energy research community," said
Michael Declerck of the NERSC Systems Group. Declerck is the computer
scientist charged with putting the CRAY T3E-900 through its month-long
acceptance tests. The machine was delivered in mid-July. "We
think this is the first practical demonstration of checkpointing
in a working, MPP production environment. This functionality allows
NERSC to minimize disruption to scientific computing and provides
the center with the capability to run large and extremely long-running
jobs."
In addition to allowing "transparent"
maintenance and upgrades, the checkpointing software tool will allow
NERSC to efficiently move jobs between processors or make larger
pools of processors available for bigger jobs. The center will be
able to manage the transition from running large workloads
with many different applications to dedicating the system to a
single, complex problem that spans the full 512-processor system.
"MPP started as an experiment in a small
niche of the scientific research computing environment," said
Steve Reinhardt, CRAY T3E project director at Cray Research Inc.
"This achievement with the CRAY T3E illustrates Cray’s commitment
to production-quality, highly scalable computing. This expands the
applicability of MPP to a much wider range of industries and uses."
The successful checkpointing was made possible by software developed
by Cray Research Inc. in close collaboration with NERSC. The procedure
was successfully demonstrated on both of NERSC's CRAY T3E supercomputers,
the 512-processor and the 160-processor units. The checkpointing
was performed once to allow scheduled maintenance and a second time
to test advanced operating system features. The restarted jobs were
running on clusters ranging from 16 to 256 processors.
"After we completed the downtimes, all of
the user jobs on the machine were successfully restarted and the
machines were put back on line," said James Craw, head of the
NERSC Systems Group. "It's interesting that we achieved this
major milestone and none of our users noticed--which was our objective."
NERSC (http://www.nersc.gov),
established in 1974, provides high-performance computing services
to DOE’s Energy Research programs at national laboratories, universities
and industry. The facility has been located at Berkeley Lab since
May 1996.
Berkeley Lab (http://www.lbl.gov)
is a U.S. Department of Energy national laboratory located in Berkeley,
CA. It conducts unclassified research and is managed by the University
of California.