TICK: Transparent Incremental Checkpointing at Kernel-level
OVERVIEW
The primary potential problem of frequent, automatic, and
user-transparent checkpointing and rollback recovery is the quantity
of generated checkpoint data. Frequent checkpointing of the large
memory footprints of scientific applications can quickly saturate
available bandwidth and fill nonvolatile storage.
TICK is designed to be a modular building block for implementing
checkpointing in large-scale Linux clusters. TICK is implemented
primarily as a Linux 2.6.11 kernel module, and consists of about 4,000
lines of code, with an additional 400 lines in the static part of the
kernel. TICK is neutral with respect to where checkpoint data is
saved: its function is to correctly capture and restore process state.
The actual management of the checkpoint data is handled by one or more
separate agents, which in our prototype implementation are other Linux
kernel modules. For lack of space this paper will not describe the
algorithms for data movement that can be used to implement a global
checkpointer for parallel applications. The checkpoint data may be
saved locally if a process restart is all that is needed, for example
after a machine crash, or to a file system when the instances of TICK
on each CPU of a cluster are globally coordinated.
TICK is available as public domain software. The current version
is a kernel patch for Linux 2.6.11.
TICK is described in more detail in the IEEE/ACM Supercomputing 2005 (SC05) paper.
FEATURES
While our primary goal is fault tolerance in large-scale parallel
computers, we believe that TICK could be useful in other environments
such as distributed or grid computing, and much more directly, for
load balancing via process migration. The essential properties of
TICK are that it is:
Kernel level: TICK is implemented at kernel
level to allow unrestricted access to processor registers, memory
allocation data structures, file descriptors, pending signals, etc.
Implemented as a kernel module: Writing, debugging and
maintaining kernel code can be time consuming and
non-portable. Most of TICK's code is in a kernel module that can
be loaded and removed dynamically.
General purpose: The TICK checkpoint/restart mechanism
works with any type of user process, and processes may be
restarted on any node with the same operating environment.
Flexibly initiated: The checkpointing mechanisms of TICK
can be triggered by a local event, such as a timer, or a remote
event, such as a global strobe, in a very short and bounded time
interval.
User transparent: The user processes are not involved in
the checkpointing or restarting and there is no need to modify
existing applications or libraries. This implies that TICK can
support existing legacy software, written in any language,
without any changes.
Efficient: TICK tries to minimize degradation of
performance of a user process when checkpointing its state. TICK
also implements fast process restarts.
Incremental: TICK can perform frequent incremental
checkpointing on demand.
Easy to use: TICK provides a simple interface based on
the /proc file system that can be used by a user or system
administrator to dynamically checkpoint or restart a user process
on demand.
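The incremental property listed above can be illustrated with a small user-level model. This is only a sketch under stated assumptions: this page does not describe how TICK detects modified pages inside the kernel, so the example substitutes a per-page hash comparison, and the names IncrementalCheckpointer, split_pages and restore are hypothetical and are not part of TICK. It also assumes a memory image of fixed size. The idea it shows is that the first checkpoint saves every page, while each later checkpoint saves only the pages that changed since the previous one.

```python
import hashlib

PAGE_SIZE = 4096  # 4KB pages, matching the page size used in the evaluation


def split_pages(memory: bytes, page_size: int = PAGE_SIZE) -> list[bytes]:
    """Split a memory image into fixed-size pages (the last one may be short)."""
    return [memory[i:i + page_size] for i in range(0, len(memory), page_size)]


class IncrementalCheckpointer:
    """Toy model (not TICK's mechanism): save only pages changed since the last checkpoint."""

    def __init__(self) -> None:
        # page index -> digest of the page contents at the last checkpoint
        self._digests: dict[int, bytes] = {}

    def checkpoint(self, memory: bytes) -> dict[int, bytes]:
        """Return {page index: contents} for pages modified since the last call."""
        delta: dict[int, bytes] = {}
        for index, page in enumerate(split_pages(memory)):
            digest = hashlib.sha256(page).digest()
            if self._digests.get(index) != digest:
                delta[index] = page
                self._digests[index] = digest
        return delta


def restore(checkpoints: list[dict[int, bytes]]) -> bytes:
    """Rebuild the image by replaying the full checkpoint, then each increment in order."""
    pages: dict[int, bytes] = {}
    for delta in checkpoints:
        pages.update(delta)
    return b"".join(pages[i] for i in sorted(pages))


# Usage: a 4-page image in which only page 1 is modified between checkpoints.
memory = bytearray(4 * PAGE_SIZE)
checkpointer = IncrementalCheckpointer()
full = checkpointer.checkpoint(bytes(memory))       # first checkpoint: all 4 pages
memory[PAGE_SIZE + 10] = 0xFF                       # dirty a single byte in page 1
increment = checkpointer.checkpoint(bytes(memory))  # increment: page 1 only
assert restore([full, increment]) == bytes(memory)
```

In this model the second checkpoint holds one page instead of four, which is why incremental checkpointing reduces the volume of checkpoint data when an application touches only part of its memory between checkpoints.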
PERFORMANCE
The usefulness of a tool such as TICK depends critically on its
performance. Our performance evaluation uses a set of scientific
applications: BT, LU and SP from the NAS Parallel Benchmarks, plus
Sage and Sweep3D. In previous work, Sage was found to be the most
demanding test for checkpointing algorithms because of its large
memory footprint and lack of data locality.
The experimental platform is a dual-processor AMD Opteron cluster.
Each processing node contains two AMD Opteron Model 246
processors, 3GB RAM, and a Seagate Cheetah 15K SCSI disk. The
proposed checkpoint/restart mechanisms have been implemented in
the Linux kernel version 2.6.11 with the page size configured to
4KB.
Figure: Runtime overhead of full checkpointing for various checkpoint
intervals for Sage and Sweep3D, when storing the checkpoints (a) to
main memory and (b) to the local disk.
When checkpoints are stored in main memory every minute, the worst-case runtime overhead is only 4%, observed with Sage-300MB. With disk checkpointing, the worst case is slightly higher, at 6%.
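To make these percentages concrete, the following sketch computes runtime overhead as the relative increase in run time, and the sustained write rate implied by a full checkpoint at a fixed interval. It rests on assumptions: the function names are hypothetical, the 1000-second baseline is an invented illustrative value and not a measurement, and "Sage-300MB" is taken to mean a memory footprint of roughly 300 MB.

```python
def runtime_overhead_percent(baseline_s: float, checkpointed_s: float) -> float:
    """Relative run-time increase caused by checkpointing, in percent."""
    return 100.0 * (checkpointed_s - baseline_s) / baseline_s


def full_checkpoint_rate_mb_per_s(footprint_mb: float, interval_s: float) -> float:
    """Average data rate needed to save the whole footprint once per interval."""
    return footprint_mb / interval_s


# Hypothetical run: 1000 s without checkpointing, 1040 s with it.
print(runtime_overhead_percent(1000.0, 1040.0))    # 4.0

# Assumed 300 MB footprint, fully checkpointed every 60 s.
print(full_checkpoint_rate_mb_per_s(300.0, 60.0))  # 5.0
```

Under these assumptions, a full checkpoint every minute requires an average of about 5 MB/s per process. The required rate grows in inverse proportion to the interval, so shortening it to a few seconds makes full checkpoints far more demanding.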
TICK provides high responsiveness: the checkpoint can be triggered by an external event such as a global heartbeat in as little as 2.5 microseconds. It provides several mechanisms to implement incremental checkpointing at fine granularity with little overhead. It is also very modular, and allows quick prototyping of distributed checkpointing algorithms. The experimental results, obtained on a state-of-the-art cluster, show that TICK can be used as a building block for various checkpointing algorithms. We have demonstrated that with TICK it is possible to implement frequent incremental checkpointing, with intervals of just a few seconds, with a run-time increase that is less than 10% in most configurations.