Seminar Announcement

Professor Jack Dongarra

University of Tennessee

"An Overview of High Performance Computing and Self Adapting Numerical Software"

Date: Tuesday, May 17, 2005
Time: 2:00 p.m.
Place: Building 453, Room 1001 (Armadillo Room)
P Clearance / Unclassified
Contact: Steven Lee, (925) 424-5989, or Linda Becker, (925) 423-0421

Sponsored by: ISCR and CASC.


Abstract:

In this talk we will look at how high performance computing has changed over the last ten years and at the trends that point toward its future. A new generation of software libraries and algorithms is needed for the effective and reliable use of (wide area) dynamic, distributed, and parallel environments. Some of the software and algorithm challenges have already been encountered, such as management of communication and memory hierarchies through a combination of compile-time and run-time techniques, but the increased scale of computation, depth of memory hierarchies, range of latencies, and increased run-time environment variability will make these problems much harder. As the number of processors in today's high performance computers continues to grow, the mean time to failure of these computers is becoming significantly shorter than the execution time of many current high performance computing applications. Although today's architectures are usually robust enough to survive node failures without suffering complete system failure, most of today's high performance computing applications cannot survive node failures; whenever a node fails, they must abort and restart either from the beginning or from a checkpoint kept on stable storage. Along these lines, we discuss work on the development of fault-tolerant linear algebra algorithms. We present an approach to building fault-survivable high performance computing applications using diskless checkpointing with FT-MPI. We give a detailed presentation on how to write a fault-survivable application with FT-MPI using diskless checkpointing, and we evaluate the performance overhead of our fault tolerance approach using a preconditioned conjugate gradient equation solver as an example. Experimental results demonstrate that we can survive a small number of simultaneous processor failures with low performance overhead and little numerical impact.

Email: dongarra@cs.utk.edu

Speaker's web page: http://www.netlib.org/utk/people/JackDongarra/

Institution web page: http://www.cs.utk.edu/

UCRL-MI-125922