Contact

Dr. Christian Engelmann
Oak Ridge National Laboratory
One Bethel Valley Road, Oak Ridge, TN 37831-6173, USA
+1 (865) 574-3132
+1 (865) 576-5491
engelmannc@ornl.gov

Abstract

Dr. Christian Engelmann is a Research and Development Associate in the System Research Team of the Computer Science Research Group in the Computer Science and Mathematics Division at Oak Ridge National Laboratory (ORNL). He holds a PhD and an MSc in Computer Science from the University of Reading, UK, and a German Certified Engineer diploma (MSc equivalent) in Computer Systems Engineering from the Technical College for Engineering and Economics (FHTW) Berlin.

Dr. Engelmann's work focuses on software research and development for next-generation extreme-scale high-performance computing (HPC) systems. As part of the System Research Team at ORNL and in collaboration with other laboratories and universities, his research aims to provide high-level reliability, availability, and serviceability (RAS) for next-generation supercomputers, improving their resiliency (and ultimately efficiency) through novel high availability and fault tolerance system software solutions. He also works on core system software technologies that enable "plug-and-play" supercomputing, which offers transparent portability of software to eliminate most of the software modifications caused by diverse supercomputing platforms and supercomputing system upgrades.

Past research by Dr. Engelmann included work on a pluggable lightweight heterogeneous Distributed Virtual Machine (DVM) environment, where clusters of personal computers, workstations, and supercomputers can be aggregated to form one giant DVM (in the spirit of its widely used predecessor, Parallel Virtual Machine (PVM)). Further past work was part of a Cooperative Research and Development Agreement (CRADA) with IBM that focused on a new generation of scientific algorithms (super-scalable algorithms) to address the challenges in scalability and fault tolerance for extreme-scale supercomputers, such as the IBM Blue Gene/L system.

News

Dr. Engelmann recently attended a meeting of ORNL's System Research Team with Josh Simons, Tim Marsland, Greg Lavender, and Rebecca Arney from Sun Microsystems discussing virtualization in HPC (insideHPC article, Josh Simons' blog).
Dr. Engelmann has been awarded a PhD degree in Computer Science from the University of Reading, UK.
Jaguar, a Cray XT5 system at ORNL, is #2 in the Top 500 List of Supercomputer Sites. It achieves a maximal LINPACK performance of 1.059 PFlop/s with a theoretical peak performance of 1.3814 PFlop/s. Jaguar is the second HPC system to exceed 1 PFlop/s (10¹⁵ Floating Point Operations Per Second), and the fastest open science supercomputer in the world.
Dr. Engelmann was a member of the ORNL research exhibit at the ACM/IEEE International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2008.
Dr. Engelmann was funded by the U.S. Department of Energy's Institute for Advanced Architecture and Algorithms (IAA) to work on the Scalable Algorithms for Petascale Systems with Multicore Architectures project.
Dr. Engelmann was funded by the U.S. Department of Energy's Forum to Address Scalable Technology for Runtime and Operating Systems to work on the Reliability, Availability, and Serviceability (RAS) for Petascale High-End Computing and Beyond project.

Upcoming Presentations

February 14-18: A Tunable Holistic Resiliency Approach for High-Performance Computing Systems. Poster at the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP) 2009, Raleigh, NC, USA. Presented by Geoffroy R. Vallée.
February 16-18: The Case for Modular Redundancy in Large-Scale High Performance Computing Systems. Paper at the 27th IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN) 2009, Innsbruck, Austria.
February 18-20: Proactive Fault Tolerance Using Preemptive Migration: Model and Classification. Paper at the 17th Euromicro International Conference on Parallel, Distributed, and network-based Processing (PDP) 2009, Weimar, Germany.
March 9-12: Proactive Fault Tolerance Using Preemptive Migration: Model and Classification. Paper at the 10th LCI International Conference on High-Performance Clustered Computing (LCI) 2009, Boulder, CO, USA. Presented by Hong H. Ong.

Conference Deadlines

February 9: 4th IEEE International Conference on Networking, Architecture, and Storage (NAS) 2009, Zhang Jia Jie, China, July 9-11, 2009.
February 25: 2nd International Workshop on Resiliency in High Performance Computing (Resilience) 2009, Munich, Germany, June 9-10, 2009.
March 1: 20th International Conference on Database and Expert Systems Applications (DEXA) 2009, Linz, Austria, August 31 - September 4, 2009.

Select Publications

Journal Publications

Xubin (Ben) He, Li Ou, Martha J. Kosa, Stephen L. Scott, and Christian Engelmann. A Unified Multiple-Level Cache for High Performance Cluster Storage Systems. International Journal of High Performance Computing and Networking (IJHPCN), volume 5, number 1-2, pages 97-109, 2007. Inderscience Publishers, Geneve, Switzerland. ISSN 1740-0562.
Christian Engelmann, Stephen L. Scott, Chokchai (Box) Leangsuksun, and Xubin (Ben) He. Symmetric Active/Active High Availability for High-Performance Computing System Services. Journal of Computers (JCP), volume 1, number 8, pages 43-54, 2006. Academy Publisher, Oulu, Finland. ISSN 1796-203X.
Christian Engelmann, Stephen L. Scott, David E. Bernholdt, Narasimha R. Gottumukkala, Chokchai (Box) Leangsuksun, Jyothish Varma, Chao Wang, Frank Mueller, Aniruddha G. Shet, and Ponnuswamy (Saday) Sadayappan. MOLAR: Adaptive Runtime Support for High-End Computing Operating and Runtime Systems. ACM SIGOPS Operating Systems Review (OSR), volume 40, number 2, pages 63-72, 2006. ACM Press, New York, NY, USA. ISSN 0163-5980.

Conference Publications

Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Proactive Process-Level Live Migration in HPC Environments. In Proceedings of the IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis (SC) 2008, Austin, TX, USA, November 15-21, 2008. ACM Press, New York, NY, USA. ISBN 978-1-4244-2835-9.
Arun B. Nagarajan, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Proactive Fault Tolerance for HPC with Xen Virtualization. In Proceedings of the 21st ACM International Conference on Supercomputing (ICS) 2007, pages 23-32, Seattle, WA, USA, June 16-20, 2007. ACM Press, New York, NY, USA. ISBN 978-1-59593-768-1.
Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance. In Proceedings of the 21st IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2007, pages 1-10, Long Beach, CA, USA, March 26-30, 2007. ACM Press, New York, NY, USA. ISBN 978-1-59593-768-1.
Jyothish Varma, Chao Wang, Frank Mueller, Christian Engelmann, and Stephen L. Scott. Scalable, Fault-Tolerant Membership for MPI Tasks on HPC Systems. In Proceedings of the 20th ACM International Conference on Supercomputing (ICS) 2006, pages 219-228, Cairns, Australia, June 28-30, 2006. ACM Press, New York, NY, USA. ISBN 1-59593-282-8.
Christian Engelmann and George A. (Al) Geist. Super-Scalable Algorithms for Computing on 100,000 Processors. In Lecture Notes in Computer Science: Proceedings of the 5th International Conference on Computational Science (ICCS) 2005, Part I, pages 313-320, Atlanta, GA, USA, May 22-25, 2005. Springer Verlag, Berlin, Germany. ISBN 978-3-540-26032-5. ISSN 0302-9743.