Field Work Proposal FY2001


High Performance Computing Systems Research

KJ-01-01-02

20b. Publications

"The Portable Batch Scheduler and the Maui Scheduler on Linux Clusters"
B. Bode, D. M. Halstead, R. Kendall, Z. Lei, and D. Jackson
Proceedings from the 4th Annual Linux Conference, pp. 217-224 (2000).

"Parameterized Heuristics for Autonomous Adaptive Routing in Large Networks"
A.R. Mikler, V.G. Honavar, and J.S.K. Wong
The Journal of Systems and Software (accepted, 2001).

"Intelligent Mobile Agents in Large Distributed Autonomous Cooperative Systems"
J.S.K. Wong and A.R. Mikler
The Journal of Systems and Software, Special Issue: Software Engineering and Systems in the New Millennium. 47, pp. 75-87 (1999).

"High Performance Computational Chemistry; an Overview of NWChem a Distributed Parallel Application"
R.A. Kendall, E. Aprà, D.E. Bernholdt, E.J. Bylaska, M. Dupuis, G.I. Fann, R.J. Harrison, J. Ju, J.A. Nichols, J. Nieplocha, T.P. Straatsma, T.L. Windus, and A.T. Wong
Computer Physics Communications, 128, pp. 260-283 (2000).

"Implementation and Performance Evaluation of a Computational-Intensive Climate Simulation Application"
H. Wang, G. Takle, G. Prabhu, and R. Todi
Proceedings of ParCo99, Netherlands, August 1999.

"Distributed Simulation over Loosely Coupled Domains"
A. Boukerche, A. Fabbri, and A.R. Mikler
Proceedings of the Fourth IEEE International Workshop on Distributed Simulation and Real-Time Applications. San Francisco, CA, August 2000, pp. 18-25.

"Design and Implementation of a Mobile Agent Infrastructure"
A.S. Hopper, A.R. Mikler, and J. Mayes
Proceedings of the Third Workshop on Distributed Communities on the Web (DCW 2000), Quebec City, June 2000. Lecture Notes in Computer Science 1830 (Kropf et al., Eds.), pp. 192-201, Springer-Verlag, Berlin.

"Parallel Distributed Event Simulation Across Loosely Coupled Domains - Experimental Results"
A.R. Mikler, and A. Fabbri
Proceedings of the High-Performance Computing Symposium HPC 2000, (A.Tentner, ed.), Washington D.C., April 2000, pp. 274-279.

"A Generalized Portable SHMEM Library for High Performance Computing"
K. Parzyszek, J. Nieplocha and R.A. Kendall
Proceedings of the IASTED Parallel and Distributed Computing and Systems 2000 conference, Las Vegas, Nevada, November 2000 (M. Guizani and X. Shen, Eds.), pp. 401-406, IASTED, Calgary.

Invited Talks and Posters

"Networking and Message Passing"
D.M. Halstead
Extreme Linux Developers Forum, Santa Fe, March 2000.

"PBS and the Maui Scheduler on Linux Clusters"
B.M. Bode
USENIX Extreme Linux Technical Conference, Special Session on Cluster Computing, Atlanta, Georgia, October 2000.

"Scalable parallel applications"
D.M. Halstead
USENIX Extreme Linux Technical Conference, Special Session on Cluster Computing, Atlanta, Georgia, October 2000.

"A Generalized Portable SHMEM Library for High Performance Computing"
K. Parzyszek
Research Gems in the SC2000 Conference, Dallas, Texas, November 2000.

"Scalable Computing Applications on Clusters"
D.M. Halstead, B.M. Bode, D.E. Turner, and R.A. Kendall
SC2000 Conference, Dallas, Texas, November 2000.

20c. Purpose

* Enable scalable application development and parallel hardware utilization.
* Extend cluster computing to a wider community by enhancing communications protocols.
* Optimize cluster hardware and management methodologies.
* Investigate the viability of aggregating different clusters of homogeneous machines into a single load balanced super-cluster.
* Evaluate and implement high-speed communication networks in cluster applications.
* Promote parallelization methods to the wider computational community and help with the task of porting existing code to parallel environments.
* Develop a robust domain specific Application Programming Interface (API) for modular and fault tolerant application component development and amalgamation.
* Broaden the acceptance of scalable parallel computing by providing an online tutorial primer for message passing based programming.
* Lower the barrier to cluster installation and management within the scientific community by making the Cluster Cookbook available on the Web.
* Test, make available, augment and/or co-develop software infrastructures such as a portable SHMEM and parallel Input/Output (I/O) libraries that provide a robust application development environment targeted at cluster based supercomputers.
* Generate a portable parallel benchmark to investigate the benefits and limitations of programming models for clusters of Symmetric Multi-Processing (SMP) nodes.
* Investigate the overhead of a general compression library to reduce the storage and transmission overhead in Central Processing Unit (CPU)-rich, communication-poor clusters.

The Scalable Computing Laboratory (SCL) focuses on high performance computing with attention to its rapidly changing nature. "Scalable" means not just parallel processing, but also the fact that usage environments, prices and expectations change, and the underlying technology of the hardware changes at rates not seen in other areas of science. In contrast to efforts aimed at "make machine X solve problem Y and get performance Z," the SCL looks at the broader problem of how to make a range of machines solve a range of problems with a range of performance tradeoffs, so that the computational research that is done has lasting scientific value. We undertake research that does not rely on legacy software and existing habits, in the hope of more quickly discovering and exploiting computational breakthroughs.

Of vital importance is the awareness that the complexity of modeling a scientific problem should be centered on the problem itself, and not the implementation of the model. To this end the "scalability" of a given machine is determined not only by how many processors can be added, or how fast the communication fabric runs, but also by how well the holistic design lends itself to programming formalisms that promote code longevity and portability.

Providing a production computing environment that also affords a robust application development environment on cluster-based supercomputers requires a wide range of tools and software infrastructure that is far from complete. Although the research community has made great strides in many areas, the entire infrastructure still has weaknesses that necessitate further research and development. Moreover, applications must adapt to this changing environment and often force changes across the full gamut of the software infrastructure. This effort will therefore focus on a subset of tools, libraries, and components of the software infrastructure applicable to many applications. In general, computational chemistry and physics codes will be used as a medium to test new and existing software algorithms and workstation clusters; this will provide the general framework for improving application scalability.

The healthy overlap between students and researchers at Iowa State University (ISU) continues to provide a rich source of talented assistants to aid the SCL researchers. Of particular merit among these are Krishna Elango, Krzysztof Parzyszek, Brian Smith, Zhou Lei and Yuri Alexeev (Graduate Assistants), and Chris Csanady, Doug Fuller and Lisa Sels (Undergraduate Assistants). In addition, the SCL is engaged in several collaborative efforts with former members of the research staff who assumed associate status when they left to fill university faculty positions elsewhere, including Dr. Armin Mikler (University of North Texas, working on the theory of clustering clusters) and Dr. Quinn Snell (Brigham Young University, working on the cluster scheduler). Finally, two new collaborations involve researchers at Pacific Northwest National Laboratory (PNNL) (Dr. David Dixon, Dr. Jarek Nieplocha, Dr. Jeffrey Nichols and Dr. Theresa Windus) and Argonne National Laboratory (ANL) (Remy Evard and J. P. Navarro).

20e. Approach

Key Personnel - FY2001
D. M. Halstead (PI) (1.0 FTE); R. A. Kendall (PI) (1.0 FTE); D. E. Turner (0.2 FTE)

Contributing Investigators
B. M. Bode; J. W. Evans; D. E. Heller; A. Mikler


20f. Technical Progress

Cluster Architecture

Halstead, Smith, Csanady and Fuller have performed intensive research into the optimal node architecture and interconnect specifications for the 64 dual-processor machine cluster. By using a switched Fast Ethernet interconnect with a 45-Gigabit backplane and 64 dual-processor machines with a total of 16 GigaBytes of RAM, a low-cost, yet powerful and adaptable parallel computer has been constructed. The issues identified and addressed include the speed and reliability of the compute node memory used (Fuller), the latency and throughput of the interconnect (Csanady), the specifications of the environment required to house the machine (Halstead), and the cluster design characteristics needed for ease of use (Smith). In addition, the optimization of the network hardware drivers has proven to be a productive topic of investigation.

SCM

The Scalable Cluster Model (SCM) was developed within the SCL (Halstead, Smith and Csanady) to obviate the arduous task of installing cluster nodes. The process is now to the level of sophistication that a blank compute node can be installed from scratch in under seven minutes simply by entering its Ethernet address into a database, and starting the machine with a bootable floppy disk or by employing a network boot enabled Ethernet card. The machine will obtain its IP address from the DHCP (dynamic host configuration protocol)/BootP server, the system disk will be automatically formatted and the Operating System (OS) pulled down from the file server without user intervention. In addition to the simplification of replacing nodes, the SCM gives practical guidelines as to the most secure hardware/software configuration to use, and also dictates the degree of access users have to the cluster compute node, the development nodes and the file server.

Tools Cluster

The SCM has been utilized to configure a small test bed cluster of eight nodes connected by both Fast Ethernet, and Myrinet. A trusted user can reboot the nodes in this cluster into any one of four different operating system environments. The intent of this is to allow researchers to verify the performance, stability and portability of their software tools and applications before release.

Communications Cluster

As the field of cluster computing becomes more mature, the range of custom-built System Area Networks (SANs) has become more diverse. It is financially impractical to install all of these interconnect solutions on one of the large clusters, so a set of eight machines has been configured as a communications evaluation testbed. The communications fabrics under consideration include Fast and Gigabit Ethernet, Myrinet, cLAN by GigaNet and SCI by Dolphin systems. These communications fabrics are being studied for their suitability, not only for message passing and remote file access, but also in the important tasks of job initiation, network based node booting, hardware status monitoring, and for providing a distributed file system architecture. This cluster has recently been expanded to dual processors in each machine to allow for SMP message passing research in addition to inter-node communications.

IBM Shared University Research Grant

The cluster of 24 high-end multiprocessor IBM workstations is being used, amongst other things, to investigate high-speed communications using Gigabit Ethernet. The standard Ethernet Maximum Transmission Unit (MTU) is defined in the IEEE 802.3 specification to be 1,500 bytes. This packet size was set in the early days of 10 Mbit Ethernet and has not changed for Fast Ethernet or even Gigabit Ethernet. Using this MTU value, the maximum throughput, even for large messages, is limited to less than 300 Mbit/s for all hardware and software tested to date using TCP/IP. Bode has configured the software driver for Gigabit Ethernet hardware that supports a larger frame size of 9,000 bytes, and with careful modification of the buffer and memory allocations, the throughput for large messages (>1 MByte) rises to more than 800 Mbit/s. This is a substantial breakthrough in the evaluation of price/performance for Gigabit Ethernet over Fast Ethernet in cluster applications. Clearly, the success of this project and IBM's continuing interest in it is due, in large part, to the expertise of Dr. Bode and his very strong positive interactions with IBM personnel.
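
As an illustration of the application-level side of such tuning, the C sketch below enlarges the per-socket send and receive buffers so that a single TCP stream can keep a high bandwidth-delay-product link full. The helper name and the 4 MB buffer size are illustrative assumptions and do not reproduce the actual driver or kernel settings used on the cluster; the 9,000-byte frame size itself is configured on the network interface and driver, outside application code.

    /* Minimal sketch of application-side TCP buffer tuning for bulk transfers.
     * The helper name and the 4 MB buffer size are illustrative assumptions;
     * they do not reproduce the actual driver or kernel settings used here. */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    int make_bulk_socket(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0) {
            perror("socket");
            return -1;
        }

        /* Ask for large send/receive buffers so one TCP stream can keep a
         * high bandwidth-delay-product link full. */
        int bufsize = 4 * 1024 * 1024;
        if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bufsize, sizeof(bufsize)) < 0)
            perror("SO_SNDBUF");
        if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bufsize, sizeof(bufsize)) < 0)
            perror("SO_RCVBUF");

        /* Disable Nagle coalescing so short control messages are not delayed. */
        int one = 1;
        if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) < 0)
            perror("TCP_NODELAY");

        return fd;
    }

    int main(void)
    {
        int fd = make_bulk_socket();
        if (fd >= 0)
            close(fd);
        return 0;
    }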

One of the difficulties in the use of Jumbo Frames is the lack of Gigabit Ethernet switch support for this extension to the Ethernet standard. Indeed, until recently only a 9-port switch from Alteon supported jumbo frames. Using four of these 9-port switches, we explored the use of multilink trunking across the switches to aggregate them into one larger network. Unfortunately, for our primary target applications this results in significant bottlenecks that reduce overall cluster performance. We are continuing to work with Network Interface Card (NIC) and switch vendors, such as Alteon WebSystems, Extreme Networks, SysKonnect, and Xylan (formerly Packet Engines), to find a solution to this problem.

Through this research, it has become clear that the communication bottleneck is not inherently caused by the packetization process, but arises from the latency incurred by the synchronous Transmission Control Protocol/Internet Protocol (TCP/IP) process. Work is underway to investigate the use of streaming mode in transmissions to reduce this overhead to a minimum in addition to OS bypass research presented below. In addition, we are working towards channel bonding multiple Gigabit NICs to carry a single data stream.

G4 Cluster

The first phase of construction on the G4 cluster was completed with 16 single-processor G4 Macintosh computers. The system utilized the Yellow Dog version of the Linux OS put out by industry partner Yellow Dog Communications, Inc. In addition, we demonstrated good network performance using Fast Ethernet and Myrinet networks. The second phase of construction will complete the cluster with 32 nodes and a full Myrinet interconnect, as discussed in the Future Accomplishments section.

Compaq Cluster

Following the initiation of the External Technology program with Compaq Corporation and in collaboration with the Iowa State University Department of Chemistry, we have constructed a cluster consisting of high performance 667MHz Compaq Alpha workstations connected with a low latency, high bandwidth communications architecture to give a tightly coupled system for scientific computing. This system has been constructed so that it is also tightly coupled to the G4 Cluster and weakly coupled to the IBM Cluster allowing us to explore cross-platform parallelization in a mixed network environment. While the initial construction is completed, the high-speed Myrinet network is still being configured and is not yet in full use.

Network Emulation Tool

In the course of evaluating communications protocols and cluster interconnects, the issues of latency, reliability and throughput are of critical concern. Although we have several tools for measuring the performance of an existing end-to-end communications channel, it would be useful to have an instrument for controlling these parameters to simulate non-ideal communications conditions. We are working with researchers at the University of North Texas (Mikler, Mayes) to extend the functionality of the National Institute of Standards and Technology (NIST) Net project developed by Mark Carson. This appliance is placed inline between two communicating devices and can reshape the data stream by modifying such critical characteristics as delay, jitter, packet order, bandwidth, data loss, and data corruption. Each of these parameters is tunable via a graphical user interface and requires no modification of the application under study. Examination of issues involving data integrity and sequencing, and their impact on software performance, is of vital importance when using protocols such as the Virtual Interface Architecture (VIA) over standard Ethernet hardware, where reliability is not guaranteed in hardware.

Cluster Cookbook

Continuing the work of Helmer, Halstead, Bode and Turner have elucidated the process of building a cluster for parallel computation on our web site, available at http://www.scl.ameslab.gov/Projects/ClusterCookbook/. Designed to be a tutorial guide for groups that wish to build clusters from scratch, the cookbook examines the hardware and software components of a cluster, including the commodity computers, interconnect, operating system, compilers and message passing libraries. Background information is provided about each component, and benchmarking instructions are given to assist in evaluating hardware. Assembly instructions and guidelines are provided, and numerous pointers to related information and vendors are included as well. In addition, Turner has designed a beginner's guide to message passing in tutorial format that will help complete the online assistance offered by the SCL.

Cluster Workshops

The Scalable Computing Laboratory hosted the first workshop on commodity PC clusters at the Des Moines Marriott, April 10-12, 1997. This brought together multiple independent efforts scattered across the country to blend commodity parts into a supercomputing environment. By comparing and sharing approaches, much "re-inventing the wheel" was eliminated.

This first meeting spawned a series of meetings titled the Joint PC Commodity Cluster Computing (JPC4) workshops, which have been held each year and bring together innovators and researchers throughout DOE, the National Aeronautics and Space Administration (NASA), the university system and industry to compare notes in an informal, but technically aggressive, manner. The latest of these was hosted in October 1999 by DOE and NASA in Oak Ridge, Tennessee. It is expected that more than 200 participants will attend the next meeting. A member of the SCL (Halstead) is currently serving on the executive council of the JPC4 series.

SCL Infrastructure Upgrade

The network hardware and desktop environment used to provide connectivity and a work environment for SCL research staff has continued to be upgraded in a systematic manner. The main Ames Laboratory network was upgraded from a T1 connection to a T3 connection to ESnet, thereby increasing the offsite connectivity bandwidth by a factor of 30. All new machines within the SCL are connected to the Local Area Network (LAN) using Fast Ethernet and/or Gigabit Ethernet. The main LAN infrastructure was upgraded to a total of 12 Gigabit Ethernet ports and over 100 Fast Ethernet ports. The desktop environment has shifted away from SGI workstations towards lower cost personal computer platforms running UNIX, Windows or Mac OS, to reduce support costs and to broaden the available software base. The installation of workstations from vendors such as Compaq, IBM, and Sun will continue for users with more computationally demanding requirements. A user-operable backup machine has been configured to allow the burning of CDs (compact discs), which serve as both archival and transport media in place of tapes, increasing media shelf life and interoperability and reducing costs compared to traditional tape-based solutions. The use of wireless connectivity will be continued to provide high-speed network access to areas where landline access proves prohibitively expensive or of insufficient bandwidth, as well as within the laboratory building to facilitate the use of wireless-network-enabled portable computers.

SCL Computer Room Upgrade

The primary computer room for the SCL has a floor space of 340 square feet. In this raised-floor environment, redundant cooling is maintained, and all machines are under uninterruptible power provided by a 50 kVA battery backup unit. The 64 dual-processor machine cluster has enjoyed up-times in excess of 40 days in this higher quality environment, which attests to the stability of this computational solution in addition to the environmental controls. With the addition of the IBM Cluster, the G4 cluster and the Alpha cluster, a second computer room was constructed with a total raised-floor area of 600 square feet and redundant cooling. To reduce costs, both rooms share the same uninterruptible power supply (UPS). A third computer room is available in Spedding Hall, should the power or cooling resources of the Wilhelm Hall facilities become saturated.

SCL Firewall Installation

A border router has been installed at the external choke point for access to the SCL network. This configuration was modeled after the DMZ configuration installed at Oak Ridge National Laboratory, and allows for traffic filtering, intrusion detection, and external services provided by bastion hosts such as web, mail, file transfer protocol (FTP), telnet proxy and name service. This router also blocks direct communication with the main internal file servers for additional protection against brute-force denial-of-service attacks. The SCL already has a policy of installing TCP wrappers on its machines, so the impact of limiting incoming access will be minimized. Telnet and FTP have been deactivated on all SCL servers, and these machines now require the use of Kerberized Secure Shell (SSH) and Secure Copy (SCP) from trusted hosts before they allow a connection to be established.

Generalized Portable SHMEM (GPSHMEM) Library

Parzyszek and Kendall have identified and implemented the functionality available in the T3D version of the SGI/CRAY Shared Memory (SHMEM) library. This provides the basic operations that most applications on the NERSC T3E use. Initial testing and benchmarking show that the library adds minimal overhead on the T3E, e.g., GPSHMEM on top of Aggregate Remote Memory Copy Interface (ARMCI) on top of CRI SHMEM. Current efforts include the porting of the Stanford Parallel Applications for Shared Memory (SPLASH) shared memory benchmarks to this programming paradigm.
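
For readers unfamiliar with the programming model, the C sketch below shows the style of one-sided communication the library supports. It is written against the classic Cray SHMEM calling sequence that GPSHMEM generalizes; the header path, initialization call, and function names are assumptions borrowed from that interface, not the actual GPSHMEM entry points.

    /* Illustrative one-sided transfer in the SHMEM style GPSHMEM generalizes.
     * The header path, initialization call, and function names below follow
     * the Cray SHMEM convention and are assumptions; the actual GPSHMEM
     * entry points may be named differently. */
    #include <stdio.h>
    #include <mpp/shmem.h>

    #define N 8

    static long src[N];   /* symmetric objects: same address on every PE */
    static long dst[N];

    int main(void)
    {
        start_pes(0);                 /* initialize the SHMEM run time */
        int me   = _my_pe();
        int npes = _num_pes();

        for (int i = 0; i < N; i++)
            src[i] = 100L * me + i;

        /* One-sided put: deposit src[] into dst[] on the next PE with no
         * matching receive call on the remote side. */
        shmem_long_put(dst, src, N, (me + 1) % npes);

        shmem_barrier_all();          /* wait until all puts are complete */

        printf("PE %d: dst[0] = %ld\n", me, dst[0]);
        return 0;
    }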

Array Compression Library

In the current climate of computational technology, where CPUs follow Moore's law while other components progress at a slower rate, data compression can be used to effectively increase the bandwidth of both inter-process communication and secondary storage. Application scalability can then be tuned by trading CPU cycles for reduced bandwidth requirements. Standard file compression algorithms (Huffman and LZW) are typically formulated for character-based data. A library designed for real- and integer-based data with fixed encoding schemes (skip lists with bit-packed data, variable-size encoding) would give application and library developers a way to trade CPU cycles for bandwidth or storage.
In collaboration with Mathematics and Computer Science (MCS) Division staff at Argonne National Laboratory (ANL) and staff at Pacific Northwest National Laboratory (PNNL), we have identified and implemented fixed compression algorithms that have high performance and well-defined compression ratios. These and other algorithms will be the basic functional units of the resultant library.
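
The C sketch below illustrates the kind of fixed encoding scheme referred to above, packing small non-negative integers at a fixed bit width; the function name and packing layout are illustrative and are not taken from the library under development.

    /* Sketch of fixed-width bit packing for integer arrays, the flavor of
     * fixed encoding scheme described above.  The packing width and helper
     * name are illustrative, not the library's actual interface. */
    #include <stdint.h>
    #include <stdio.h>

    /* Pack n values, each known to fit in `bits` bits (bits < 32), into the
     * output buffer.  Returns the number of 32-bit words written. */
    static size_t pack_fixed(const uint32_t *in, size_t n, unsigned bits,
                             uint32_t *out)
    {
        uint64_t acc = 0;       /* bit accumulator */
        unsigned filled = 0;    /* bits currently held in acc */
        size_t w = 0;

        for (size_t i = 0; i < n; i++) {
            acc |= (uint64_t)(in[i] & ((1u << bits) - 1)) << filled;
            filled += bits;
            while (filled >= 32) {            /* flush full 32-bit words */
                out[w++] = (uint32_t)acc;
                acc >>= 32;
                filled -= 32;
            }
        }
        if (filled > 0)                       /* flush the final partial word */
            out[w++] = (uint32_t)acc;
        return w;
    }

    int main(void)
    {
        uint32_t data[8] = { 3, 7, 1, 0, 5, 6, 2, 4 };   /* all fit in 3 bits */
        uint32_t packed[8];
        size_t words = pack_fixed(data, 8, 3, packed);
        printf("packed 8 values into %zu 32-bit word(s)\n", words);
        return 0;
    }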

Mixed Programming Models

There has been some work on applications that couple SMP-based parallelization techniques with message passing, yet the performance and ease of programming are still open research questions. In this area, Meng-Shiou Wu (graduate student in computer science) has implemented a threads-based matrix multiply algorithm using Pthreads. He is finishing the Message Passing Interface (MPI) based super-structure that will use the Pthreads routine as the node-specific algorithm. The super-structure will be responsible for moving data among nodes and is loosely based on Cannon's algorithm for distributed matrix multiplication. Initial performance metrics show the Pthreads implementation to be robust and fast on various two-processor SMP systems.
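
A minimal sketch of the node-level, threads-based kernel idea is shown below in C; it partitions the rows of the result matrix across Pthreads on an SMP node. The matrix size, thread count, and loop structure are illustrative and do not reproduce the actual implementation.

    /* Sketch of a row-partitioned Pthreads matrix multiply of the kind used
     * as an SMP node kernel; sizes and structure are illustrative only. */
    #include <pthread.h>
    #include <stdio.h>

    #define N        512
    #define NTHREADS 2          /* e.g., one thread per CPU of an SMP node */

    static double A[N][N], B[N][N], C[N][N];

    typedef struct { int row_lo, row_hi; } slice_t;

    static void *multiply_slice(void *arg)
    {
        slice_t *s = (slice_t *)arg;
        for (int i = s->row_lo; i < s->row_hi; i++)
            for (int k = 0; k < N; k++) {
                /* i-k-j order keeps the inner accesses to B and C sequential */
                double a = A[i][k];
                for (int j = 0; j < N; j++)
                    C[i][j] += a * B[k][j];
            }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        slice_t   slices[NTHREADS];

        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                A[i][j] = 1.0; B[i][j] = 2.0; C[i][j] = 0.0;
            }

        /* Each thread owns a contiguous band of rows of C. */
        for (int t = 0; t < NTHREADS; t++) {
            slices[t].row_lo = t * N / NTHREADS;
            slices[t].row_hi = (t + 1) * N / NTHREADS;
            pthread_create(&tid[t], NULL, multiply_slice, &slices[t]);
        }
        for (int t = 0; t < NTHREADS; t++)
            pthread_join(tid[t], NULL);

        printf("C[0][0] = %g (expected %g)\n", C[0][0], 2.0 * N);
        return 0;
    }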

NUG Executive Committee (NUGEX) participation

R. A. Kendall attended and participated in the National Energy Research Scientific Computing (NERSC) User Group Executive Committee (NUGEX) meeting held at Oak Ridge National Laboratory (ORNL) in June of 2000. Items of importance discussed included:
* Acceptance/testing of NERSC III Phase II (IBM SP);
* NUGEX election and meeting schedules; and
* Need for updating the Requirements Document (a.k.a. green book).

20g. Future Accomplishments

Cluster Improvements

Over the next year, Halstead, Kendall, Turner and Bode plan to continue to improve the performance characteristics of the SCL clusters. In addition, the use of Gigabit Ethernet, Myrinet and other SAN products will be investigated both as an interconnect between individual machines and as a connection between several clusters. It is expected that the cost per port for Gigabit Ethernet will continue to fall as more vendors introduce compatible hardware. This makes it more feasible to expand the cluster with Gigabit as well as 100 Mbit Ethernet, thereby increasing the applicability of the cluster solution to more communication-intensive codes. In addition, the performance of the SAN solutions is expected to improve to ~2.5 Gbit/s in the near future.

Cluster of Clusters

In order to investigate the issue of scalability in the cluster-computing arena, SCL researchers (Halstead, Bode, Kendall) are working with the University of North Texas (Mikler) examining the issues of cluster aggregation over limited bandwidth links. The use of test clusters connected by a closely monitored communication channel of known bandwidth and congestion will give invaluable insight into the problems which will arise when a single parallel code is executed over two or more remote clusters. It is expected that the revolutionary work on mobile agents (Mikler) used to monitor multiple links functioning under less than optimal conditions will prove invaluable in this effort.

In addition to the technological requirements of cluster aggregation, the SCL (Halstead) will work with other National Laboratories to reach an understanding of the security ramifications of remote cluster access. A trust infrastructure needs to be established to which users may authenticate and from which they may request resources. We fully intend to utilize the substantial work that has already been done in this area under the Globus, Akenti and other DOE 2000 projects. These solutions are based on accepted standards for authentication and trust, including LDAP (lightweight directory access protocol), PKI (public key infrastructure), NIS (network information service) and centralized certificate authorities.

IBM Cluster

In collaboration with ISU, the IBM cluster will be expanded to include several 4-processor (SMP) nodes in addition to the existing 22 nodes, to allow greater research into mixed SMP and distributed-memory programming models. In addition, the network backbone will be upgraded to allow more flexibility in the network configuration and greater bisection bandwidth. An expanded interaction with IBM may also result in the exploration of the use of Linux as the OS on the RS/6000 workstation hardware.

G4 Cluster

The existing 16-node cluster is currently being completed with an additional 16 dual-CPU nodes. Once the additional complications of supporting SMP nodes are resolved, these nodes will be fully integrated with the existing 16 nodes, resulting in a 32-node cluster fully interconnected by Myrinet. This will give us the opportunity to explore a mixed parallel programming model with both single- and dual-CPU systems based on the same processor running the same OS.

Compaq Cluster

Following the initiation of the External Technology Program with Compaq Corporation, and in collaboration with the Iowa State University Department of Chemistry, the high-speed Myrinet network should soon be ready on the Tru64 Unix cluster, once several bugs are worked out of the existing Myrinet Tru64 Unix driver. This will allow testing application scalability on both moderate speed networks such as Fast Ethernet and high-speed networks such as Myrinet. Since the Myrinet network is directly coupled to the G4 cluster, we will also be able to test cross-platform heterogeneous parallel computing problems. These clusters are also connected via a Gigabit Ethernet link to the IBM Cluster that will allow the exploration of cluster aggregation with a single high-speed, yet finite, channel.

Video Conference Collaboration

The Scalable Computing Laboratory would like to increase its communication with other research centers by evaluating and installing low cost, high performance, Video Conference hardware. This is desirable to reduce the travel burden and lower the barrier of collaboration with other sites. This project will be facilitated by the ESnet initiative to standardize on MPEG-2 quality video streams and the recent upgrade of the Ames Laboratory network connection to a 45 Mbit/s T3 onto the ESnet backbone.

File Server Upgrade

The three-year-old Silicon Graphics Inc. (SGI) dual-CPU 225 MHz R10000 file server is being replaced by a Sun Sparc Ultra 80 server. It is intended that the Network File System (NFS) data sharing protocol will be replaced by the more secure and higher performance Andrew File System (AFS) solution. The server will be directly connected to the tape backup jukebox, the Gigabit Ethernet backbone network and the new 1 TeraByte Redundant Array of Independent Disks (RAID) system.

SCL Backup/Archival Expansion

With the expansion of the file server, it is anticipated that the DLT (digital linear tape) jukebox will need to be upgraded to a faster and larger hardware solution. To this end, a dual-drive, 15-tape DLT8000-based changer will soon be in production. It is also intended that the Graphical User Interface (GUI) driver software will be replaced with a package that provides a more intuitive user interface, allowing individual users to perform basic backup and restore tasks without administrative intervention.

Parallel Compute Server for the SCL

The main compute boxes for the SCL are currently Silicon Graphics machines with a total of 14 R10000 processors, 3.5 GigaBytes of Random Access Memory (RAM), and over 50 GigaBytes of striped disk. The maintenance cost of these machines is over $15,000 per year and they are over five years old. It is expected that these machines will be replaced with a cluster of high-end multiprocessor machines connected with a dedicated System Area Network (SAN), thereby leveraging the experience gained in the SCL cluster research to a production machine.

The new machine will be a cluster of multiple Symmetric MultiProcessing (SMP) computers connected with both a high-speed commodity Ethernet, for general systems administration, and a dedicated System Area Network, for message passing and parallel applications development research. This will provide a flexible architecture, able to function both as a series of powerful shared-memory machines for serial jobs and as a single machine with inter-node communications in excess of 200 MegaBytes/sec.

Network and Desktop infrastructure upgrades

The backbone network and desktop services will be upgraded in a staggered three-year cycle to ensure that cost and service disruption are kept to a minimum. The overhead of both hardware and software is kept low by ensuring that Commercial Off The Shelf (COTS) solutions are used wherever possible. This has the additional benefit of minimizing problems arising from disparate file format and media interchange compatibility issues.

Generalized Portable SHMEM Library

The current Generalized Portable SHMEM (GPSHMEM) implementation will be augmented to include support for the FORTRAN 77 standard, which does not have pointer support. The SPLASH shared memory kernel and application benchmarks will be used to show the viability of the programming model; the SPLASH benchmarks were originally designed for the Stanford DASH system. For cluster-based computing, this effort may include the augmentation of the ARMCI engine to make it work on heterogeneous clusters, which was not a requirement of the original development. This augmentation will also allow a significant component of Global Arrays to function on heterogeneous clusters.

Parallel I/O Functionality

There are two kinds of parallel I/O required for high performance clusters. One is a global parallel file system that makes resources and data available to a parallel application; the other is a parallel interface for secondary storage of computational intermediates (e.g., scratch storage). The former is typically the Network File System (NFS) on a cluster, but this is not scalable to large numbers of nodes. The latter is typically an MPI-IO implementation or something similar. We propose to test and assess both kinds of libraries on the SCL cluster systems, including a heterogeneous environment. The focus will be ease of use, ease of administration, robustness, and scalable performance of the resultant file system. A thorough review of available file systems will be made, but possible candidates for the global parallel system include the Parallel Virtual File System from Clemson University and the Galley2 File System from Dartmouth. The application interface to scratch storage will include providing an MPI-IO implementation (of which there are several to test and evaluate) and an alternative parallel I/O library that provides 64-bit addressing, per-process files, and a shared file layer on an underlying distributed multi-computer system (e.g., clusters). The latter library will be based on the current PARIO tools used in NWChem and on porting the Distant I/O interface to the cluster environment. The Distant I/O component showed roughly a factor of 3 improvement in bandwidth on the IBM SP, which is currently its only implementation.
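
As a concrete illustration of the scratch-storage interface style to be evaluated, the C sketch below has each MPI process write its own block of a shared scratch file through MPI-IO collective I/O; the file name, block size, and layout are illustrative assumptions rather than the design of any of the candidate libraries.

    /* Sketch of MPI-IO style scratch storage: each MPI process writes its
     * own block of a shared file collectively.  File name and layout are
     * illustrative assumptions. */
    #include <mpi.h>
    #include <stdio.h>

    #define BLOCK 1024          /* doubles per process (illustrative) */

    int main(int argc, char **argv)
    {
        int rank;
        double buf[BLOCK];
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int i = 0; i < BLOCK; i++)
            buf[i] = rank + i * 1.0e-6;

        /* Every process opens the same scratch file... */
        if (MPI_File_open(MPI_COMM_WORLD, "scratch.dat",
                          MPI_MODE_CREATE | MPI_MODE_WRONLY,
                          MPI_INFO_NULL, &fh) != MPI_SUCCESS) {
            fprintf(stderr, "rank %d: could not open scratch file\n", rank);
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        /* ...and writes its block at a rank-determined offset, collectively,
         * so the MPI-IO layer can aggregate requests across the cluster. */
        MPI_Offset offset = (MPI_Offset)rank * BLOCK * sizeof(double);
        MPI_File_write_at_all(fh, offset, buf, BLOCK, MPI_DOUBLE,
                              MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }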

Array Compression Library

As discussed under Technical Progress, compressing real- and integer-based array data with fixed encoding schemes offers a way to trade CPU cycles for inter-process communication bandwidth and secondary storage, and thereby to tune application scalability.

The overall design of the library is underway. The target applications, NWChem, GAMESS, and a generalized Configuration Interaction program of Ruedenberg and co-workers, will provide the initial interface for the evolving infrastructure of the library. The infrastructure will need to be designed to couple multiple compression algorithms to maximize the compression of data. The API must expose some control to the users, who will know the kinds of precision needed in their algorithms.
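
To make the precision-control point concrete, the C sketch below truncates the mantissa of a double to a caller-chosen number of significant bits, the kind of user-selected precision knob such an API might expose. The function name and truncation scheme are hypothetical illustrations, not the library's actual interface.

    /* Hypothetical illustration of precision control: truncate the mantissa
     * of a double to a caller-chosen number of significant bits.  Names and
     * the truncation scheme are illustrative, not the library design. */
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* Keep only `sig_bits` of the 52-bit mantissa; the rest are zeroed so
     * the value becomes more compressible at a known, user-chosen precision. */
    static double truncate_mantissa(double x, int sig_bits)
    {
        uint64_t bits;
        memcpy(&bits, &x, sizeof(bits));            /* reinterpret safely */
        uint64_t mask = ~((UINT64_C(1) << (52 - sig_bits)) - 1);
        bits &= mask;
        memcpy(&x, &bits, sizeof(x));
        return x;
    }

    int main(void)
    {
        double v = 3.141592653589793;
        for (int b = 8; b <= 32; b += 8)
            printf("%2d mantissa bits kept: %.15f\n", b, truncate_mantissa(v, b));
        return 0;
    }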

Domain Specific Application Programmer Interfaces

Many computational applications utilize tools and libraries that expedite parallelization of the application. Unfortunately, few applications have taken the next step of abstracting the computational physics, chemistry, biology, etc. functionality that can be separated from the main body of the application into domain specific application programmer interfaces (DSAPIs). This abstraction can be done in such a way that the components can be coupled in either a loose or tight fashion, can be used to provide a fault tolerant framework, can facilitate application longevity by providing an implementation-independent application structure, or can provide a rapid prototyping system specific to the computational science field. This subtask is twofold. First, we plan to document the decisions and abstraction mechanisms for NWChem, GAMESS, and other chemistry and physics codes available for parallel systems. Second, following our outreach efforts, as we assist others in porting their applications to the cluster environment we will also investigate their needs for DSAPIs and guide them in their design and utilization. This will in turn broaden our own understanding of the optimal design approaches for DSAPIs.

Mixed Programming Models and Matrix Benchmark

Once the MPI-based algorithm is complete, tested, and its performance measured, an additional matrix multiply algorithm will be implemented using the Global Arrays distributed shared memory programming model. This will provide an analysis of the possibilities of using mixed programming models where one is an asynchronous memory programming model (e.g., remote put and get operations). One applied outcome of our research effort will be the development of a tunable benchmark. There are a large number of benchmarks available that measure CPU performance, the bandwidth and latency of memory, the performance of the communication network, and the performance of the secondary storage sub-system. Rarely do these benchmarks measure more than two or three of these sub-systems in concert, and the secondary storage sub-system is often ignored completely. The mixed programming model matches the paradigm of systems coming from the vendor community. A benchmark that measures all four sub-systems that any real scientific application utilizes will be of benefit to the computational science community. By designing an out-of-core algorithm for matrix multiplication that utilizes a mixed programming paradigm, with user-imposable limits on the sub-systems, we can "simulate" how applications may perform on various computational resources.
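
The C sketch below illustrates the out-of-core blocking such a benchmark would build on: panels of the input matrices are staged from scratch files through small in-core buffers, so the in-memory fraction of the problem, and hence the load placed on the memory and storage sub-systems, can be tuned. The file layout, names, and block size are illustrative assumptions, and the message passing layer is omitted.

    /* Sketch of out-of-core blocked matrix multiplication; the panel size NB
     * is the tunable knob trading memory use for disk traffic.  File names
     * and layout are illustrative assumptions. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define N  1024              /* global matrix order (illustrative) */
    #define NB 256               /* in-core panel size; the tunable knob */

    /* Read one NB x NB tile (bi, bj) of an N x N matrix stored row-major. */
    static void load_tile(FILE *f, int bi, int bj, double *tile)
    {
        for (int r = 0; r < NB; r++) {
            long off = ((long)(bi * NB + r) * N + bj * NB) * sizeof(double);
            fseek(f, off, SEEK_SET);
            if (fread(&tile[r * NB], sizeof(double), NB, f) != NB) {
                fprintf(stderr, "short read\n");
                exit(1);
            }
        }
    }

    int main(void)
    {
        FILE *fa = fopen("A.bin", "rb"), *fb = fopen("B.bin", "rb");
        if (!fa || !fb) { perror("fopen"); return 1; }

        double *a = malloc(NB * NB * sizeof(double));
        double *b = malloc(NB * NB * sizeof(double));
        double *c = malloc(NB * NB * sizeof(double));
        if (!a || !b || !c) return 1;

        /* For each in-core tile of C, stream the matching panels of A and B
         * from disk and accumulate; only three NB x NB tiles live in memory. */
        for (int bi = 0; bi < N / NB; bi++)
            for (int bj = 0; bj < N / NB; bj++) {
                memset(c, 0, NB * NB * sizeof(double));
                for (int bk = 0; bk < N / NB; bk++) {
                    load_tile(fa, bi, bk, a);
                    load_tile(fb, bk, bj, b);
                    for (int i = 0; i < NB; i++)
                        for (int k = 0; k < NB; k++)
                            for (int j = 0; j < NB; j++)
                                c[i * NB + j] += a[i * NB + k] * b[k * NB + j];
                }
                /* c would be written back to a scratch C file here. */
            }

        free(a); free(b); free(c);
        fclose(fa); fclose(fb);
        return 0;
    }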

NUG Executive Committee (NUGEX) participation

R. A. Kendall is up for re-election in early CY 2001. If re-elected, he will continue to serve on the NERSC User Group Executive Committee, which advises NERSC on the computational infrastructure issues of their facility. He is the chair of the elections and nominating sub-committee and a member of the sub-committee working on the next generation "User's Requirements" document (a.k.a. the green book).

20h. Relationships To Other Projects

During FY2000, many important advances were made in strategic collaborations between the SCL and several organizations critical to the execution of efficient research. These collaborations include:
* Parallel resource scheduling with Argonne National Laboratory and the Maui High Performance Computing Center.
* Communications performance research with IBM, NERSC and LBNL.
* Development of a scalable, distributed Hartree-Fock code with the Chemistry Department.
* Participation in the DOE High-End Open System Group path forward series.
* Continued active participation in the NSF/DOE JPC4 cluster workshops initiated by Ames Laboratory in 1997.
* Multiple development projects with Ames Laboratory and ISU researchers to parallelize their code.
* Cluster development and parallel applications research in collaboration with IBM and ISU.
* Collaboration with Terra Soft on the production of its Yellow Dog Linux operating system in a parallel cluster environment running on the G4 hardware architecture.
* Initiation of an External Technology Program collaboration with Compaq.

It is expected that the IBM SUR grant will initiate strong ties between that company, ISU and the SCL, both in the area of parallel computing and in the wide utilization of the chemistry and physics codes produced by the local research groups.

In order to encourage the use of parallel application development within the SCL and the wider ISU community, the laboratory is collaborating with many groups in parallelizing their applications to run on the 64 dual processor Fast Ethernet cluster. The following is a list of users, their applications and the current stage of development of their code:

+ Dr. Rodney Fox (rofox@iastate.edu), Chemical Engineering. Evaluating hardware for modeling mixed-phase complex fluid dynamics.
+ Dr. Mark Bryden (kmbryden@iastate.edu), Engineering. Purchasing a cluster for simulation and visualization of complex fluid flow systems.
+ Wade W. Huebsch (airedale@iastate.edu), Aerospace Engineering and Engineering Mechanics. Completed parallelization of airplane wing simulation code.
+ Dr. Mihail Sigalas (sigalas@ameslab.gov), Physics Department. Completed parallelization of waveguide simulation code and ported it to the cluster, Paragon and T3E.
+ Ihab F. El-Kady (ihab@iastate.edu), Physics Department. Completed parallelization of photonic bandgap simulation code.
+ Feyzi Inanc (financ@cnde.iastate.edu), Center for Nondestructive Evaluation. Started to parallelize X-ray scattering code.
+ Dr. Hao Wang (wanghao@cs.iastate.edu), Computer Science Department. Porting meteorology code.
+ Dr. Volker Brendel (volker@gremlin1.zool.iastate.edu), Molecular Biology Department. Parallelized code for use in genetic matching project.

We have also begun collaborations with several ISU professors for work related to the performance measurement goals:

+ Prof. S. Mitra, ISU Computer Science Dept., use of performance measurement tools in computer architecture courses (graduate student projects).
+ Prof. J. Wong, ISU Computer Science Dept., performance analysis of web services (with graduate student C. Kuppireddy, who has previously worked on a similar analysis of NetPipe).
+ Prof. D. Rollins, ISU Chemical Engineering and Statistics Depts., analysis of measurement data from microprocessors and distributed systems.

The compression library effort involves Dr. Mike Minkoff, Dr. Ron Shepard, and Dr. Al Wagner from Argonne National Laboratory and Dr. Robert Harrison from Pacific Northwest National Laboratory. Mr. Weiye Chen (ISU) and Mike Dvorak (ANL student) have implemented the initial fixed-length compression algorithms.

The GPSHMEM, part of the Parallel I/O, and the DSAPI work will be done in collaboration with Drs. Jarek Nieplocha, Jeffrey A. Nichols, and Theresa Windus at Pacific Northwest National Laboratory and the GAMESS Group (Prof. Mark Gordon and co-workers) at Iowa State University. Any software SCL develops under these tasks will be used by NWChem and GAMESS and will be made available to the public via the normal PNNL/ISU distribution and support mechanisms.

The SCL has initiated several new collaboration efforts with Dr. Jarek Nieplocha and Drs. Jeffrey A. Nichols, Robert J. Harrison and Theresa Windus at Pacific Northwest National Laboratory.



Ames Laboratory, US Department of Energy,
Iowa State University, Ames, Iowa
