Atmospheric Four Dimensional Data Assimilation:
Applications in High Performance Computing

Principal Investigator: P. M. Lyster
Department of Meteorology
and Joint Center for Earth System Science
University of Maryland College Park

Abstract:

This is the first annual report under the new Cooperative Agreement (NCCS5-150): Grand Challenge Applications and Enabling Scalable Computing Testbed(s) in Support of High Performance Computing. It briefly summarizes the work of this PI team for the first three years of the High Performance Computing and Communications (HPCC) Grand Challenge project (1993-1995) and for the first six months (April 1 to September 30, 1996) of the new Cooperative Agreement. This is a collaborative project that has involved scientists from the NASA/Goddard Space Flight Center Data Assimilation Office (DAO, Head: R. Rood), the University of Maryland Joint Center for Earth System Science (JCESS), the Jet Propulsion Laboratory, and Syracuse University.

1. Scientific and Computational Objectives

Four Dimensional Data Assimilation (4DDA) involves using climate models of the Earth system (atmosphere, land surface, and ocean) and estimation-theoretic methods to meld real-world observations into the model. Figure 1 shows a schematic of a data assimilation system. The fact that the observations are of unknown quality, and that they are inhomogeneously distributed in space and time, makes this a theoretically and computationally difficult problem. 4DDA attempts to provide the best estimate of the evolving state of the Earth system by extracting the maximum amount of information from the available observations. The scientific output, or product, is gridded, best-estimate, consistent datasets that can be used by scientists studying the Earth's climate. Of particular importance are reanalyses of decadal datasets from the past up to the present, and the need to produce these with rapid turnaround. A second product of the research is increased knowledge of the science of data assimilation through the ability to handle large problems promptly. Data assimilation is receiving attention in a number of important fields of science, including weather forecasting, space weather, oil recovery, and the accurate initialization of partial differential equations.
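
For illustration, the melding of observations into the model forecast can be written in standard minimum-variance (statistical analysis) notation; this is the generic textbook form, not an excerpt from the GEOS-DAS documentation:

    \[
      x^a = x^f + K\,\bigl(y^o - H x^f\bigr), \qquad
      K = P^f H^T \bigl(H P^f H^T + R\bigr)^{-1},
    \]

where x^f is the model forecast state, y^o the observations, H the observation operator, P^f and R the forecast- and observation-error covariances, and x^a the resulting analysis. The observation-space matrix H P^f H^T + R is the large system whose generation and solution dominate the computational cost.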

The primary objective of this HPCC Cooperative Agreement is to use the computing load of 4DDA to explore the computing limits of key algorithms at the Goddard Data Assimilation Office (DAO), and in doing so to advance the science of high performance computing.

The key algorithms under study are the statistical analysis-solve routines (PSAS), the Kalman filter, and the quality control and Optimal Interpolation (OI) analysis modules, described in the sections below.

The key aspects of the computing problem are: processor speed, main-memory size, memory-access speed (including inter-processor as well as on-processor communication bandwidth), and I/O. In the coming years the computing requirements of the DAO are estimated to be:

Year    Sustained gigaflop/s    Storage (terabytes)
1995             0.5                   13
1998            50                     43
2000           150                     70

These numbers were obtained from present-day performance figures and from conservative projections based on realistic trends in the Mission to Planet Earth (MTPE) and Earth Observing System Data and Information System (EOSDIS) programs in a budget-constrained environment. In 1998 the DAO will deliver an operational data assimilation system to EOSDIS, with continuing support beyond that.

2. Technical Approach

The work on 4DDA (especially the PSAS and Kalman filter algorithms) is computationally intensive in terms of memory speed, memory volume, and data throughput. Hence these applications are suited to testing most aspects of high performance computing in the coming years.

The principal segments of the GEOS-DAS are the model (GEOS-GCM), the data quality control (QC) modules, and the statistical analysis-solve routines (PSAS). For the model, the DAO will use the algorithm developed by the Goddard Grand Challenge team of Max Suarez. The quality control modules have been parallelized by Miloje Makivic and Gregor von Laszewski of Syracuse University's Northeast Parallel Architectures Center. The scientific algorithm for PSAS has been developed at the DAO from 1991 to the present. A key component is the solution of a large non-sparse matrix system, of dimension equal to the number of observations, using a conjugate gradient algorithm. At JPL, Hong Ding, Robert Ferraro, and Don Gennery have worked on the parallel message-passing PSAS operational analysis code, using the Intel computers at the California Institute of Technology. At the University of Maryland and NASA/GSFC, Peter Lyster, Steven Cohn, Richard Ménard, L.-P. Chang, and Richard Rood have developed the Kalman filter algorithm; they have also used the Intel machines. The domain-decomposition approach is described in Lyster et al. (1996) and is significant in that the scientific and numerical algorithms were enabled by the availability of large-memory, fast computers.
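
For illustration, the conjugate gradient iteration at the heart of the analysis solve is sketched below in present-day Python/NumPy. This is a minimal serial sketch of the standard algorithm, not the DAO's parallel message-passing implementation, in which the matrix-vector products are distributed across processors:

    import numpy as np

    def conjugate_gradient(matvec, b, tol=1.0e-6, max_iter=500):
        """Solve A x = b for symmetric positive-definite A.

        matvec(v) returns A v; in a parallel code this product is the
        distributed (and dominant) operation."""
        x = np.zeros_like(b)
        r = b - matvec(x)            # residual
        p = r.copy()                 # search direction
        rs = r @ r
        for _ in range(max_iter):
            Ap = matvec(p)
            alpha = rs / (p @ Ap)
            x += alpha * p
            r -= alpha * Ap
            rs_new = r @ r
            if np.sqrt(rs_new) <= tol * np.linalg.norm(b):
                break
            p = r + (rs_new / rs) * p
            rs = rs_new
        return x

    # Toy usage with a small random symmetric positive-definite matrix
    # (the operational system has dimension equal to the number of observations):
    n = 200
    M = np.random.rand(n, n)
    A = M @ M.T + n * np.eye(n)
    b = np.random.rand(n)
    x = conjugate_gradient(lambda v: A @ v, b)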

3. Scientific Accomplishments

The PSAS algorithm represents a significant advance in the science of three-dimensional variational assimilation (3D-VAR), which is in use at centers around the world. The main scientific development has been carried out by scientists under the direction of Steven Cohn, Arlindo da Silva, and Jing Guo at the DAO. The HPCC Grand Challenge project (1993-1995) did not directly contribute to the scientific development of PSAS; however, access to more powerful computers under the present Cooperative Agreement will hasten that development.

The Kalman filter (KF) represents a rigorous approach to 4DDA that minimizes ad hoc approximations. The algorithm dynamically evolves not only the system state (wind, temperature, etc.) but also the error correlations between these quantities. Therefore this algorithm also makes considerable demands on floating-point speed and main memory. A two-dimensional Kalman filter was developed by Peter Lyster, Steven Cohn, Richard Ménard, and L.-P. Chang (Lyster et al. 1996) to study the transport of trace chemical constituents in the middle atmosphere. For reference, a single correlation matrix for an application at roughly two-degree horizontal (latitude-longitude) resolution takes about one gigabyte of memory (single precision). This is the first known implementation of a brute-force Kalman filter. The algorithm represents a key advance in science and technology that could not have been achieved without the development of large-memory, high-speed parallel computers. Figure 2 is a Kalman-filter-generated map of methane in the upper stratosphere, using data obtained from the UARS HALOE satellite instrument. This is notable because the observations (black triangles) are extremely sparse, yet the coverage of the gridded data is almost globally complete.
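
A back-of-envelope sketch of the memory requirement quoted above is given below; the grid dimensions are assumed for illustration and are not taken from Lyster et al. (1996):

    # Memory for a full n x n error correlation matrix, where n is the number
    # of horizontal grid points of the 2D constituent field.
    def correlation_gigabytes(n_lat, n_lon, bytes_per_element=4):  # 4 bytes = single precision
        n = n_lat * n_lon
        return n * n * bytes_per_element / 2.0**30

    # Hypothetical global grids of roughly two-degree resolution:
    print(correlation_gigabytes(91, 144))   # ~0.6 GB
    print(correlation_gigabytes(91, 181))   # ~1.0 GB, consistent with the
                                            # "about one gigabyte" figure above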

4. Technological Accomplishments

As of 1992, none of the DAO codes had been ported to massively parallel processors, nor was there any parallel-computing expertise in the project beyond the PI. The distributed management of this multi-institutional project has since been conducted with considerable success.

A summary of the significant achievements follows.

The advanced Physical-space Statistical Analysis System (PSAS) challenges both the floating-point speed and the physical memory of the future analysis system at the DAO. With the current worldwide observational network providing about 100,000 observations every six-hour period, the storage of a correlation matrix requires around 20 gigabytes (double precision). In preliminary work on PSAS at JPL, Hong Ding, Don Gennery, and Robert Ferraro have achieved a speed of 18.3 gigaflop/s for the key analysis-solve routines (matrix generation and solve) on 512 processors of the Intel Paragon. This equates to at least a 30-fold speedup over the same code on a single processor of the Cray C90, and is expected to enable the metric of 30 days of assimilated data per wall-clock day to be maintained in the coming years. Figure 3 shows the scaling of PSAS on the Intel Paragon. Recent work (as of August 1996) has involved converting the parallel PSAS to the MPI message-passing library and porting the algorithm to the Cray T3D at the Jet Propulsion Laboratory. Initial tests show that the BLAS library calls are 80% as efficient as those on the Intel machines; therefore the 10 gigaflop/s milestone of the present Cooperative Agreement is expected to be met (see the section on Status and Plans).
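
The throughput metric can be illustrated with a simple wall-clock budget. This sketch assumes four 6-hourly analysis cycles per assimilated day, as suggested by the six-hour observation window above, and charges the whole cycle (model, QC, and analysis) against the budget:

    # Wall-clock budget per analysis cycle implied by the DAO metric of
    # 30 assimilated days per wall-clock day.
    seconds_per_wallclock_day = 24 * 3600
    assim_days_per_wallclock_day = 30
    cycles_per_assim_day = 4   # 6-hourly observation/analysis cycles (assumed)

    budget = seconds_per_wallclock_day / (assim_days_per_wallclock_day * cycles_per_assim_day)
    print(f"about {budget:.0f} seconds of wall clock per 6-hour cycle")   # ~720 s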

The Kalman filter (Lyster et al. 1996), described in Section 3, likewise makes heavy demands on floating-point speed and main memory. Its peak speed, at the horizontal resolution described there, is 1.3 gigaflop/s at single precision on 512 processors of the Intel Paragon (1 hour of wall-clock time per day of assimilation).

As part of the interdisciplinary collaboration with Syracuse University's Northeast Parallel Architectures Center (NPAC), Gregor von Laszewski is completing a Ph.D. thesis in computer science with Earth science applications (graduation, September 1996). He has studied optimization and load balancing for the operational Optimal Interpolation (OI) analysis system. His algorithm has achieved 400 megaflop/s for the analysis of 80,000 observations on 40 processors of an IBM SP-2. This work, and a subsequent publication (von Laszewski et al. 1995), are expected to form the basis for future parallel analysis algorithms, including ocean data assimilation codes.
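
Load balancing of this kind can be illustrated with a simple greedy scheme that assigns geographic regions (with unequal observation counts) to the least-loaded processor. This is a generic sketch and is not intended to reproduce the NPAC algorithm:

    import heapq

    def balance_regions(obs_counts, n_procs):
        """Greedily assign regions to processors to even out observation counts.

        obs_counts: observation count per geographic region.
        Returns a list of region indices assigned to each processor."""
        heap = [(0, p) for p in range(n_procs)]       # (current load, processor id)
        heapq.heapify(heap)
        assignment = [[] for _ in range(n_procs)]
        # Place the heaviest regions first for a better balance.
        for region in sorted(range(len(obs_counts)), key=lambda r: -obs_counts[r]):
            load, p = heapq.heappop(heap)
            assignment[p].append(region)
            heapq.heappush(heap, (load + obs_counts[region], p))
        return assignment

    # Toy usage: 16 regions with uneven observation counts on 4 processors.
    counts = [1200, 75, 430, 980, 20, 310, 660, 540, 90, 870, 150, 720, 60, 400, 230, 510]
    print(balance_regions(counts, 4))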

Working on the quality control part of the analysis algorithm, Miloje Makivic at Syracuse has achieved 430 megaflop/s (3.1 gigaflop/s peak) using data-parallel code on the Thinking Machines CM-5. This exceeded the metric that was set in 1992.

A number of modules have been submitted to the HPCC Software Exchange (http://sdcd.gsfc.nasa.gov/ESS/grand-challenges.html#rood), including data decomposition for data-parallel transport codes and a global matrix solve suitable for the Kalman filter. The latter was a key algorithm that led to the understanding that global transpose operations are not necessarily detrimental to performance on parallel computers, provided they are performed efficiently. This is borne out in the successful implementation of the Kalman filter, and has been discussed by Peter Lyster and other researchers for more general applications at a workshop on numerical weather forecasting at the European Centre for Medium-Range Weather Forecasts (Hoffmann and Kreitz, 1995).
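
The point about efficient global transposes can be illustrated with a minimal distributed transpose written in present-day Python with mpi4py (an illustration only; the original codes used Fortran message passing). Each process owns a block of columns of an N x N matrix, exchanges sub-blocks in one all-to-all operation, and reassembles its block of columns of the transpose:

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    nprocs, rank = comm.Get_size(), comm.Get_rank()

    N = 8 * nprocs                  # toy global dimension (assumed divisible by nprocs)
    nb = N // nprocs                # width of each process's block of columns

    # Each process holds columns [rank*nb, (rank+1)*nb) of the global matrix A.
    A_cols = np.random.rand(N, nb)

    # Send the s-th block of rows of my columns to process s ...
    send_blocks = [np.ascontiguousarray(A_cols[s*nb:(s+1)*nb, :]) for s in range(nprocs)]
    # ... and receive block (rank, s) of A from every process s.
    recv_blocks = comm.alltoall(send_blocks)

    # Rows [rank*nb, (rank+1)*nb) of A, gathered across all columns ...
    row_slab = np.hstack(recv_blocks)            # shape (nb, N)
    # ... become this process's block of columns of the transpose.
    AT_cols = row_slab.T                         # shape (N, nb)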

5. Status and Plans: Milestones for the 1996-1999 Cooperative Agreement

This 4DDA principal investigator proposal to the Cooperative Agreement: Grand Challenge Applications and Enabling Scalable Computing Testbed(s) in Support of High Performance Computing was successful, and in the spring of 1996 a series of negotiations was held between the PI and members of the NASA/Goddard Project Team. This resulted in a number of amendments to the proposal. List 1 shows the updated milestones for the project. These are the milestones by which payments will be made to the PI team under the new format of the Cooperative Agreement with NASA. There is now no involvement of Syracuse University in the project, and it is intended that some of the work will be subcontracted to the Jet Propulsion Laboratory.

LIST 1: PROJECT MILESTONES
Principal Investigator: Peter Lyster

  1. Completed update of agreement including negotiated milestones and software deliverables.
    Jul 31 96
  2. Submitted FY96 Annual Report to ESS Project via the WWW.
    Aug 15 96
  3. Achieved at least 10 Gigaflop/s sustained on core-PSAS conjugate gradient algorithm on the GSFC testbed (delivered with scaling analysis).
    Aug 31 96 NASA FY96
  4. Delivered documented version of core PSAS which achieved 10 Gigaflop/s milestone to the National HPCC Software Exchange (NHSE)
    Feb 15 97
  5. Submitted FY97 Annual Report to ESS Project via the WWW.
    Aug 15 97 NASA FY97
  6. Achieved at least 50 gigaflop/s sustained on GEOS-end-to-end (delivered with scaling analysis).
    Dec 15 97
  7. Delivered documented version of code suite which achieved 50 Gigaflop/s milestone to the NHSE.
    Feb 15 98
  8. Submitted FY98 Annual Report to ESS Project via the WWW.
    Aug 15 98
  9. Achieved at least 100 gigaflop/s sustained on GEOS-end-to-end (delivered with scaling analysis).
    Aug 15 98
  10. For the Kalman/Lagrangian filter, achieved a 200-fold speedup over the performance on a single-node Cray C90 (delivered with scaling analysis).
    Aug 15 98 NASA FY98
  11. Delivered documented version of code suite which achieved 100 Gigaflop/s milestone to the NHSE. Part of this should be documentation of GEOS-end-to-end including the significant contributions that the HPCC program had in developing guidelines for the DAO's 1998 production system.
    Feb 15 99
  12. Final report
    Apr 15 99 NASA FY99

TABLE 2: The configuration of the computers to be used in the Cooperative Agreement testbed (Cray Research was the successful vendor). The Cray T3D will be installed at NASA/Goddard Space Flight Center in the summer of 1996 and will be used to achieve the 10 gigaflop/s milestone. The T3E is scheduled for early 1997 and will be used to achieve the 50 gigaflop/s milestone. In addition, Cray will supply, offsite, a T3E with (probably) 1000 pe's to enable the 50 and 100 gigaflop/s milestones. The timeframe for the milestones of this PI project has been synchronized with the availability of this machine.

Machine                    Cray T3D             Cray T3E
Number of pe's             512                  384 (at least)
Memory per pe              64 Mbytes            128 Mbytes
Max speed per pe           150 megaflop/s       600 megaflop/s
Achievable speed per pe    10-30 megaflop/s     30-120 megaflop/s
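
As a rough consistency check, using only the figures in Table 2, the aggregate achievable speeds of the testbed machines can be compared against the milestones:

    # Aggregate achievable speed = number of pe's x achievable speed per pe.
    def aggregate_gigaflops(n_pe, mflops_per_pe_range):
        lo, hi = mflops_per_pe_range
        return n_pe * lo / 1000.0, n_pe * hi / 1000.0

    print(aggregate_gigaflops(512, (10, 30)))    # Cray T3D:  ~5 to ~15 gigaflop/s
    print(aggregate_gigaflops(384, (30, 120)))   # Cray T3E: ~12 to ~46 gigaflop/s (at least 384 pe's)

The 10 gigaflop/s milestone thus falls within the T3D's achievable range, while the 50 and 100 gigaflop/s milestones point to the larger offsite T3E mentioned in the Table 2 caption.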

Point of Contact
Dr. Peter M. Lyster
The University of Maryland
College Park
Email: lys@dao.gsfc.nasa.gov
Telephone: (301) 805-6960
Fax: (301) 805-7960

FIGURES

4DDA Schematic Graphic

Figure 1. A schematic of the DAO data assimilation system. Input and output data types are representative of the current system. Note the larger, more complete output data sets.

Haloe Graphic

Figure 2. Kalman filter map of methane in the upper stratosphere, using data obtained from the UARS HALOE satellite instrument. Note the sparse observations, represented by the black triangles.

PSAS Scaling

Figure 3. The scaling curve for the PSAS solver on the Intel Paragon parallel computer at the California Institute of Technology. The maximum is 18,320 megaflop/s (18.3 gigaflop/s) on 512 processors.

Significant Publications

Ding, H. and R. Ferraro, 1995a: A General Purpose Parallel Sparse Matrix Solvers Package, Proceedings of the 9th International Parallel Processing Symposium, p. 70, April 1995.

Ding, H. and R. Ferraro, 1995b: Slices: A Scalable Concurrent Partitioner for Unstructured Finite Element Meshes, Proceedings of SIAM 7th Conference for Parallel Processing, p. 492.

von Laszewski, G., M. Seablom, M. Makivic, P. M. Lyster, and S. Ranka, 1995: Design Issues for the Parallelization of an Optimal Interpolation Algorithm. Coming of Age: Proceedings of the Sixth ECMWF Workshop on the Use of Parallel Processors in Meteorology, Eds. G.-R. Hoffmann and N. Kreitz, 290-302, World Scientific; to be submitted to Concurrency: Practice and Experience.

Lyster P.M., S.E. Cohn, R. Ménard, L.-P. Chang, S.-J. Lin, and R. Olsen, 1996: An Implementation of a Two-Dimensional Kalman Filter for Atmospheric Chemical Constituent Assimilation on Massively Parallel Computers. Accepted for publication in Mon. Wea. Rev., 26 pp.


