Second Code Improvement Completed
Numerical Simulations For Active Tectonic Processes: Increasing Interoperability And Performance
JPL Task Order: 10650
GeoFEST Documentation Only
Milestone G - Code Improvement
Due date: 6/30/2004
First code improvement (functional enhancement and speedup). Documented source code made publicly available via the Web.
· PARK on 1024 CPU machine with 400,000 elements, 50,000 time steps in 5 times the time of the baseline code
· GeoFEST (assuming availability of an 880 processor machine): 16M elements, 1000 time steps in the same time as the baseline code, using the PYRAMID AMR libraries
· Virtual California with N = 700 segments for 10,000 time steps in 1 hour or less; MPI parallel implementation, running on an M-processor machine with 2 GB of memory per CPU; speedup of approximately M/2 on up to 256 processors; investigation of fast multipole method for this code
· PARK on 256 CPU machine with 150,000 elements, 5,000 time steps in the same time as the baseline case
· GeoFEST links to PYRAMID and runs on a parallel machine; produce a plot of scaled speedup that will show that we are maintaining efficiency as the number of processors and problem size increase. Assuming availability of a 64 CPU Beowulf: 1,250,000 elements, 1000 time steps, in the same time as the baseline case.
Jet Propulsion Laboratory
Mail Stop 183-335
4800 Oak Grove Drive
Pasadena, CA 91109-8099
818-354-4737
Michele Judd:
Jet Propulsion Laboratory
Mail Stop 183-335
4800 Oak Grove Drive
Pasadena, CA 91109-8099
Jet Propulsion Laboratory
Mail Stop 238-600
4800 Oak Grove Drive
Pasadena, CA 91109-8099
818-354-6790
Community Grid Computing Laboratory
Indiana University
501 N. Morton, Suite 224
Bloomington, IN 47404-3730
gcf@indiana.edu
812-856-7977
Brown University
Box 1846
Providence, RI 02912-1846
Terry_Tullis@Brown.edu
401-863-3829
Jet Propulsion Laboratory
Mail Stop 183-335
4800 Oak Grove Drive
Pasadena, CA 91109-8099
818-354-4737
Jet Propulsion Laboratory
Mail Stop 238-600
4800 Oak Grove Drive
Pasadena, CA 91109-8099
818-354-6790
Community Grid Computing Laboratory
Indiana University
501 N. Morton, Suite 224
Bloomington, IN 47404-3730
812-856-1212
Community Grid Computing Laboratory
Indiana University
501 N. Morton, Suite 224
Bloomington, IN 47404-3730
812-856-7977
Professor
Computer Science Department
University of Southern California
Los Angeles, CA 90089-0781
213-740-4504
Center for Computational Science and Engineering
U. C. Davis
Davis, CA 95616
530-752-6416
Center for Computational Science and Engineering
U. C. Davis
Davis, CA 95616
Jet Propulsion Laboratory
Mail Stop 126-347
4800 Oak Grove Drive
Pasadena, CA 91109-8099
818-354-6920
Michele Judd:
Jet Propulsion Laboratory
Mail Stop 183-335
4800 Oak Grove Drive
Pasadena, CA 91109-8099
818-354-4994
Anne Chen:
University of Southern California
Mail Code 0781
3651 Trousdale Parkway
Los Angeles, CA 90089-0742
213-740-7285
Michele Judd:
Jet Propulsion Laboratory
Mail Stop 183-335
4800 Oak Grove Drive
Pasadena, CA 91109-8099
818-354-4994
Community Grid Computing Lab
Indiana University
501 N. Morton, Suite 224
Bloomington, IN 47404-3730
812-856-1212
University of California, Irvine
Environmental Analysis and Design
Irvine, CA 92697-7070
949-824-5491
Jet Propulsion Laboratory
Mail Stop 126-347
4800 Oak Grove Drive
Pasadena, CA 91109-8099
818-393-5353
University of California, Irvine
Environmental Analysis and Design
Irvine, CA 92697-7070
Jet Propulsion Laboratory
Mail Stop 300-233
4800 Oak Grove Drive
Pasadena, CA 91109-8099
Margaret.T.Glasscoe@jpl.nasa.gov
818-393-4834
AD HOC Team Member
Jet Propulsion Laboratory
Mail Stop 169-315
4800 Oak Grove Drive
Pasadena, CA 91109-8099
818-393-3920
354-4350
This milestone report documents Milestone F of the QuakeSim project (Numerical Simulations For Active Tectonic Processes: Increasing Interoperability And Performance) for NASA's Earth Science Technology Office, Computational Technology Program. The text of the milestone appears on Page 1, and requires demonstration of substantially larger problems than Milestone E (7/30/02; see Table 1) for the now-parallel earthquake codes PARK and GeoFEST. Milestone F also requires demonstration of parallel scaling.
In the next section we describe the large problems that demonstrate code improvement for the PARK and GeoFEST codes. Then we give details for the PARK code (code description, algorithm, numerical method, documentation, scaling analysis, and scientific and computational significance), and for the GeoFEST code (covering the same topics).
Code | Machine, Wallclock Time | Processors | Date | Elements | Time Steps
GeoFEST Milestone E | Solaris workstation (JPL), 13.7 hours | 1 | July 30, 2002 | 55,369 | 1,000
GeoFEST Milestone F | Thunderhead (GSFC), 2.8 hours | 64 | September 1, 2003 | 1,400,198 | 1,000
GeoFEST Milestone G | Cosmos (JPL), 12.0 hours (under the 13.7 allowable) | 490 (well under the 880 allowable) | March 26, 2005 | 16,021,034 | 1,000
Table 1: Computer runs demonstrating baseline (Milestone E), Milestone F, and Milestone G performance enhancements for the GeoFEST application.
Table 1 summarizes the results for the improved code demonstration runs. Additional runs demonstrating parallel scaling are detailed in Section 4.
The nature of the code improvements we describe is that problems of much larger size than the baseline case are solved in the same time, by use of advanced computing technology (multipole methods for PARK, domain partitioning for GeoFEST, and efficient MPI parallel coding for both PARK and GeoFEST).
For the PARK code, the problem demonstrating the improvement over baseline has the geometry shown in Figure 1. The boundary conditions (appropriate for the geographic setting at Parkfield, CA) are the same as for the baseline case, but the mesh density is increased from 15,000 to 150,000 rectangular elements, and the problem duration is extended from 500 to 5,000 time steps. Traditional methods lead to work scaling with the square of the number of elements, and linearly with time steps. Our improvements are due to efficient parallel implementation and use of a multipole solution technique that scales much better than quadratically in element count.
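As a rough illustration of the savings (our arithmetic; the two methods' constant factors are omitted), compare per-time-step interaction counts for the N = 150,000 element run:

\[ N^2 = 2.25 \times 10^{10} \qquad \text{versus} \qquad N \log_2 N \approx 150{,}000 \times 17.2 \approx 2.6 \times 10^{6}, \]

a reduction of roughly four orders of magnitude in pairwise evaluations per time step.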
Figure 1. Views of grid geometry and distribution of
constitutive properties for PARK at three different scales. A. shows the overall geometry. The
numbers are distances in km, both vertically and horizontally from the origin,
which is at the earth’s surface at Middle Mountain, directly above the
hypocenter of actual Parkfield earthquakes. Boundary conditions of 35 mm/yr
slip rate are applied on the light brown areas, while the blue area is locked
with zero slip rate. B. Closer view
of the grid geometry. The large nearly solid red area is a region with square
elements ~65 m on a side. Within it is an area at about 10 km deep and 3 km
horizontally that appears solid red at this magnification, with ~22 m squares
and ~7.4 m squares. C. Same area as
in (B) showing contours of the
constitutive parameter a-b without
the grid. Instability can occur in areas colored brown, the darkness being
proportional to the tendency to be unstable. D. Magnification of the finest area, showing the ~7.4 m elements
and part of the area with ~22 m elements. As in (C) the color shows the intensity of a-b.
For GeoFEST, the problem geometry is shown in Figure 2. Elastic and viscoelastic deformation is simulated for 1.4 million finite elements, for a duration of 1000 time steps. The significant increase is in the number of elements, which was about 55,000 for the baseline case (also running 1000 time steps). Note that the baseline case was modeled on the 1994 Northridge earthquake, using a single fault of 300 square km in a domain of 240 x 240 x 100 km. The improved demonstration case is based on the 1992 Landers event, using three closely arranged faults within an 865 square km area in a domain of 1000 x 1000 x 60 km.
Figure 2a: Finite element mesh LandersGap64 for the GeoFEST Milestone F code improvement problem. Colors indicate partitioning among processors (limited to 16 partitions in this image for clarity; 64 processors were actually used). Partitions cluster near the domain center due to the high mesh density used near the faults. Note that the vertical scale differs from the horizontal scales.
Figure 2b: Fault
segments and surface nodes for LandersGap64 mesh, center region.
The top-level web site for the QuakeSim task is at http://quakesim.jpl.nasa.gov. Source code for the GeoFEST code may be found at http://quakesim.jpl.nasa.gov/download.html. Files required for the GeoFEST baseline cases (Milestone E) may be found at http://quakesim.jpl.nasa.gov/milestones.html.
PARK is a model for unstable slip on a single earthquake fault. Because it aims to capture the instability, it is designed to represent the slip on a fault at many scales, and to capture the developing seismic slip details over an extraordinary range of time scales (sub-seconds to decades). Its simulation of the evolution of fault rupture is the most realistic of the tools in QuakeSim for that scale, demonstrating the multi-scale approach of this project. When transformed into an efficient parallel simulation, it will be a powerful tool of choice for researchers seeking to determine the nature and detectability of earthquake warning signals such as surface strains and patterns of microseismicity.
In a typical application the PARK code computes the history of slip, slip velocity, and stress on a vertical strike-slip fault, using state-of-the-art rate and state frictional constitutive laws on the fault; the current application is for a specific geographic setting at Parkfield, California. The boundary conditions are those appropriate for Parkfield, and the distribution of constitutive properties on the fault zone is as realistic as our ability to characterize the subsurface properties of the fault there allows. The methods developed in solving this problem can be generalized to other geologic settings in which the fault geometry and the boundary conditions are not so simple and multiple faults are involved.
The fault is represented by a rectangular grid, with highly variable mesh density (see Figure 1). The main program is a boundary element program that determines the stress on every element of the fault surface due to slip on every other element, using a Green's function approach. The fault constitutive law is used to determine what the slip velocity will be for that stress. That velocity multiplied by the time step gives the slip used to calculate the stress in the next time increment. This involves the forward time integration of coupled ordinary differential equations.
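This update can be summarized in a short sketch. It is an illustration only, not the PARK source (which is Fortran 90 and, as described below, uses an adaptive Runge-Kutta integrator rather than the fixed explicit step shown); the greens array and the slip_velocity callback are hypothetical names.

    /* Illustrative only: one explicit update of the boundary-element model.
       greens[i][j] is the stress at element i per unit slip on element j. */
    #include <stddef.h>

    void explicit_step(size_t n, const double *const *greens,
                       double *slip, double *stress, double dt,
                       double (*slip_velocity)(double stress, size_t i))
    {
        /* 1. Stress on every element from slip on every element
              (direct O(N^2) sum; the multipole library replaces this). */
        for (size_t i = 0; i < n; i++) {
            double s = 0.0;
            for (size_t j = 0; j < n; j++)
                s += greens[i][j] * slip[j];
            stress[i] = s;
        }
        /* 2. Constitutive (rate-and-state) law gives the slip velocity for
              that stress; velocity times the time step gives the new slip. */
        for (size_t i = 0; i < n; i++)
            slip[i] += slip_velocity(stress[i], i) * dt;
    }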
Numerically, this is the first earthquake simulation code to seek and achieve enhanced scalability and speed by employing a multipole technique (documented below). The multipole experience gained here will also be transferable to the Virtual California code and other boundary element simulations. The power of massively parallel computing is required for this problem in order to support the many small slip patch elements needed to cover the nucleation scale that initiates the instability.
The integration is done with a fifth-order Runge-Kutta[1] scheme with adaptive step-size control. Because the time steps range over ten orders of magnitude, depending on whether the fault is slipping very slowly in the interseismic period or very fast during an earthquake, the adaptive step-size control is an essential element in the solution.
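A minimal sketch of the step-size controller idea follows (ours, with hypothetical names; the actual code uses fifth-order Runge-Kutta routines adapted from Numerical Recipes, as noted in the documentation section below). An embedded pair yields a local error estimate err, and the controller shrinks or grows the step h to hold err near the tolerance:

    #include <math.h>

    /* Return the next trial step size for a fifth-order method, given the
       estimated local error of the step just taken and the tolerance. */
    double next_step(double h, double err, double tol)
    {
        const double safety = 0.9;                   /* stay below the estimate  */
        double scale = safety * pow(tol / err, 0.2); /* ~(tol/err)^(1/5)         */
        if (scale < 0.1) scale = 0.1;                /* limit shrinkage per step */
        if (scale > 5.0) scale = 5.0;                /* limit growth per step    */
        return h * scale;
    }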
The main program calls a variety of subroutines, and the subroutine that calculates the derivatives used in the forward time integration itself calls a Fast Multipole library that is suitable for such Green's function problems. The multipole approach allows a number of computations to scale as N log N rather than N^2, as would otherwise be the case. The particular Fast Multipole approach used allows determination of the degree of grouping of the remote cells based on an analytical approximation to the Green's function. To reduce computation time, it also renumbers the elements so that those that are near in space are also near in memory.
The flow of the program and the functions of its main routines are described in the file Code_Description.doc, posted at http://quakesim.jpl.nasa.gov/milestones and also http://quakesim.jpl.nasa.gov/documentation. The main program and most of its subroutines are written in Fortran 90. These programs were converted to use MPI to run in parallel for the First Code Improvement Milestone.
In brief, the main program reads inputs and sets up the problem, carries out a time-stepping loop that integrates the slip and stress field forward in time, and writes a summary and closes files at the end. The dominant task is a function "DERIVS," which uses the fault constitutive relationship to determine slip velocity based on local stress, includes effects of shear wave speed and radiation damping, and uses the multipole library to sum the stress contributions of the slip on all fault elements to determine the local stress. Within a single time step the objective is to integrate the history using variable-step Runge-Kutta based on the stress-velocity interactions. In each time step "DERIVS" is called a total of six times, once initially by the main program and five more times by the integrator, resulting in a fifth-order Runge-Kutta integration. "DERIVS" relies on the multipole code: it calls "sumtree" to gather all the contributions to local stress, and then determines the local stress and slip temporal derivatives from that. Finally it calls "forgettree" to clear the multipole structure.
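The call structure of "DERIVS" can be sketched as follows. This is a hypothetical C rendering for illustration (the real routine is Fortran 90); sumtree and forgettree are the multipole-library entry points named above, and slip_velocity stands for the rate-and-state friction evaluation with radiation damping.

    /* External routines assumed from the text; signatures are illustrative. */
    extern void sumtree(int n, const double *slip, double *stress);
    extern void forgettree(void);
    extern double slip_velocity(double stress, int i);

    /* Called six times per time step (once by the main program, five
       times by the fifth-order Runge-Kutta integrator). */
    void derivs(int n, const double *slip, double *stress, double *dslip_dt)
    {
        sumtree(n, slip, stress);   /* multipole sum of all slip contributions */
        for (int i = 0; i < n; i++)
            dslip_dt[i] = slip_velocity(stress[i], i);
        forgettree();               /* clear the multipole structure */
    }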
In the parallel version, a single processor reads the geometry and partitions it among the remaining processors, sending a portion of the geometry and the local properties to each one. Within the loop over time steps, "DERIVS" uses the multipole functions as in the sequential code, except that "sumtree" is designed for parallel use and has internal MPI calls.
The work in the problem is dominated by the function computing the stress interactions between fault elements. This interaction is made fast and efficient by using a proven parallel Fast Multipole library; parallel decomposition was added to portions of the remaining code in order to make good use of this parallel library. We link the Fast Multipole library of Salmon and Warren (Salmon and Warren, 1997; Warren and Salmon, 1997), which is written in parallel using MPI.
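The distribution step described above can be illustrated with a small MPI sketch (ours, not the PARK source; the real code distributes geometry and several constitutive properties per element, while this sketch scatters a single value per element):

    #include <mpi.h>
    #include <stdlib.h>

    /* Rank 0 holds all_props for n_total elements; every rank receives
       its contiguous block in *my_props (length *my_n). */
    void distribute(int n_total, const double *all_props,
                    double **my_props, int *my_n)
    {
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int *counts = malloc(size * sizeof(int));
        int *displs = malloc(size * sizeof(int));
        for (int r = 0, off = 0; r < size; r++) {
            counts[r] = n_total / size + (r < n_total % size ? 1 : 0);
            displs[r] = off;
            off += counts[r];
        }
        *my_n = counts[rank];
        *my_props = malloc(*my_n * sizeof(double));
        MPI_Scatterv(all_props, counts, displs, MPI_DOUBLE,
                     *my_props, *my_n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        free(counts);
        free(displs);
    }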
Code and documentation can be found in two places: a web site (http://www.servogrid.org/slide/GEM/PARK/) and on turing and chapman, two of the SGI Origin 3000s at NASA Ames. The web site is the only public location. The machine turing is the front end to the machine chapman, on which the First Code Improvement Milestone was run. For the purposes of NASA's verifying that the First Code Improvement run is as described in the Milestone_Certification_Data file, and repeating the run if desired, it may be easier to use the copies of the documentation, files, etc. that are located on turing or chapman, since no decompression, untarring, or compilation is needed there. In addition, because two copyrighted (but easily and inexpensively available) subroutines are used that cannot be posted on the public web site, it is also easier to verify the behavior on the NASA Ames machines, where these subroutines are available in the src-bin subdirectory of the /u/tullis/1st_Code_Improv_Milestone directory. Verification can also be done by the public or by NASA officials by using materials taken from the web site and by purchasing the subroutines if one does not already have them. The directory structure on turing, on chapman, and at the web site is the same, to make it easier to compare each to the others.
Files needed to run the first improvement problem are described below. All of the necessary material that describes the First Code Improvement Milestone, with instructions that allow one to duplicate it, can be found within the appropriately named subdirectories under the 1st_Code_Improv_Milestone directory.
Included in the "in" and "out" directories are all the materials from the First Code Improvement Milestone run with 150,000 elements and 256 processors for 5,000 time steps. For code testing purposes on one's own system it is useful to set the number of time steps in the prk.dat.150003 file to a smaller number than 5,000 for the initial run; even 2 would be reasonable for the first run.
The materials in these directories include:
Milestone_Certification_Data.txt - a file that gives the time required for the First Code Improvement run and describes various parameters of the run.
README-setting_up_input_files.txt - a file that tells one how to understand the input files, including an explanation of how the elements are created from the input files.
README-Compile.txt - a file that tells how to create both the multipole library and the PARK fault files using the appropriate Makefiles.
in - a directory that contains the input files that were used in the First Code Improvement run.
out - a directory that contains the output files that were generated in the First Code Improvement run.
src-bin - a directory that contains the PARK and related fault application files used in the First Code Improvement run. The versions of this directory on turing and chapman also have the object files and the executable binary file (named park).
Downloads - a directory that contains two unix-compressed tar files, PARK_Package_1st_Improv.tar.Z and PARK_Package_1st_Improv_NR.tar.Z, that allow one to generate the files needed for the First Code Improvement run. The file PARK_Package_1st_Improv_NR.tar.Z contains all the programs that are needed, including two copyrighted subroutines from the Numerical Recipes book. This version is only on turing, not on the public web site. It can be copied by NASA officials for verification of the program behavior, and saves one making the additions needed for the other version, PARK_Package_1st_Improv.tar.Z, which does not have the two Numerical Recipes subroutines. See either the README-src-bin.txt file in the src-bin directory or the header of the park.f file to learn what needs to be done to create these Numerical Recipes subroutines. Except for the presence or absence of these two subroutines, both of these tar files will create the multipole library, the source files for the PARK fault application, and the input files for the First Code Improvement Milestone run.
The tar files were created on turing in the following way. The libsw.a and mpmy_seq.o files were moved from the t17-7/Objfiles/IRIX64 directory to a temporary location. The object files and the executable file (park) were moved from the src-bin directory to another temporary location. For creating PARK_Package_1st_Improv.tar.Z, the numrec.f file containing the two Numerical Recipes subroutines was also moved to a temporary location, whereas for creating PARK_Package_1st_Improv_NR.tar.Z this file was not removed. Then, from the home directory, the following commands were issued to produce the versions without and with the Numerical Recipes subroutines:
tar -cvf PARK_Package_1st_Improv.tar 1st_Code_Improv_Milestone
or
tar -cvf PARK_Package_1st_Improv_NR.tar 1st_Code_Improv_Milestone
Then the .tar files were compressed by issuing
compress PARK_Package_1st_Improv.tar
and
compress PARK_Package_1st_Improv_NR.tar
to produce the PARK_Package_1st_Improv.tar.Z and PARK_Package_1st_Improv_NR.tar.Z files.
In the directory scaling, found in the same 1st_Code_Improv_Milestone directory, there are a number of files that show how the job scales with both the number of elements and the number of processors. Thirty-six scaling runs were done, involving all combinations of the number of elements (712, 5,292, 15,000, and 150,000) and the number of processors (1, 2, 4, 8, 16, 32, 64, 128, and 256) (Table 2, Figure 3). All these scaling runs were run for 100 time steps, whereas the full 150,000 element, 256 processor First Code Improvement Milestone run was done for 5,000 time steps. In the scaling directory is a data table giving the walltime for all 36 scaling runs, as well as five plots showing the dependence of walltime, speedup, efficiency, and overhead on the number of elements and the number of processors.
Table 2: Problem sizes and run times of PARK on chapman using several different numbers of processors. For this scaling study, the milestone problem was reduced to 100 time steps (rather than 5,000). These numbers form the basis for all the plots shown in Figure 3.
Figure 3 (a, b, c): Scaling behavior for PARK on chapman (Table 2), presented in several ways: walltime, speedup, efficiency, and overhead as functions of problem size and number of processors.
The scaling data show that not much speedup is gained by going from one to two processors. Further work to understand this behavior could gain us nearly a factor of two in efficiency in future runs. For the largest job (150,000 elements), the efficiency and overhead are nearly constant from 2 to 8 processors. Efficiency falls off between 8 and 16 processors and is constant from 16 to 32 processors. For 64, 128, and 256 processors, efficiency falls off gradually. This falloff is presumably due to an insufficient number of elements per processor (2,343, 1,171, and 585, respectively, as the plots show). The effect is seen even more dramatically for the jobs with smaller numbers of elements because, as the number of processors increases, the number of elements per processor gets so small that a large amount of time is spent communicating between processors. The falloff between 8 and 16 processors on the 150,000-element problem suggests that the optimum number of elements per processor may be about 20,000.
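For reference, the plotted quantities follow the standard definitions (our statement of them; the report does not spell them out). With T(p) the walltime on p processors,

\[ S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p}, \qquad \mathrm{overhead}(p) = \frac{1}{E(p)} - 1, \]

so ideal scaling corresponds to S(p) = p, E(p) = 1, and zero overhead.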
Achieving the First Code Improvement Milestone is significant because it opens the way to run problems of significant size, now that the earthquake code as well as the Fast Multipole library run in parallel under MPI. For the first time it presents to the scientific community fast parallel codes that allow creating simulations of the entire earthquake cycle on a fault in a 3D model, using the most accurate description of fault friction (rate and state friction) and the quasi-dynamic radiation damping approximation to full elastodynamics. We now have the potential for greatly increasing the number of elements that can be included in the model over what could be done in the past.
Enough elements can now be used that it is possible to represent a reasonably sized fault with elements small enough to properly represent the behavior of a continuum. Larger numbers of elements also allow the occurrence in the simulation of earthquakes with a large range of sizes. It will now be possible to study in the simulations in what situations small earthquakes occur in isolation and in what situations they may cascade or grow into larger ones. It is currently not understood what causes small earthquakes to grow into large ones or to stop as small events. Hence, these new simulations should be key to understanding earthquake rupture processes.
This could help gain an understanding of whether patterns of microseismicity might be used to help predict earthquakes. The attainment of this milestone not only represents an advance in our computational ability to simulate earthquakes, but will allow us to understand the earthquake process better by creating simulated data sets that can be compared with data on real earthquakes. The attainment of the next milestone (Second Code Improvement) will involve increasing the efficiency of the code in other ways, now that the parallel implementation has been achieved; this will allow even larger and more realistic simulations to be run.
Computationally, we note this is the first simulation of an earthquake fault using the fast multipole technique, via a library originally developed for astrophysical, gravitationally interacting bodies. As shown in Figure 4, this technique leads to enormous savings over traditional full-interaction methods. Note that the multipole method results in an improvement in scaling on a single processor, and that performance is further increased by employing additional processors (for sufficiently large problems).
Figure 4: PARK: Scaling of time required per time step using the traditional N^2 full-interaction method, the multipole method on one processor, and the parallel improvement found with additional processors (up to 256).
Chapman is described at http://www.nas.nasa.gov/About/Profile/resources.html. Particulars relevant to this run (current as of November 2003) are: Chapman, an SGI Origin 3000, is currently the only 1,024-processor single-image, shared-memory system in existence, with one operating system and a single address space. Chapman will become a major component of the IPG (Information Power Grid), and is currently being used to demonstrate that applications can scale to 1,024 processors on this machine. The system has 128 gigabytes of main memory and 2 terabytes of FC RAID disk storage.
Salmon, John K., and Michael S. Warren. Parallel out-of-core methods for N-body simulation. In Michael Heath, Virginia Torczon, et al., editors, Eighth SIAM Conference on Parallel Processing for Scientific Computing, SIAM, 1997.
Warren, Michael S., John K. Salmon, Donald J. Becker, M. Patrick Goda, Thomas Sterling, and Gregoire S. Winckelmans. Pentium Pro Inside: I. A treecode at 430 Gflops on ASCI Red; II. Price/performance of $50/Mflop on Loki and Hyglac. In Supercomputing '97, Los Alamos, 1997, IEEE Comp. Soc.
GeoFEST simulates stress
evolution, fault slip and plastic/elastic processes in realistic materials. The
products of such simulations are synthetic observable time-dependent surface
deformation on scales from days to decades. Scientific applications of the code
include the modeling of static and transient co- and post-seismic Earth
deformation, Earth response to glacial, atmospheric and hydrological loading,
and other scenarios involving the bulk deformation of geologic media.
Diverse types of synthetic
observations will enable a wide range of data assimilation and inversion
techniques for ferreting out subsurface structure and stress history. In the short term, such a tool allows
rigorous comparisons of competing models for interseismic stress evolution, and
the sequential GeoFEST system is being used for this at JPL and UC Davis. Parallel implementation is required to go
from local, single-event models to regional models that cover many earthquake events
and cycles.
GeoFEST
uses stress-displacement finite elements to model stress and flow in a
realistic model of the Earth's crust and upper mantle in a complex region such
as the Los Angeles Basin. The model includes stress and strain due to the
elastic response to an earthquake event in the region of the slipping fault,
the time-dependent viscoelastic relaxation, and the net effects from a series
of earthquakes. The physical domain may be two or three dimensional and may
contain heterogeneous materials and an arbitrary network of faults. The physics models
supported by the code include isotropic linear elasticity and both Newtonian
and power-law viscoelasticity via implicit/explicit quasi-static time stepping.
In addition to triangular, quadrilateral, tetrahedral and hexahedral continuum
elements, the program supports split-node faulting, body forces and surface
tractions.
The GeoFEST problem geometry for Milestone G is shown in GeoFEST Figure 1 below. Elastic and viscoelastic deformation is simulated for 16 million finite elements, for a duration of 1000 time steps. The significant increase is in the number of elements, which was about 55,000 for the baseline case and 1.4 million for the Milestone F code improvement (both of which also ran 1000 time steps). Note that the baseline case was modeled on the 1994 Northridge earthquake, using a single fault of 300 square km in a domain of 240 x 240 x 100 km. The Milestone F demonstration case is based on the 1992 Landers event, using three closely arranged faults within an 865 square km area in a domain of 1000 x 1000 x 60 km. The Milestone G demonstration is still based on the 1992 Landers event, but now has higher mesh density near the faults and 200 km depth, to reduce boundary effects for the viscoelastic relaxation.
GeoFEST Figure 1: Finite element mesh LandersDeep for the GeoFEST Milestone G code improvement problem. Colors indicate partitioning among processors (limited to 16 partitions in this image for clarity, although 490 processors were actually used). Partitions cluster near the domain center due to the high mesh density used near the faults.
GeoFEST Figure 2: Top view (X-Y surface plane) of the LandersDeep mesh of GeoFEST Figure 1, zoomed in on the central portion to provide detail on the region of active, slipping faults. Since the number of elements per partition is roughly equal, smaller regions represent areas of highest element density, where strain energy is largest near the faults. Every partition shown in the image, independent of color, is uniquely owned by a specific processor; for example, each of the four red partitions belongs to a distinct processor.
GeoFEST Figure 3: Detail of top view (X-Y surface plane) of the 16 million element LandersDeep mesh created by strain-energy solution-adaptive refinement using PYRAMID. The close approach of two neighboring faults is shown. The previous code improvement (Milestone F) used the GuiVISCO sequential mesh generator to produce a ~1.4 million element mesh; the input mesh for Milestone G, also created with that tool, was limited to nearly 10 million elements. The mesh shown above required use of the parallel adaptive approach.
GeoFEST Figure 4: Top view of vertical displacement of the elastic response of the region to the Landers earthquake, based on the Milestone G simulation using the LandersDeep mesh.
GeoFEST Figure 5: GeoFEST simulated surface displacement from the coseismic Landers model, displayed as InSAR fringes.
GeoFEST Figure 6: GeoFEST simulated postseismic surface displacement from the Landers model after 500 years of viscoelastic relaxation (at the end of the GeoFEST Milestone G case of Table 1). The color scale of the InSAR fringes is that of GeoFEST Figure 5.
GeoFEST reads in a
tetrahedral mesh and information on boundary conditions, faulting events and
variations in time. The equilibrium conditions are computed based on
solution of the elastostatic equations through the finite element method. For
the Milestone G computation, a metric of strain energy per element is used to
indicate where the mesh needs to be refined. PYRAMID produces a refined mesh
(partitioned among the processors), which becomes the basis for a new
equilibrium elastic solution and the subsequent simulation. Then
viscoelastic evolution of the stress field is computed based on an implicit
technique applied on a series of time steps. User-indicated parts of the
displacement and stress fields are collected to a single ordered file at
requested time steps.
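The Milestone G processing sequence just described can be sketched as follows. This is an illustrative outline only; the function names are hypothetical and do not reflect GeoFEST's actual API.

    typedef struct Mesh Mesh;                /* opaque mesh handle (illustrative) */
    extern Mesh *read_mesh_and_bcs(const char *file);
    extern void elastic_solve(Mesh *m);      /* FEM equilibrium via DPCG      */
    extern void mark_by_strain_energy(Mesh *m);
    extern Mesh *pyramid_refine(Mesh *m);    /* parallel AMR + repartitioning */
    extern void viscoelastic_step(Mesh *m);  /* implicit time step via DPCG   */
    extern void write_fields(Mesh *m, int step);

    void simulate(int n_steps)
    {
        Mesh *m = read_mesh_and_bcs("input.dat");
        elastic_solve(m);                    /* initial equilibrium           */
        mark_by_strain_energy(m);            /* flag elements for refinement  */
        m = pyramid_refine(m);
        elastic_solve(m);                    /* re-solve on the refined mesh  */
        for (int step = 0; step < n_steps; step++) {
            viscoelastic_step(m);
            write_fields(m, step);           /* merged output at chosen steps */
        }
    }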
Details are found in the GeoFEST User's Guide, which is posted at http://www.openchannelsoftware.org/projects/GeoFEST and http://quakesim.jpl.nasa.gov/GeoFEST_User_Guide.pdf.
In brief, the elastostatic solution is computed using standard elastostatic finite elements, as found in (e.g.) Hughes (2000). The sparse positive definite system of equations is solved by the diagonally preconditioned conjugate gradient (DPCG) method.
For the viscoelastic time steps we form a modified stiffness matrix and right-hand-side terms according to the method of Hughes and Taylor (1978), which again results in a positive definite system. Each time-step solution is found by the same DPCG function.
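A serial sketch of the DPCG algorithm follows. This is a textbook formulation, not the GeoFEST source; matvec stands for the sparse stiffness-matrix product, and in the parallel code the matrix-vector product and the dot products are the distributed operations discussed below.

    #include <math.h>
    #include <stdlib.h>

    /* Solve A x = b with Jacobi (diagonal) preconditioning.
       matvec(x, y) computes y = A*x; diag holds the diagonal of A. */
    void dpcg(int n, void (*matvec)(const double *x, double *y),
              const double *diag, const double *b, double *x,
              double tol, int max_it)
    {
        double *r = malloc(n * sizeof *r), *z = malloc(n * sizeof *z);
        double *p = malloc(n * sizeof *p), *q = malloc(n * sizeof *q);

        matvec(x, q);                                    /* r = b - A x */
        double rz = 0.0;
        for (int i = 0; i < n; i++) {
            r[i] = b[i] - q[i];
            z[i] = r[i] / diag[i];                       /* precondition */
            p[i] = z[i];
            rz += r[i] * z[i];
        }
        for (int it = 0; it < max_it; it++) {
            matvec(p, q);
            double pq = 0.0;
            for (int i = 0; i < n; i++) pq += p[i] * q[i];
            double alpha = rz / pq;                      /* step length */
            double rr = 0.0;
            for (int i = 0; i < n; i++) {
                x[i] += alpha * p[i];
                r[i] -= alpha * q[i];
                rr += r[i] * r[i];
            }
            if (sqrt(rr) < tol) break;                   /* converged */
            double rz_new = 0.0;
            for (int i = 0; i < n; i++) {
                z[i] = r[i] / diag[i];
                rz_new += r[i] * z[i];
            }
            double beta = rz_new / rz;                   /* direction update */
            rz = rz_new;
            for (int i = 0; i < n; i++) p[i] = z[i] + beta * p[i];
        }
        free(r); free(z); free(p); free(q);
    }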
Slip on faults is accommodated by a split-node technique (Melosh and Raefsky, 1981) that modifies the right-hand side of the matrix system at nodes local to the fault that slips.
We have linked GeoFEST to the PYRAMID library (http://www.openchannelsoftware.org/projects/PYRAMID) and rely on PYRAMID functions for parallel mesh refinement, as well as parallel domain partitioning and communication between nodes. The initial mesh (constructed according to the User's Guide) and a derived connectivity file (produced from the initial mesh by the application gfmeshparse, available with the GeoFEST software) are read by GeoFEST, and mesh structures are passed to PYRAMID functions. PYRAMID is used for partition information, resulting in each processor having a compact segment of the finite element domain for its formation of the local part of the stiffness matrix and solution. PYRAMID is also used for combining these local solution estimates and conjugate gradient vectors at each iteration within each time step, using the "globalize" function that combines the local information and ensures each processor has valid data. When used with mesh refinement enabled, GeoFEST uses a strain energy per element threshold to mark the elements requiring refinement. All node- and element-based mesh properties (including split nodes) are written to PYRAMID attributes for preservation through the refinement process. PYRAMID refines the mesh and passes the mapped attributes back to GeoFEST, which updates its structures to represent the problem on the new mesh. The new mesh is used to solve the problem again at the current time step, and is the basis for the subsequent simulation steps. When the output displacement and stress fields are required, a final merging within PYRAMID allows GeoFEST to write a single result file with every node and element represented correctly and uniquely.
Performance is dominated
by the elastic and viscoelastic iterative matrix solution. The scaling of
this parallel iterative solution is explained in terms of component operations
in the Milestone F report. PYRAMID refinement does not significantly affect scaling
when done once, early in the simulation, as it is in the Milestone G
computation. Overall scaling is shown
in Figure 7 (the GeoFEST scaling plot).
For Milestone F, linking with PYRAMID resulted in one change in processing, and some changes in the code. There is now a preprocessing step before GeoFEST is run on parallel machines: a command-line invocation of the application gfmeshparse, which derives a full connectivity description from the GeoFEST input file for PYRAMID's use. Note that in the GeoFEST-4.3P download, this application is called "meshgen."
Figure 5 shows how time is spent on four processors
(stacked vertically) during the Conjugate Gradient iterations. Two iterations are shown, with blank (black)
areas indicating the domination of numerical calculation. Red shows the “WAITALL” state that results
when some processors finish local operations earlier than others and the
matrix-vector product is combined across processors that have adjacent
partition information (arrows show communication among these partition-adjacent
processors). Violet indicates the “ALLREDUCE”
function of parallel communication that is required to globally combine parts
of a vector dot product.
Beyond the interest in seeing the fingerprint of a
parallel conjugate gradient operation, this image shows the dominance of
computation over communication time (black >> colors), and the acceptable
(but not perfect) load balance of the partitioned work.
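The ALLREDUCE phase corresponds to a distributed dot product, sketched below (an illustration, not the GeoFEST source; it assumes each degree of freedom is owned by exactly one processor so that the partial sums add correctly):

    #include <mpi.h>

    /* Each processor sums over its own partition, then all partial sums
       are combined so every processor holds the full dot product. */
    double dist_dot(int n_local, const double *a, const double *b)
    {
        double local = 0.0, global = 0.0;
        for (int i = 0; i < n_local; i++)
            local += a[i] * b[i];
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        return global;
    }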
Figure 5: GeoFEST: Fine-detail image (captured with the Jumpshot performance library) of the portion of a GeoFEST run that represents most of the computer time, indicating good parallel performance. Four processors are represented along the vertical axis, and time is represented along the horizontal axis (a 50 millisecond time clip is shown). Two iterations of the Conjugate Gradient algorithm, of the thousands of iterations making up this simulation, are shown. Three features indicate this algorithm will scale to very large problems: 1) the computational load (thin horizontal turquoise line on black background) is about the same for each processor; 2) the time spent in synchronization and communication (red and violet) is a small fraction of the total; the white arrows indicate the inter-processor communication paths among processors where such communication occurs only as needed (this explains why some processors spend more time in synchronization than others even though the fraction of time is small); and 3) (not visible in this plot) the fraction of time spent in communication does not grow when the problem size grows proportionally to the number of processors used.
Directions for running the Milestone G test case are below.
GeoFEST processing for the benchmark case naturally divides into three phases:
· Input, including reading the mesh file and creating the mesh partition data for each processor;
· Elastic solution and mesh refinement, where the initial equilibrium state is computed;
· Viscoelastic evolution, where a sequence of time steps follows the relaxation of shear stresses in viscous material.
Figure 6 shows where time is spent for the Milestone G case (490 processors, 16 million elements, 1,000 time steps)[2]. The preponderance of time is taken up by the viscoelastic time-step portion of the code. It is clear that unless many more processors are used (or many fewer time steps of simulation), the scaling of the viscoelastic steps determines the time one may expect for a simulation to complete.
Code | Input Time (seconds) | Elastic Time (seconds) | Viscoelastic Time (seconds)
GeoFEST Milestone F | 186.2 | 3.5 | 7,247.6
GeoFEST Milestone G | 1,273.8 | 169.7 (Elastic/Refine/Elastic) | 41,838.0
Table 2: The Milestone G case has a proportionally larger input time. This confirms our original design concept: without adaptive mesh refinement, we would eventually be dominated by input time, because input is sequential. We could have started with a mesh like the 1.4 million element Milestone F mesh and reduced our input time to about the same (186 s); that would increase our Elastic/Refine/Elastic time (to reach 16 M elements), but probably not by more than a few hundred seconds. Such tradeoffs will be evaluated in future work.
A single Viscoelastic time step consists of construction of a stiffness matrix (and right-hand side), followed by solution of this matrix system by the parallel conjugate gradient technique (many iterative steps). So the Viscoelastic segment of Figure 6 represents 1,000 time steps, each consisting of roughly 200 iterations of the sparse solver. One iterative step consists chiefly of a sparse distributed matrix-vector multiply and three distributed vector dot products (plus some scalar-vector operations that may be neglected).
We show in Table 3 the time for completing the time-step portion for several short (5 time step) runs of varying size, on Thunderhead. Problem sizes were constructed to approximate a constant number of elements on each processor for 4, 16, and 64 processors. Figure 7 shows the speed of solution for these cases; "work" is taken to be the operations involved in the iterative steps. This plot shows that communication overhead is not growing faster than computational time as we scale the problem size with the number of processors. The result is excellent scaling: problems with more than 4,000 elements per processor see negligible parallel overhead, and even smaller problems show substantial speedup with many processors.
Figure
7 demonstrates that the GeoFEST scaling (on the dominating time-stepping loop)
is excellent. Sufficiently large
problems (such as the 16M element case) have trivial parallel overhead on 490
processors.
The operations required for construction of the stiffness matrix are small. If we represent the number of elements as Nelts and the number of solver iterations as Niters, stiffness construction requires roughly 3,500 x Nelts = ~5 billion operations for Nelts = 1.4 million, compared to the roughly 300 x Niters x Nelts = ~80 billion operations required for a Niters = ~200 solution.
GeoFEST Figure 7: Scaling of work (on linear scales) in the GeoFEST time-step function with number of processors, for three sizes of problems (on the Thunderhead cluster computer, GSFC) plus two very large problems (on the Cosmos cluster computer, NASA JPL). Blue indicates ideal scaling (from 4 processors). Expressing work as operations per unit wallclock time, scaled by processor speed, allows comparison of sizes (and platforms) in a single plot.
The Thunderhead problems likewise scale well: sufficiently large problems (353,000 and especially 1.4 million elements) have trivial parallel overhead on 64 processors.
More complete scaling analysis considers the time for problem set-up, elastic solution, and writing results to files. Benchmark-type problems with 1,000 time steps are dominated by the scaling behavior shown above, as for the example problem (1.4 million elements, 64 processors).
GeoFEST has been demonstrated using adaptive mesh refinement based on an element-wise strain energy metric, which concentrates solution resources in the parts of the domain where high resolution is required. Because approximate solutions determine the final mesh structure, this allows higher-accuracy solutions for larger domain sizes and greater faulting complexity for any available computer resource. This is a substantial and necessary step in allowing simulations of surface deformation in active tectonic areas, in that it allows accurate solutions on regional scales (such as Southern California), well beyond the scales of single fault segments used in past studies.
The baseline problem was modeled after the Northridge earthquake (single thrust fault). Analysis of individual earthquake events with attention to their geographic settings is a long-term, important area of simulation and research. But the ability to solve domains with millions of elements implies we can simulate regions with multiple faults, such as the Los Angeles basin. Interactions among slipping faults, and possible emergent structures from these nonlinear interactions, appear to be the next advance in forecasting earthquake risk. This is a new and promising area for simulation, data comparison, and testing of concepts. Many kinds of simulation codes are beginning to be employed for this new kind of work across the earthquake community. But a finite element code has close to the greatest degree of flexibility in including the effects of realistic structures in the Earth in a heterogeneous domain. So we expect this improved GeoFEST will have unique value in validating these other simulations and determining when other multiple-fault models leave out too many material effects.
Computationally, we have demonstrated efficient parallel Conjugate Gradient solutions for 3-D faulted-system finite elements, and linked our method with the PYRAMID library. The parallel Conjugate Gradient is no surprise, as it has been demonstrated in other domains of physics, but a freely available source code for faulted domains will be helpful to the US simulation effort. Linking with PYRAMID is a convenient way to handle issues of partitioning and communication. More important, it is a first step to using the PYRAMID functions for parallel adaptive mesh refinement.
Adaptive mesh refinement is essential to attaining high-quality solutions to problems with widely varying stress intensities in three dimensions. Parallel mesh refinement has the additional advantage of solving problems of a size commensurate with the memory space of massively parallel computers, without handling the associated mesh files with solid meshing programs, which are nearly all written for sequential machines. A domain can be described and meshed at a relatively low mesh density, imported to the parallel system, and then key locations in the mesh can be refined to the degree needed, expanding the number of elements by large factors (possibly hundreds or more).
The GeoFEST
code and milestone input file are available on the Goddard Space Flight Center
machine Thunderhead. Building the code on Thunderhead is accomplished by following these steps (a consolidated script follows this list):
Note: use the Intel compiler by typing "mpi_env -p intel".
· Get the code from ~nortonc/Milestone/GeoFEST.tgz
· "tar xvzf GeoFEST.tgz" in your home directory ~/
· "tcsh" to use tcsh
· cd ~/GeoFEST/Pyramid-3D/Pyramid-3D/ParMetis/ and type "make"
· cd ~/GeoFEST/Pyramid-3D/ and type "make -f Intel"
· cd ~/GeoFEST/geofest/ and type "make -f Intel"
· Type "exit" to leave tcsh
This results in an executable program called GeoFEST in
~/GeoFEST/geofest.
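For convenience, the sequence above can be captured in a single script. The following is a minimal sketch (not part of the distributed code) that assumes the paths and makefile names given above:

#!/bin/tcsh -e
# Sketch of the Thunderhead build sequence described above; -e stops
# at the first failure. Assumes GeoFEST.tgz has already been copied
# from ~nortonc/Milestone/ to the home directory.
# Note: if mpi_env only affects the login shell, run it before this script.
mpi_env -p intel                                 # select the Intel compiler
cd ~ ; tar xvzf GeoFEST.tgz                      # unpack into ~/GeoFEST
cd ~/GeoFEST/Pyramid-3D/Pyramid-3D/ParMetis ; make
cd ~/GeoFEST/Pyramid-3D ; make -f Intel
cd ~/GeoFEST/geofest ; make -f Intel             # executable: ./GeoFEST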
Follow these steps to run the code on Thunderhead.
· Start the LACE task manager with "ltmsuper"
· Move the executable to your run directory:
cp ~/GeoFEST/geofest/GeoFEST ~/rhome/GeoFEST/geofest/GeoFEST
· Set up input (input.dat and input.dat.jpl; cases exist for 4, 16 and 64 processors):
cd ~/rhome/GeoFEST/geofest/
ln -s /data1/nortonc/Quakesim/GeoFEST/geofest/Visuals/LandersGr4.dat input.dat
ln -s /data1/nortonc/Quakesim/GeoFEST/geofest/Visuals/LandersGap4.dat.jpl input.dat.jpl
The 16-processor case files are LandersGr16.dat and LandersGap16.dat.jpl.
The 64-processor case files are LandersGr64.dat and LandersGap64.dat.jpl.
· Request processors and run the code:
cd ~/GeoFEST/geofest/
ltmbegin -n 4 -m 120
ltmpi -p 1 GeoFEST input.dat /data1/<your loginid>/<output directory>
OR
ltmpi -p 1 GeoFEST input.dat /data1/<your loginid>/<output directory> >& /data1/<your loginid>/OUT.log
Note: -n means number of processors
-m means number of minutes requested
-p 1 means use one CPU per node (rather than 2 CPUs per node, which is the default)
For the 16-processor case use -n 16 -m 300
For the 64-processor case use -n 64 -m 360
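Putting the pieces together, a complete 4-processor session looks like the following sketch. The LACE task manager commands behave as described above; <your loginid> and <output directory> are placeholders to fill in, and the output directory must already exist:

ltmsuper                                  # start the LACE task manager
cp ~/GeoFEST/geofest/GeoFEST ~/rhome/GeoFEST/geofest/GeoFEST
cd ~/rhome/GeoFEST/geofest/
ln -s /data1/nortonc/Quakesim/GeoFEST/geofest/Visuals/LandersGr4.dat input.dat
ln -s /data1/nortonc/Quakesim/GeoFEST/geofest/Visuals/LandersGap4.dat.jpl input.dat.jpl
cd ~/GeoFEST/geofest/
ltmbegin -n 4 -m 120                      # request 4 CPUs for 120 minutes
ltmpi -p 1 GeoFEST input.dat /data1/<your loginid>/<output directory> >& /data1/<your loginid>/OUT.log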
Follow these steps to interpret the results.
When the code runs, some
welcome and diagnostic messages will be printed to the screen.
Visualization and the
conjugate gradient history files will be written in /data1/<your
loginid>/<output directory>.
For scalability timings, capture the output (e.g., with "grep -i pyramid OUT.log") that looks like this (the bracketed items are explanatory; the data will appear on the screen and also in OUT.log if you chose to save it as suggested above):
PYRAMID Current Time: xxxx.xx [A. Loading Data Start]
PYRAMID Current Time: xxxx.xx [B. Elastic Soln Start]
PYRAMID Current Time: xxxx.xx [C. Time Stepping Start]
PYRAMID Current Time: xxxx.xx [D. Program Termination]
The total runtime will be in step D.
In order to examine scalability, the key number is the average time per iteration in the time-stepping stage. Since the problem size grows roughly in proportion to the number of processors (4, 16, 64) in scaled fashion, this number should be roughly constant across these problem sizes. Compute the average time per iteration as (D-C)/X, where X is the total number of iterations for each problem size:
4 PE case: X = 83,256
16 PE case: X = 132,429
64 PE case: X = 195,914
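The arithmetic can be scripted if desired. A minimal tcsh/awk sketch, with placeholder timestamps (read the actual C and D values from the grep output above):

# Hypothetical values for illustration only; substitute the timestamps
# printed at [C. Time Stepping Start] and [D. Program Termination].
set C = 1000.00
set D = 5000.00
set X = 83256        # iteration count for the 4 PE case
echo $C $D $X | awk '{ printf "average time per iteration = %g sec\n", ($2 - $1)/$3 }'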
As an aside, the baseline case ran in about 13 hours for a problem about the same size as the 4 PE case above, so for scalability relative to the baseline case, the 4 PE case would have to run in less than 3 hours. Our experience has been that the 4 PE case can run in 1 hour (dual-CPU mode) or less (single-CPU mode).
Thunderhead is described at http://newton.gsfc.nasa.gov/thunderhead/index.htm. Particulars relevant to this run (current as of November 2003) are: Thunderhead is a 512-processor commodity cluster located at NASA/Goddard Space Flight Center. It features 512 2.4 GHz Pentium 4 Xeons, 256 GB DDR memory, 20 TB disk space and a 2.2 Gbps Myrinet fiber interconnect.
This section has two parts. The first describes how to run the milestone case on a Unix/Linux-based system with the Intel compilers and MPI, using the LSF queuing system. The second part contains details demonstrating that the milestone case was run successfully on such a system at JPL, the Intel Pentium Xeon-based cluster called Cosmos.
Part I
Building the code on Cosmos is accomplished by
following these steps:
· Download GeoFEST and PYRAMID from the Open Channel Foundation: http://openchannelfoundation.org/
· Download ParMetis from the University of Minnesota: http://www-users.cs.umn.edu/~karypis/metis/parmetis/index.html/
· "tar xvzf GeoFEST.tgz" in your home directory ~/
· Load the Intel 7.x compiler and Myrinet modules:
module purge
module load latest_intel71
module load mpich-gm-intel71
· Install ParMetis in ~/GeoFEST/Pyramid/Pyramid/ParMetis/
· cd ~/GeoFEST/Pyramid/ and type "make -f Makefile.Intel"
· cd ~/GeoFEST/geofest/ and type "make -f Makefile.Parallel"
This results in an executable program called
GeoFEST in ~/GeoFEST/geofest.
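As with Thunderhead, the sequence can be scripted. A minimal sketch, assuming ParMetis has already been installed in ~/GeoFEST/Pyramid/Pyramid/ParMetis/ and that the environment-modules initialization is available to scripts:

#!/bin/sh
# Sketch of the Cosmos build sequence described above.
module purge
module load latest_intel71          # Intel 7.x compilers
module load mpich-gm-intel71        # MPICH over Myrinet GM
cd ~/GeoFEST/Pyramid  && make -f Makefile.Intel
cd ~/GeoFEST/geofest  && make -f Makefile.Parallel
# The executable is left in ~/GeoFEST/geofest/GeoFEST.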
Follow these steps to run the code on Cosmos.
· Set up input (input.dat and input.dat.jpl for the milestone meshes):
ln -s /pathtofile/LandersDeep.dat input.dat
ln -s /pathtofile/LandersDeep.dat.jpl input.dat.jpl
· If these files are not present on Cosmos, they may be retrieved from http://www.servogrid.org/slide/GEM/SERVO/GeoFESTBenchmarks/ and expanded with the unzip utility.
· Submit an LSF batch job (an example submission command follows the script):
#!/bin/sh
#BSUB -x -n 490 -R "span[ptile=1]" -W 13:00
#BSUB -a mpich_gm
#BSUB -q long
#BSUB -J GeoFEST
#BSUB -o /tmp/QSMG.outfile -e /tmp/QSMG.errfile
date
# NOTE: LSF starts in the current working directory by default.
cd ~/GeoFEST/geofest
mpirun.lsf ./GeoFEST input.dat ~/
Note:
-x -R "span[ptile=1]" means use one CPU per node
-n means number of processors requested
-W hh:mm means wall clock time requested in hours:minutes
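Assuming the script above is saved under a name such as geofest.lsf (a name chosen here only for illustration), it is submitted and monitored with the standard LSF commands:

bsub < geofest.lsf     # submit; the #BSUB directives are read from the script
bjobs                  # check job status in the queue
# stdout/stderr land in the files named by the -o and -e directives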
Follow these steps to interpret the results.
· When the code runs, some welcome and diagnostic messages will be printed to the screen.
· Visualization and the conjugate gradient history files will be written in ~/
· For scalability timings, capture the output (e.g., with "grep -i pyramid QSMG.outfile") that looks like this (the bracketed items are explanatory; the data will appear on the screen and also in QSMG.outfile if you chose to save it as suggested above):
PYRAMID Current Time: xxxx.xx [A. Loading Data Start]
PYRAMID Current Time: xxxx.xx [B. Elastic Soln Start]
PYRAMID Current Time: xxxx.xx [C. Time Stepping Start]
PYRAMID Current Time: xxxx.xx [D. Program Termination]
The total runtime will be in step D.
Cosmos is described at:
http://sc.jpl.nasa.gov/hardware/dell1750. Particulars relevant to this run are:
Cosmos is a 1024-processor cluster located at the NASA Jet Propulsion Laboratory. It features 512 dual-CPU 3.2 GHz Pentium 4 Xeons, 2 TB memory, 48 TB disk space and a 2+2 Gbps (i.e., bidirectional) Myrinet fiber interconnection network.
Part II
The following are excerpts from the run script that
demonstrate that the milestone was successfully met.
Transcript from LSF log file:
Sender: LSF System
<lsfadm@n030>
Job <#!/bin/sh;#BSUB -x
-n 490 -R "span[ptile=1]" -W 24:00;#BSUB -a mpich_gm;#BSUB -q
test;#BSUB -o /scratch00/nortonc/Milestone/QSM.outfile -e
/scratch00/nortonc/Milestone/QSM.errfile
# my default stdout, stderr files; date;# NOTE: LS> was submitted
from host <cosmos> by user <nortonc>.
Started at Fri Mar 25 18:06:19 2005
Results reported at Sat Mar 26 06:08:31 2005
Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
#!/bin/sh
#BSUB -x -n 490 -R
"span[ptile=1]" -W 24:00
#BSUB -a mpich_gm
#BSUB -q test
#BSUB -o /scratch00/nortonc/Milestone/QSM.outfile
-e /scratch00/nortonc/Milestone/QSM.errfile
# my default stdout, stderr files
date
# NOTE: LSF starts in the
current working directory by default.
cd
/home/nortonc/GeoFEST-4.5/geofest-MG
mpirun.lsf ./GeoFEST
input.dat /scratch00/nortonc/Milestone
------------------------------------------------------------
Successfully completed.
Resource usage summary:
CPU time: 21053728.00 sec.
Max Memory: 257131 MB
Max Swap: 361327 MB
Max Processes: 490
Max Threads: 490
The output (if any)
follows:
Fri Mar 25 18:06:20 PST
2005
PYRAMID Unstructured AMR
Library
DEMO Version 1.15, 3/25/2005 at 18 hrs 6 min 44 sec
Initialized:
490 processors using double precision reals
************************************************
Welcome to the Geophysical
Finite Element Simulation Tool (GeoFEST 4.5) (./GeoFEST)
CPU time = 0.520000
Loading Data At: PYRAMID Current Time: 1.58786773681641D-004
Opening output file
/scratch00/nortonc/Milestone/LandersDeep.out ...
landers geometry, two layers, three faults, deeper
box
500 years, 8 print times
Number of nodes = 1902576
No surface tractions to process
. . .
Split nodes all processed
. . .
Generation of element
group #1 complete.
Doing shape functions...
Element generation
complete. Setting up linear algebra...
Equation setup completed.
Time data generation complete...
Starting Elastic Soln
At: PYRAMID Current Time: 1273.76971983910
Starting elastic
solution...
Initializing globalize
comm buffers:
Calling SOLVER
Elastic solution complete.
Starting Save Attributes.
save_attributes complete.
Starting Adaptive Refinement.
Adaptive mesh refinement complete.
Starting Update Phase.
Number of nodes = 2840427
Global b.c. map
generated...
Decomposed node and
equation totals:
Global number of elements = 16021034
Split nodes all processed
. . .
Generation of element
group #1 complete.
Starting Cleanup...
Starting Final Free...
Starting elastic
solution...
Initializing globalize
comm buffers:
Calling SOLVER
Elastic solution complete.
CPU time = 585.400000
Starting Time_Step Soln
At: PYRAMID Current Time: 1443.54365301132
Starting time-stepping
algorithm...
Performing scheduled
output at time = 2.000000 ...
Output step finished.
Performing scheduled
output at time = 10.000000 ...
Output step finished.
Performing scheduled
output at time = 50.000000 ...
Output step finished.
Performing scheduled
output at time = 100.000000 ...
Output step finished.
Performing scheduled
output at time = 200.000000 ...
Output step finished.
Performing scheduled
output at time = 300.000000 ...
Output step finished.
Performing scheduled
output at time = 400.000000 ...
Output step finished.
Performing scheduled output at time = 500.000000
...
Output step finished.
Normal program termination.
Finishing Time_Step Soln At: PYRAMID Current Time: 43281.3791708946
Transcript from cghist (conjugate gradient history)
file:
time=0 converged in 393
iterations norm reduced 9.88859e-08
starting=558122 ,
ending=0.0551904
time=0 converged in 429
iterations norm reduced 9.92227e-08
starting=710706 ,
ending=0.0705182
time=0.5 converged in 394
iterations norm reduced 9.87508e-08
starting=710648 ,
ending=0.070177
time=1 converged in 918
iterations norm reduced 9.8854e-08
starting=0.357578 ,
ending=3.5348e-08
time=1.5 converged in 799
iterations norm reduced 9.94214e-08
starting=0.149331 ,
ending=1.48467e-08
…
time=498 converged in 943
iterations norm reduced 9.94877e-08
starting=0.00420323 ,
ending=4.18169e-10
time=498.5 converged in
948 iterations norm reduced 9.91846e-08
starting=0.000210996 ,
ending=2.09276e-11
time=499 converged in 941
iterations norm reduced 9.92404e-08
starting=2.78625e-05 ,
ending=2.76508e-12
time=499.5 converged in
918 iterations norm reduced 9.76287e-08
starting=3.38485e-06 ,
ending=3.30459e-13
time=500 converged in 834
iterations norm reduced 9.84679e-08
starting=4.34575e-07 ,
ending=4.27916e-14
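If a summary of the solver behavior is wanted, the iteration counts can be extracted from the history file with a short filter. A minimal sketch, assuming the file is named cghist and holds one "time=... converged in N iterations ..." record per line as excerpted above:

# Print time step vs. CG iteration count, then the overall average.
awk '/converged/ { split($1, a, "="); print a[2], $4; sum += $4; n++ }
     END { if (n > 0) printf "mean iterations per solve = %.1f\n", sum/n }' cghist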
Scientific Significance
Achieving the First Code Improvement Milestone is significant because it opens the way to running problems of significant size, now that the earthquake code as well as the Fast Multipole library run in parallel under MPI.
For the first time it presents to the scientific community fast parallel
codes that allow creating simulations of the entire earthquake cycle in a 3D
model that uses the most accurate description of fault friction, rate and state
friction, and the quasi-dynamic radiation damping approximation to full
elastodynamics. We now have the potential for greatly increasing the number of
elements that can be included in the model over what could be done in the past.
This means that enough elements can now be used that it is possible to represent a reasonably sized fault with elements small enough to properly represent the behavior of a continuum. Larger numbers of elements also allow earthquakes with a large range of sizes to occur in the simulation. This means that it will be possible to study in the simulations the situations in which small earthquakes occur in isolation and those in which they may cascade or grow into larger ones. This could help gain an understanding of whether
patterns of microseismicity might be used to help predict earthquakes. The attainment of this milestone not only represents an advance in our computational ability to simulate earthquakes; it will also allow us to understand the earthquake process better by creating data sets that can be compared with data on real earthquakes. The attainment of the next milestone (Second Code Improvement) will involve increasing the efficiency of the code in other ways, now that the parallel implementation has been achieved, and this will allow even larger and more realistic simulations to be run.
Computational Significance
Demonstration of
performance goals:
Chapman is described at http://www.nas.nasa.gov/About/Profile/resources.html. Particulars relevant to this run (current as of November 2003) are: "Chapman, an SGI
Origin 3000, is currently the only 1,024-processor single-image, shared-memory
system in existence, with one operating system and a single address space.
Chapman will become a major component of the IPG, and is currently being used
to demonstrate that applications can scale to 1024 processors on this machine.
The system has 128 gigabytes of main memory, 2 terabytes of FC Raid disk
storage."
Thunderhead is described at http://newton.gsfc.nasa.gov/thunderhead/index.htm. Particulars relevant to this run (current as of November 2003) are: "Thunderhead is a 512-processor Commodity Cluster located at NASA/Goddard Space Flight Center. It features 512 2.4 GHz Pentium 4 Xeons, 256 GB DDR memory, 20 TB disk space and a 2.2 Gbps Myrinet fiber interconnect."
GeoFEST References
[1] Thomas J. R. Hughes and Robert Taylor, “Unconditionally Stable Algorithms For Quasi-Static Elasto-Plastic Finite Element Analysis,” Computers & Structures, Vol. 8, pp. 169-173, 1978.
[2] Thomas J. R. Hughes, The Finite Element Method: Linear Static and Dynamic Finite Element Analysis, Dover Publications, Inc., Mineola, New York, 2000.
[1] Runge-Kutta is a method for forward integration of differential equations. It involves calculating derivatives of the function at the current time and at several fractions of potential time steps in the future, appropriately weighting these derivatives to estimate the best derivative value to use, and determining the value of the function at the new time by multiplying that best derivative by the appropriate time step. The fifth-order Runge-Kutta method compares estimates made using two different time steps and, based on this comparison, determines whether a smaller or larger time step should be used for the next step. This allows for adaptive time-stepping, which is extremely important in problems such as this, where the time steps can vary by as much as ten orders of magnitude depending on whether interseismic or coseismic behavior is involved.
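The step-size control described here can be stated compactly. As a generic sketch (the standard controller for a method of local order $p$, not a formula taken from the PARK source), comparing the two estimates gives an error measure and a new step size:

$$\Delta = \left| y_{\text{two half steps}} - y_{\text{one full step}} \right|, \qquad h_{\text{new}} = h \left( \frac{\Delta_{\text{tol}}}{\Delta} \right)^{1/(p+1)},$$

so the step shrinks when $\Delta > \Delta_{\text{tol}}$ and grows otherwise; in practice $h_{\text{new}}$ is multiplied by a safety factor slightly less than one.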
[2] For 8 of the 1,000 time steps, a snapshot of full displacement data was captured to a file (as was done in the baseline case for Milestone E).