

RTS: The FMS Regression Test Suite

This is the home page of the Regression Test Suite of the Flexible Modeling System (FMS). It introduces the RTS and provides the basis for its design. We describe a standard syntax to express an RTS experiment, and provide instructions for constructing a new experiment. We also provide links to the current entries in the RTS.

A printable (PDF) version of this document is available as /home/vb/tex/reports/rts.pdf.


What is the RTS?

The FMS Regression Test Suite (RTS) is a set of runs designed to assess FMS-based model configurations for correctness and performance. These runs will be performed continuously by Modeling Services and will be used to track performance enhancements as they are delivered. The RTS spans all the model configurations run in production, which are responsible for much of the throughput at GFDL. It also includes model configurations under consideration for future runs (e.g. higher resolution), as well as other test configurations, such as the solo (Held-Suarez) atmospheric models, used for more direct tests of the FMS framework itself.

Modeling Services is working closely with the GFDL Model Development Teams to ensure that the RTS remains current and correctly reflects the behaviour of FMS models in production, as well as those model configurations actively in consideration for future production runs. To this end, the RTS is closely linked with the Model Development Database.


The RTS and the Model Development Database

Users preparing a new experiment may prepare it for inclusion in the database in the form of an RTS entry, as described in xml. This may then be given to the associated Liaison from Modeling Services, who takes responsibility for performing the RTS integrity tests, verifying that the source has been tagged in a manner that guarantees indefinite reproduction, and then creating the entry in the database.


Design of the RTS

At a certain stage in the evolution of a modeling system, there is a shift in emphasis from development to production, and it becomes necessary to be systematic about tracking and understanding changes. Changes can be of various origins:

Scientific and algorithmic
These involve changes to model physics and algorithms and by definition ``change answers''. A change of answers is here defined as any result that is not bitwise identical to that of the parent version. Scientific changes intentionally produce new results; algorithmic changes improve performance (either in numerics or in execution time) and conform to a weaker definition of change in a climate modeling context: the resulting time trajectory of the simulation may not be bitwise identical to that of the parent version, but is equivalent in a statistical sense: the model ``has the same climate''.
Technical
These are changes to the modeling infrastructure (parallelism, I/O, etc.) and typically should not ``change answers''. In the event that answers do change, they are expected to conform to the weaker ``statistical equivalence'' criterion above.
Platform
This includes both hardware migrations as well as software environment changes (operating system, compilers, libraries). It is typically not possible to maintain bitwise integrity across platform changes, but it is hoped that model results also conform to the ``statistical equivalence'' criterion.

The RTS is designed to be the system for delivering objective information about changes in FMS-based model configurations, and for maintaining a timeline of this information. An RTS experiment guarantees repeatable, bitwise-identical integrations under the controlled conditions defined below. Any change, including one that produces the ``same climate'', needs to be certified by the relevant science teams and is represented as a new RTS experiment.


RTS experiment creation

An RTS experiment is created from a valid entry in the FMS Model Development Database.


RTS integrity tests

We currently define three sets of RTS integrity tests:

restartFreq
An RTS experiment must be bitwise identical on the same platform when the length of individual run segments is changed. For instance, two simulations run in segments of 2 days and 3 days respectively must match exactly every 6 days (the least common multiple of the two segment lengths).

peCount
An RTS experiment must be bitwise identical on the same platform when run at different PE counts. The set of PE counts is of course restricted to the ones valid for that model.

The configuration of a model for the peCount test may not be identical to the production configuration, in which we may choose to use more efficient, but irreproducible, settings. While this is not recommended, it is done in practice in at least one instance: the flag make_exchange_reproduce, which controls the reproducibility of the exchange grid, is set to .TRUE. here and to .FALSE. in the standard run below (a namelist sketch follows this list). Where we use non-reproducing optimizations, the cost of the reproducibility option will be documented in the standard run.

initNaN
An RTS experiment must produce bitwise identical results when all declared floating-point variables are initialized to an invalid value (NaN). Most compilers permit this to be done through a compiler flag. NaN initialization may be turned off in production, as it might entail a computational cost.
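As an illustration of why NaN initialization is useful, consider the minimal sketch below (not FMS code): a variable is used before it is ever assigned. Without NaN initialization this bug produces a plausible-looking but meaningless number; with it, the error surfaces immediately. The program and the flag named in the comment are examples only, and the appropriate flag depends on the compiler in use.

    program init_nan_demo
      ! Minimal sketch: with a NaN-initialization flag (e.g. gfortran's
      ! -finit-real=snan; other compilers use different flags) the use of
      ! the unassigned variable below yields NaN, or traps, instead of a
      ! silently wrong result.
      implicit none
      real :: t_surf             ! never assigned -- the bug we want to catch
      real :: flux
      flux = 0.5 * t_surf        ! NaN propagates into flux (or the run traps)
      print *, 'flux = ', flux
    end program init_nan_demo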

Exact matches are checked by running the resdiff script on restart files. These are short runs, typically a few simDays long.
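As a concrete illustration of the reproducibility setting mentioned under peCount above: in FMS, make_exchange_reproduce is a namelist variable of the exchange-grid module, and would be enabled for this test with an entry along the following lines. This is a sketch only; the namelist group name, xgrid_nml, is recalled from the FMS source and should be verified against the model's own documentation.

    &xgrid_nml
        make_exchange_reproduce = .true.
    /

In the standard run below the same variable would be set to .false., accepting results that depend on the PE count in exchange for faster execution.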


RTS performance tests

The only performance that matters is the throughput of the actual production run, for which the RTS is merely a proxy. We try to define a run whose throughput hews closely to that of the production run. The only real measure of throughput is the actual time-to-solution, here expressed in units of simYears/wallDay (simulated model years per day of wall clock time); for example, a model that completes 10 simulated days per hour of wall clock achieves 240 simulated days per wall-clock day, or about 0.66 simYears/wallDay. The RTS does, however, also provide other numbers useful in understanding performance. We define four sets of RTS performance tests:

Standard run
The standard run is configured exactly as the production run defined in the database, and has the same run length, diagnostic output configuration, etc. It represents the best estimate of the ideal throughput of a production run. If the actual production run appears to be producing less throughput than the standard run predicts (and especially if it is producing more :-), this is an issue that needs to be addressed immediately.

We note that the future standard runscript for FMS-based model runs will be based on the RTS script, and should be identical to this standard run. We note also that in practice users modify runscripts often, and for excellent reasons. It is worthwhile checking frequently against the RTS standard run to see if one has inadvertently degraded (or enhanced) performance.

The standard run also provides information on the queue wait time associated with production runs. This information will be used to tune the queuing and scheduling on the HPCS.

Performance variance
We would like performance to be predictable, but in practice it may vary with various uncontrolled factors such as system contention. We measure this by running one of the short runs repeatedly at different times of day, on different LSC nodes, etc., to measure the intrinsic variance as well as to bring to light any systematic variation that may be an indicator of system health.
Instrumented runs
Every FMS run is instrumented with lightweight built-in timers (mpp_clock calls) whose output will be automatically collated by the RTS output processing scripts; a minimal sketch of these timer calls is given after this list. We are undertaking a more systematic instrumentation of the code with built-in timers.

In addition, we define additional instrumented runs based on performance analysis software. For the SGI platform, there are two types of performance analysis tools: Speedshop and Perfex. These are described in some detail in the SGI notes and in even more copious detail in SGI technical publications.

Scaling studies
The results of the peCount integrity test double as a scaling study.
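A minimal sketch of the built-in timer calls mentioned under ``Instrumented runs'' above is shown below, using the mpp_clock interface of the FMS mpp module. The enclosing subroutine, the clock name, and the note on where the totals appear are illustrative assumptions, not actual model code.

    subroutine timed_dynamics_step
      ! Sketch of the FMS built-in timers; the clock name and the timed
      ! region are placeholders.  In real code the clock id is obtained
      ! once, during initialization, not on every call.
      use mpp_mod, only: mpp_clock_id, mpp_clock_begin, mpp_clock_end
      implicit none
      integer :: id_dyn
      id_dyn = mpp_clock_id('Dynamics')   ! register a named clock
      call mpp_clock_begin(id_dyn)        ! start timing
      ! ... dynamical core time step would go here ...
      call mpp_clock_end(id_dyn)          ! stop timing; per-clock totals are
                                          ! reported in the model's stdout
    end subroutine timed_dynamics_step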


Standard description of an RTS experiment

An RTS experiment is designed to run FMS model configurations under controlled conditions that closely resemble the production environment. We define a standard syntax for the run procedure so that all experiments are run under as similar conditions as possible. We have chosen to use XML as the syntax for describing an RTS experiment.

The HOWTO document describing how to set up an RTS experiment is available online.


RTS: current status

The following entries currently exist in the RTS:

am2p10
coupled_example
spectral_solo
bgrid_solo
MCM_qflux_a11c
MCM_qflux_a11c_Restore
LM2p0_H
LM2p0_H_nov02
om2p2
cm2h1
cm2_a10o1
cm2_a10o2
mom4_test1


Author: V. Balaji