CHAPTER 1 Introduction

Overview

The Climate Data Management System is an object-oriented data management system, specialized for organizing multidimensional, gridded data used in climate analysis and simulation.

CDMS is implemented as part of the Climate Data Analysis Tool (CDAT), which uses the Python language. The examples in this chapter assume some familiarity with the language and with the Python Numeric module (http://numpy.sf.net). A number of excellent tutorials on Python are available in books or on the Internet. For example, see http://python.org.

Variables

The basic unit of computation in CDMS is the variable. A variable is essentially a multidimensional data array, augmented with a domain and a set of attributes. As a data array, a variable can be sliced to obtain a portion of the data, and can be used in arithmetic computations. For example, if u and v are variables representing the eastward and northward components of the wind velocity, respectively, and both variables are functions of time, latitude, and longitude, then the wind speed at time index 0 (the first timepoint) can be calculated as

>>> from cdms import MV

>>> vel = MV.sqrt(u[0]**2 + v[0]**2)

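This illustrates several points: square brackets slice a variable (u[0] selects the first timepoint), variables can be used directly in arithmetic (** is the Python exponentiation operator), and functions such as sqrt are provided by the cdms.MV module.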

File I/O

A variable can be obtained from a file or collection of files, or can be generated as the result of a computation. Files can be in any of the self-describing formats netCDF, HDF, GrADS/GRIB (GRIB with a GrADS control file), or PCMDI DRS. (Depending on your local installation, HDF and DRS may or may not be enabled.) For instance, to read data from file sample.nc into variable u :

>>> import cdms

>>> f = cdms.open('sample.nc')

>>> u = f('u')

Data can be read by index or by world coordinate values. The following reads the n-th timepoint of u (the syntax slice(i,j) refers to indices k such that i <= k < j):

>>> u0 = f('u',time=slice(n,n+1))

and this reads u at time 366.0:

>>> u1 = f('u',time=366.)
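
The difference between the two selection styles can be sketched in plain Python, without CDMS. This is only an illustration of the semantics: the list times reuses the time values that this chapter shows for u, and index_of_coordinate is a hypothetical helper, not part of the cdms API.

```python
# Plain-Python sketch of index-based versus coordinate-based selection.
# `times` and `index_of_coordinate` are illustrative names, not CDMS API.
times = [0.0, 366.0, 731.0]          # coordinate values along the time axis

# Selection by index: slice(i, j) covers indices k with i <= k < j.
n = 1
by_index = times[slice(n, n + 1)]    # same as times[n:n+1]


def index_of_coordinate(axis_values, value):
    """Return the index whose coordinate equals `value` (exact match only)."""
    return axis_values.index(value)


# Selection by world coordinate: find the index that holds time 366.0.
k = index_of_coordinate(times, 366.0)
by_value = times[k:k + 1]

print(by_index, by_value)
```

Both selections return the same single timepoint here, because index 1 is the position of coordinate value 366.0.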

A variable can be written to a file with the write method of the file object:

>>> g = cdms.open('sample2.nc','w')

>>> g.write(u)

<Variable: u, file: sample2.nc, shape: (3, 16, 32)>

>>> g.close()

Domains and Axes

The spatial and temporal information associated with a variable is represented by the variable domain, an ordered tuple of axes and/or grids. In the above example, the domain of the variable u is the tuple (time, latitude, longitude). This indicates the order of the dimensions, with the slowest-varying dimension listed first (time).

Each element of the tuple is an axis. An axis is like a 1-D variable, in that it can be sliced, and has attributes. A number of functions are available to access axis information. For example, to see the list of time values associated with u:

>>> t = u.getTime()

>>> print t[:]

[ 0., 366., 731.,]

Attributes

As mentioned above, variables can have associated attributes. In fact, nearly all CDMS objects can have associated attributes, which are accessed using the Python dot notation:

>>> u.units='m/s'

>>> print u.units

m/s

Attribute values can be strings, scalars, or 1-D Numeric arrays.

When a variable is written to a file, not all the attributes are written. Some attributes, called internal attributes, are used for bookkeeping, and are not intended to be part of the external file representation of the variable. In contrast, external attributes are written to an output file along with the variable. By default, when an attribute is set, it is treated as external. To see the list of external attribute names:

>>> print u.attributes.keys()

['datatype', 'name', 'missing_value', 'units']

The built-in Python dir function lists the internal attribute names:

>>> dir(u)

['_MaskedArray__data', '_MaskedArray__fill_value', ..., 'id', 'parent']

In general, internal attributes should not be modified directly. One exception is the id attribute, the name of the variable. It is used in plotting and I/O, and can be set directly.
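
The split between internal and external attributes can be pictured with a small stand-alone class. This is a hypothetical sketch, not the CDMS implementation: the class SketchVariable and the tuple _internal are invented for illustration, and only the names id, parent, and attributes are taken from the text above.

```python
# Hypothetical sketch of the internal/external attribute split.
# SketchVariable is an illustrative class, not the CDMS implementation.
class SketchVariable:
    _internal = ("id", "parent", "attributes")  # bookkeeping names

    def __init__(self, name):
        object.__setattr__(self, "attributes", {})  # external attributes
        object.__setattr__(self, "id", name)
        object.__setattr__(self, "parent", None)

    def __setattr__(self, name, value):
        if name in self._internal:
            object.__setattr__(self, name, value)   # internal: kept on the object
        else:
            self.attributes[name] = value           # external: would go to the file

    def __getattr__(self, name):
        # Called only when normal lookup fails, so internal names never get here.
        try:
            return self.__dict__["attributes"][name]
        except KeyError:
            raise AttributeError(name)


u = SketchVariable("u")
u.units = "m/s"       # treated as external by default
u.id = "u_wind"       # internal, but safe to set directly

print(sorted(u.attributes))   # external names only
print(u.units, u.id)
```

In this sketch, only the contents of the attributes dictionary would be written alongside the data. The id is stored separately even though it can be assigned with the same dot notation.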

Masked values

Variables can have an optional mask which represents a portion of data that is missing. If present, the mask of a variable is an array of ones and zeros, of the same shape as the data array. A mask value of one indicates that the corresponding data array element is missing or invalid.

Arithmetic operations in CDMS take missing data into account. The same is true of the functions defined in the cdms.MV module. For example:

>>> a = MV.array([1,2,3]) # Create array a, with no mask

>>> b = MV.array([4,5,6]) # Same for b

>>> a+b

variable_13

array([5,7,9,])

>>> a[1]=MV.masked # Mask the second value of a

>>> a.mask() # View the mask

[0,1,0,]

>>> a+b # The sum is masked also

variable_14

array(

data = [5,0,9,],

mask = [0,1,0,],

fill_value=[0,]

)

When data is read from a file, the result variable is masked if the file variable has a missing_value attribute. The mask is set to one for those elements equal to the missing value, zero elsewhere. If no such attribute is present in the file, the result variable is not masked.

When a variable with masked values is written to a file, data values with a corresponding mask value of one are set to the value of the variable's missing_value attribute. The data and missing_value attribute are then written to the file.
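
The three masking rules above (mask on read, mask propagation in arithmetic, fill on write) can be sketched in plain Python. The function names and the missing value 1.0e20 are assumptions made for illustration; they are not cdms.MV functions, and the placeholder 0 at masked positions simply mirrors the session output shown earlier.

```python
# Plain-Python sketch of the masking rules described above.
# All function names here are illustrative; they are not cdms.MV functions.

def mask_from_missing(data, missing_value):
    """Reading: mask is 1 where the data equals missing_value, else 0."""
    return [1 if x == missing_value else 0 for x in data]


def masked_add(a, a_mask, b, b_mask):
    """Elementwise sum; the result is masked wherever either input is."""
    mask = [1 if (ma or mb) else 0 for ma, mb in zip(a_mask, b_mask)]
    data = [0 if m else x + y for x, y, m in zip(a, b, mask)]
    return data, mask


def fill_for_write(data, mask, missing_value):
    """Writing: masked elements are replaced by missing_value."""
    return [missing_value if m else x for x, m in zip(data, mask)]


missing = 1.0e20                         # assumed missing_value attribute
a = [1, missing, 3]                      # data as stored in a file
a_mask = mask_from_missing(a, missing)   # second element is masked
b, b_mask = [4, 5, 6], [0, 0, 0]

total, total_mask = masked_add(a, a_mask, b, b_mask)
on_disk = fill_for_write(total, total_mask, missing)
print(total, total_mask, on_disk)
```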

Masking is covered in the section "MV module". Also see the documentation on the Python Numeric and MA modules, on which cdms.MV is based, at http://numpy.sourceforge.net.

File Variables

A variable can be obtained from a file, from a collection of files, or as the result of a computation. Correspondingly, there are three types of variables in CDMS: file variables, dataset variables, and transient variables.

A typical use of file variables is to query information about the variables in a file without actually reading their data. A file variable is obtained by applying the slice operator [] to a file, with the name of the variable, or by calling the getVariable function. Note that obtaining a file variable does not read the data array:

>>> f = cdms.open('sample.nc','r+')

>>> u = f.getVariable('u') # or u=f['u']

>>> u.shape

(3, 16, 32)

File variables are also useful for fine-grained I/O. They behave like transient variables, but operations on them also affect the associated file, as the comments in the following session indicate:

 

>>> f = cdms.open('sample.nc','r+') # Open read/write

>>> uvar = f['u'] # Note square brackets

>>> uvar.shape

(3, 16, 32)

>>> u0 = uvar[0] # Reads data from sample.nc

>>> u0.shape

(16, 32)

>>> uvar[1]=u0 # Writes data to sample.nc

>>> uvar.units # Reads the attribute

'm/s'

>>> uvar.units='meters/second' # Writes the attribute

>>> u24 = uvar(time=24.0) # Reads data

>>> f.close() # Save changes to sample.nc (I/O may be buffered)

In an interactive application, the type of variable can be determined simply by printing the variable:

>>> rlsf # Transient variable

rls

array(

array (4,48,96) , type = f, has 18432 elements)

>>> rlsg # Dataset variable

<Variable: rls, dataset: mri_perturb, shape: (4, 46, 72)>

>>> prc # File variable

<Variable: prc, file: testnc.nc, shape: (16, 32, 64)>

Note that the data values themselves are not printed. For transient variables, the data is printed only if the size of the array is less than the print limit. This value can be set with the function MV.set_print_limit to force the data to be printed:

>>> smallvar.size() # Number of elements

20

>>> MV.get_print_limit() # Current limit

300

>>> smallvar

small variable

array(

[[ 0., 1., 2., 3.,]

[ 4., 5., 6., 7.,]

[ 8., 9., 10., 11.,]

[ 12., 13., 14., 15.,]

[ 16., 17., 18., 19.,]])

 

>>> largevar.size()

400

>>> largevar

large variable

array(

array (20,20) , type = d, has 400 elements)

 

>>> MV.set_print_limit(500) # Reset the print limit

>>> largevar

large variable

array(

[[ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11., 12., 13., 14., 15., 16., 17., 18., 19.,]

... ])

The datatype of a variable is returned by its typecode method:

>>> x.typecode()

'd'

Dataset Variables

The third type of variable, a dataset variable, is associated with a dataset, a collection of files that is treated as a single file. A dataset is created with the cdscan utility. This generates an ASCII metafile that describes how the files are organized, and what metadata is contained in the files. In a climate simulation application, a dataset usually represents the data generated by one run of a general circulation or coupled ocean-atmosphere model.

For example, suppose data for variables u and v are stored in six files: u_2000.nc, u_2001.nc, u_2002.nc, v_2000.nc, v_2001.nc, and v_2002.nc. A metafile can be generated with the command:

% cdscan -x cdsample.xml [uv]*.nc

The metafile cdsample.xml is then used like an ordinary data file:

>>> f = cdms.open('cdsample.xml')

>>> u = f('u')

>>> u.shape

(3, 16, 32)
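
Conceptually, a dataset presents the per-file segments of a variable as one array. The plain-Python sketch below is only an analogy for that idea; cdscan and cdms.open do considerably more. The file names come from the example above, while the helper fake_file, the fill values, and the layout of one timepoint per yearly file are assumptions chosen to reproduce the shape (3, 16, 32).

```python
# Conceptual sketch only: cdscan and cdms.open do considerably more than this.
# One time slice (16 x 32 values) per yearly file is an assumption.
NLAT, NLON = 16, 32


def fake_file(fill):
    """Stand-in for one file's data: a list holding one lat-lon slice."""
    return [[[fill] * NLON for _ in range(NLAT)]]


files = {
    "u_2000.nc": fake_file(0.0),
    "u_2001.nc": fake_file(1.0),
    "u_2002.nc": fake_file(2.0),
}

# Treat the collection as a single variable by concatenating along time.
u = []
for name in sorted(files):
    u.extend(files[name])

shape = (len(u), len(u[0]), len(u[0][0]))
print(shape)
```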

Grids and Regridding

Latitude-longitude grids are used for regridding variables. A grid encapsulates:

For example, to regrid variable u to a 96x192 Gaussian grid:

>>> u = f('u')

>>> u.shape

(3, 16, 32)

>>> t63_grid = cdms.createGaussianGrid(96)

>>> u63 = u.regrid(t63_grid)

>>> u63.shape

(3, 96, 192)

To regrid a variable uold to the same grid as variable vnew :

>>> uold.shape

(3, 16, 32)

>>> vnew.shape

(3, 96, 192)

>>> t63_grid = vnew.getGrid() # Obtain the grid for vnew

>>> u63 = uold.regrid(t63_grid)

>>> u63.shape

(3, 96, 192)

Regridding is discussed in the section "Regridding Data".

Time types

CDMS provides extensive support for time values in the cdtime module. cdtime also defines a set of calendars, specifying the number of days in a given month.

Two time types are available: relative time and component time. Relative time is time relative to a fixed base time. It consists of a units string of the form "units since basetime" and a floating-point value.

For example, the time "28.0 days since 1996-1-1" has value=28.0 and units="days since 1996-1-1". To create a relative time type:

>>> import cdtime

>>> rt = cdtime.reltime(28.0, "days since 1996-1-1")

>>> rt

28.00 days since 1996-1-1

>>> rt.value

28.0

>>> rt.units

'days since 1996-1-1'

A component time consists of the integer fields year, month, day, hour, and minute, and the floating-point field second. For example:

>>> ct = cdtime.comptime(1996,2,28,12,10,30)

>>> ct

1996-2-28 12:10:30.0

>>> ct.year

1996

>>> ct.month

2

The conversion functions tocomp and torel convert between the two representations. For instance, suppose that the time axis of a variable is represented in units "days since 1979". To find the coordinate value corresponding to January 1, 1990:

>>> ct = cdtime.comptime(1990,1)

>>> rt = ct.torel("days since 1979")

>>> rt.value

4018.0
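
The value 4018.0 can be cross-checked with Python's standard datetime module. This is an independent check rather than a description of how cdtime computes the result, and it assumes the standard Gregorian calendar with "days since 1979" read as days since 1979-1-1.

```python
from datetime import date

base = date(1979, 1, 1)      # "days since 1979", taken as 1979-1-1
target = date(1990, 1, 1)    # January 1, 1990

# 11 years of 365 days plus the leap days of 1980, 1984 and 1988.
days = (target - base).days
print(float(days))
```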

Time values can be used to specify intervals of time to read. The syntax time=(c1,c2) specifies that data should be read for times t such that c1<=t<=c2:

>>> c1 = cdtime.comptime(1990,1)

>>> c2 = cdtime.comptime(1991,1)

>>> ua = f['ua']

>>> ua.shape

(480, 17, 73, 144)

>>> x = ua.subRegion(time=(c1,c2))

>>> x.shape

(12, 17, 73, 144)

or string representations can be used:

>>> x = ua.subRegion(time=('1990-1','1991-1'))
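
The closed-interval rule c1<=t<=c2 can be illustrated without CDMS. The sketch below assumes monthly time points placed at the 15th of each month, which is an assumption about the data and is not stated for the ua example. With that layout, the interval from 1990-1 to 1991-1 selects exactly twelve timepoints.

```python
from datetime import datetime

# Assumed monthly time points, placed at the 15th of each month of 1989-1991.
times = [datetime(y, m, 15) for y in (1989, 1990, 1991) for m in range(1, 13)]

c1 = datetime(1990, 1, 1)
c2 = datetime(1991, 1, 1)

# Closed interval: keep indices whose time t satisfies c1 <= t <= c2.
selected = [i for i, t in enumerate(times) if c1 <= t <= c2]
print(len(selected), selected[0], selected[-1])
```

In this sketch, if the time points fell exactly on the first of each month, the closed upper bound would also include 1991-1-1 and thirteen indices would be selected.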

Time types are described in the section "cdtime Module".

Plotting data

Data read via the CDMS Python interface can be plotted using the vcs module. This module, part of the Climate Data Analysis Tool (CDAT), is documented in the VCS reference manual. The vcs module provides access to the functionality of the VCS visualization program.

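To generate a plot, initialize a canvas with vcs.init and then call the plot routine of the canvas with the variable to be plotted. For example: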

>>> import cdms, vcs

>>> f = cdms.open('sample.nc')

>>> f['time'][:] # Print the time coordinates

[ 0., 6., 12., 18., 24., 30., 36., 42., 48., 54., 60., 66., 72., 78., 84., 90.,]

>>> precip = f('prc', time=24.0) # Read precip data

>>> precip.shape

(1, 32, 64)

>>> w = vcs.init() # Initialize a canvas

'Template' is currently set to P_default.

Graphics method 'Boxfill' is currently set to Gfb_default.

>>> w.plot(precip) # Generate a plot

(generates a boxfill plot)

By default a boxfill plot of the lat-lon slice is produced. Since variable precip includes information on time, latitude, and longitude, the continental outlines and time information are also plotted.

The plot routine has a number of options for producing different types of plots, such as isofill and x-y plots. See the section "Plotting CDMS data in Python" for details.

Databases

Datasets can be aggregated together into hierarchical collections, called databases. In typical usage, a program:

Databases add the ability to search for data and metadata in a distributed computing environment. At present CDMS supports one particular type of database, based on the Lightweight Directory Access Protocol (LDAP).

Here is an example of accessing data via a database:

>>> db = cdms.connect() # Connect to the default database.

>>> f = db.open('ncep_reanalysis_mo') # Open a dataset.

>>> f.variables.keys() # List the variables in the dataset.

['ua', 'evs', 'cvvta', 'tauv', 'wap', 'cvwhusa', 'rss', 'rls', ...

'prc', 'ts', 'va']

Databases are discussed further in the section "Database".
