Best Practices for Preparing Ecological Data Sets

Best Practices for Preparing Ecological and Ground-Based Data Sets to Share and Archive¹
Robert B. Cook, Richard J. Olson, Paul Kanciruk, and Leslie A. Hook
Environmental Sciences Division, Oak Ridge National Laboratory
October 2000

At the request of several field researchers, we have written guidance on data management practices investigators should perform during the course of data collection to improve the usability of their data sets. This guidance is tailored to those who perform ecological and other ground-based measurements, although many of the practices may be useful for other data collection activities.

We assembled what we feel are the most important practices that researchers could implement to make their data sets ready to share with global change researchers. These practices could be performed at any time during the preparation of the data set, but we suggest that researchers consider them before measurements are taken.

The seven best practices are:

Assign Descriptive File Names
Use Consistent and Stable File Formats
Define the Parameters
Use Consistent Data Organization
Perform Basic Quality Assurance
Assign Descriptive Data Set Titles
Provide Documentation

1. Assign Descriptive File Names

File names should reflect the contents of the file and include enough information to uniquely identify the data file. File names should contain information such as project acronym, study title, location, investigator, year(s) of study, and file type. The file name should be provided in the documentation (Sect. 7) and in the first line of the header rows in the file itself.

Clear, descriptive, and unique file names may be important later when your data file is combined in a directory or FTP site with your own data files or with the data files of other investigators. Avoid using file names such as mydata.dat or 1998.dat.

An example of a great file name: narsto_texas_pm2.5_study_1997-1998.csv

where NARSTO is the name of the project, Texas is the location, “PM2.5 Study” is the study title, 1997-1998 is the date of the study, and .csv is the file type (format).

When choosing a file name, check for any database management limitations on file name length and use of special characters. In general, lower-case names are less software and platform dependent. You may want to use similar logic when designing directory structures and names. The data set title (see Sect. 6) should also be similar to the data file name(s).
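The naming advice above can be sketched in Python. This is a minimal illustration, not a prescribed tool: the function name build_file_name and the choice of which characters to keep are assumptions made for this example.

```python
import re

def build_file_name(project, location, study, years, extension):
    """Join descriptive parts into a lower-case, underscore-separated file name.

    Hypothetical helper: the set of permitted characters is an assumption.
    """
    cleaned = []
    for part in (project, location, study, years):
        # Lower-case the part and turn runs of whitespace into underscores.
        part = re.sub(r"\s+", "_", part.strip().lower())
        # Keep only letters, digits, '.', '-' and '_'.
        part = re.sub(r"[^a-z0-9._-]", "", part)
        cleaned.append(part)
    return "_".join(cleaned) + "." + extension.lower().lstrip(".")

print(build_file_name("NARSTO", "Texas", "PM2.5 Study", "1997-1998", "csv"))
# prints narsto_texas_pm2.5_study_1997-1998.csv
```

Building names from the same parts every time keeps related data, documentation, and companion files easy to match.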

2. Use Consistent and Stable File Formats

Using ASCII file formats is the best way to ensure that field data are readable into the future. Use the same format throughout the file – don’t have a different number of columns or re-arrange the columns within the file. At the top of the file, include several header rows. The first row should contain the file name, data set title, author, date, and companion file names. Other header rows (column headings) should describe the content of each column, including one row for parameter names and one for parameter units.

Within the ASCII file, delimit the parameter fields using commas, pipes (|), tabs, or semicolons; these are listed in order of our preference. Avoid delimiters that also occur in the data fields. If this cannot be avoided, enclose data fields that also contain a delimiter in single or double quotes. Don’t include rows with summary statistics; it is best to put summary statistics, figures, and other comments in a separate file or in the documentation.
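As a sketch of the layout described above, the following Python example writes a comma-delimited ASCII file with a file-information row, a parameter-name row, and a units row, and quotes any field that contains the delimiter. The function name write_data_file and all of the sample values (file name, title, author, dates) are hypothetical.

```python
import csv
import io

def write_data_file(stream, file_info, names, units, records):
    """Write header rows, then data rows that all have the same column count."""
    if len(units) != len(names):
        raise ValueError("need exactly one unit per parameter name")
    for record in records:
        if len(record) != len(names):
            raise ValueError(f"wrong number of columns in record: {record}")
    # QUOTE_MINIMAL quotes only the fields that contain the delimiter.
    writer = csv.writer(stream, delimiter=",", quoting=csv.QUOTE_MINIMAL,
                        lineterminator="\n")
    writer.writerow(file_info)  # file name, title, author, date, companion file
    writer.writerow(names)      # parameter names
    writer.writerow(units)      # parameter units
    writer.writerows(records)

buffer = io.StringIO()  # stands in for an open text file
write_data_file(
    buffer,
    file_info=["example_site_met_1996.csv",
               "Example Site Daily Meteorology, 1996",
               "A. Investigator", "19970102", "example_site_met_1996_doc.txt"],
    names=["Station", "Date", "Temp", "Precip"],
    units=["Units", "YYYYMMDD", "C", "mm"],
    records=[["HOGI", "19961001", 12, 0], ["HOGI", "19961002", 14, 3.3]],
)
print(buffer.getvalue())
```

The title contains a comma, so the writer encloses that one field in double quotes; the column count check keeps the format the same throughout the file.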

Some field researchers may generate raster data (image data or gridded GIS data). We don’t offer any general recommendations about raster data, except that the format needs to be clearly documented. Binary file formats are used for most raster data, especially large-volume raster data. For small-volume raster data (coarse resolution global data or fine resolution data of a field site), ASCII format may be appropriate.

If you cannot use ASCII or binary file formats, another option is non-proprietary public domain data formats such as netCDF or HDF. Both of these formats have been used extensively to date and are reasonably well supported with open source versions of the software needed to read and write these formats.

Whatever file format you use, be sure to thoroughly document the format (see Sect. 7).

3. Define the Parameters

In order for others to use your data, they must fully understand the parameters in the data set, including the parameter name, unit of measure, and format. The parameters reported in the data set need to have names that describe the contents. The documentation should contain a full description of the parameter. Use commonly accepted parameter names, for example, Temp for temperature, Precip for precipitation, Lat and Long for latitude and longitude. See the references in the Bibliography for additional examples. Also, be sure to use consistent capitalization (not temp, Temp, and TEMP in the same file) and use only letters and numerals in the parameter name. Because some software allows a limited number of characters, make sure that the first 8 characters are unique.
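The naming rules above lend themselves to an automated check. The following Python sketch is one possible approach; the function name check_parameter_names is hypothetical, and comparing the first 8 characters without regard to case is an assumption of this example.

```python
import re

def check_parameter_names(names, unique_prefix_length=8):
    """Return a list of messages describing parameter names that break the rules."""
    problems = []
    # Rule: only letters and numerals in the parameter name.
    for name in names:
        if not re.fullmatch(r"[A-Za-z0-9]+", name):
            problems.append(f"{name}: use only letters and numerals")
    # Rule: consistent capitalization (not temp, Temp, and TEMP together).
    first_spelling = {}
    for name in names:
        key = name.lower()
        if key in first_spelling and first_spelling[key] != name:
            problems.append(f"{name}: capitalization differs from {first_spelling[key]}")
        first_spelling.setdefault(key, name)
    # Rule: the first 8 characters must be unique (compared case-insensitively).
    first_with_prefix = {}
    for name in names:
        prefix = name[:unique_prefix_length].lower()
        other = first_with_prefix.get(prefix)
        if other is not None and other.lower() != name.lower():
            problems.append(f"{name}: first {unique_prefix_length} characters match {other}")
        first_with_prefix.setdefault(prefix, name)
    return problems

print(check_parameter_names(["Temp", "Precip", "Lat", "Long"]))       # prints []
print(check_parameter_names(["SoilTempSurface", "SoilTempDeep", "Air_Temp"]))
```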

The units of reported parameters need to be explicitly stated in the data file and in the documentation. We recommend SI units but recognize that each discipline has its own commonly used units of measure. The critical aspect here is that the units be defined so that others understand what is reported. Within each data set, choose a format for each parameter, explain the format in the documentation, and use that format throughout the file.

We recommend the following formats for common parameters:

Dates: yyyymmdd, e.g., January 2, 1997 is 19970102.

Time: Use 24-hour notation (13:30 hrs instead of 1:30 p.m.). Report in both local time and Coordinated Universal Time (UTC). Include local time zone in a separate field. As appropriate, both the begin time and end time should be reported in both local and UTC time. Because UTC and local time may be on different days, we suggest that dates be given for each time reported.

Spatial Coordinates: Spatial coordinates should be recorded in decimal degrees to at least 4 (preferably 5 or 6) decimal places. Provide latitude and longitude with south latitude and west longitude recorded as negative values, e.g., 80° 30' 00" W longitude is -80.500000. Make sure that all location information in a file uses the same coordinate system, including coordinate type, datum, and spheroid. Document all three of these characteristics (e.g., Lat/Long decimal degrees, NAD83 (North American Datum of 1983), GRS80 (Geodetic Reference System of 1980)). Mixing coordinate systems (e.g., NAD83 and NAD27 (North American Datum of 1927)) will cause errors in any geographic analysis of the data.

Elevation: Provide elevation in meters. Include detailed information on the vertical datum used (e.g., North American Vertical Datum 1988 (NAVD 1988) or Australian Height Datum (AHD)).

Missing Values: Use a decimal point (.) or an extreme value (-9999) as the missing value code. Do not use character codes in a numeric field. Use the same notation for every missing value in the data set; the code should not be parameter specific. Supply a flag or tag in a separate field to define briefly the reason for missing data.
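The recommended formats for dates, times, coordinates, and missing values can be sketched as follows. This is an illustration under stated assumptions: the function names, the output field names (DateLocal, TimeZone, and so on), the use of a UTC offset as the time zone field, and the flag text are all hypothetical choices made for this example.

```python
from datetime import datetime, timedelta, timezone

MISSING = -9999  # one missing value code for every numeric parameter

def time_fields(local_dt):
    """Report a time-zone-aware local time in both local time and UTC."""
    if local_dt.tzinfo is None:
        raise ValueError("local_dt must carry a time zone")
    utc_dt = local_dt.astimezone(timezone.utc)
    return {
        "DateLocal": local_dt.strftime("%Y%m%d"),
        "TimeLocal": local_dt.strftime("%H:%M"),   # 24-hour notation
        "TimeZone": local_dt.strftime("%z"),       # offset from UTC, e.g. -0500
        "DateUTC": utc_dt.strftime("%Y%m%d"),      # may differ from DateLocal
        "TimeUTC": utc_dt.strftime("%H:%M"),
    }

def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees, minutes, seconds to signed decimal degrees."""
    hemisphere = hemisphere.upper()
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError("hemisphere must be N, S, E, or W")
    value = degrees + minutes / 60 + seconds / 3600
    # South latitude and west longitude are recorded as negative values.
    return -value if hemisphere in ("S", "W") else value

def encode_value(value, reason="not recorded"):
    """Return (value, flag); a missing value gets the code plus a reason flag."""
    if value is None:
        return MISSING, reason
    return value, ""

local_zone = timezone(timedelta(hours=-5))  # assumed local offset for the example
print(time_fields(datetime(1997, 1, 2, 21, 30, tzinfo=local_zone)))
print(f"{dms_to_decimal(80, 30, 0, 'W'):.6f}")   # prints -80.500000
print(encode_value(None, "sensor failure"))      # prints (-9999, 'sensor failure')
```

In the example, 21:30 local time on January 2 falls on January 3 in UTC, which is why a date accompanies each reported time.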

4. Use Consistent Data Organization

We recommend that you organize the data within a file in one of two ways. Whichever style you use, be sure to place each observation in a separate line (row). Most often each row in a file represents a complete record and the columns represent all the parameters that make up the record. This arrangement is similar to a spreadsheet or matrix. For example:

Station     Date        Temp  Precip
Units       YYYYMMDD    C     mm              
HOGI        19961001    12    0
HOGI        19961002    14    3.3               
HOGI        19961003    19    -9999

The final value of -9999 is a missing value code for this data set. If you use a coded value or abbreviation for a site or station (e.g., HOGI stands for Hog Island, VA), be sure to provide a definition, including spatial coordinates, in the documentation.

A second arrangement may be more efficient when most records do not have measurements for most parameters, that is, a very sparse matrix of data, with many missing values. In this arrangement, one column is used to define the parameter and another column is used for the value of the parameter. Other columns may be used for site, date, treatment, and units of measure. For example:

Station     Date       Parameter      Value   Unit
HOGI        19961001   Temp           12      C
HOGI        19961002   Temp           14      C
HOGI        19961001   Precip         0       mm
HOGI        19961002   Precip         3.3     mm
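The two arrangements hold the same information, and one can be derived from the other. The following Python sketch converts the first (one row per record) into the second (one row per parameter value). The function name wide_to_long is hypothetical, and dropping rows whose value is the missing value code is an assumption of this example, made because the second arrangement is intended for sparse data.

```python
def wide_to_long(names, units, rows, key_columns=("Station", "Date"),
                 missing="-9999"):
    """Turn one-row-per-record data into Station, Date, Parameter, Value, Unit rows."""
    long_rows = []
    for row in rows:
        record = dict(zip(names, row))
        for name, unit in zip(names, units):
            if name in key_columns:
                continue  # key columns identify the record; they are not parameters
            if record[name] == missing:
                continue  # assumption: missing values are simply left out
            keys = [record[key] for key in key_columns]
            long_rows.append(keys + [name, record[name], unit])
    return long_rows

names = ["Station", "Date", "Temp", "Precip"]
units = ["Units", "YYYYMMDD", "C", "mm"]
rows = [
    ["HOGI", "19961001", "12", "0"],
    ["HOGI", "19961002", "14", "3.3"],
    ["HOGI", "19961003", "19", "-9999"],
]
for long_row in wide_to_long(names, units, rows):
    print(long_row)
```

Values are kept as text here so that they are written back exactly as they were read.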

An important issue with data organization is the number of records in each file (file size). There are a number of factors that determine the optimal number of records in a file, and we don’t have any hard and fast rules. In general, keep a set of similar measurements together (e.g., same investigator, methods, and instruments) in one data set. Please do not break up your data into many small files, e.g., by month or by site if you are working with several months or sites. Instead, make month or site a parameter and have all the data in one large file. Researchers who later use your relatively large data file won’t have to process many small files individually. There is an upper limit to the size of files, though. Large files (on the order of several tens of thousands of records, or several megabytes) do become unwieldy and may be too large for some applications. These very large data files need to be broken into logical smaller files.
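As one way to follow the advice on small files, per-site tables can be merged into a single table in which the site is its own parameter. This is a minimal sketch; the function name combine_by_site, the column name Site, and the site codes and values are hypothetical.

```python
def combine_by_site(rows_by_site, names):
    """Merge per-site row lists into one table with a leading Site column."""
    combined_names = ["Site"] + list(names)
    combined_rows = []
    for site in sorted(rows_by_site):          # stable, predictable site order
        for row in rows_by_site[site]:
            if len(row) != len(names):
                raise ValueError(f"wrong number of columns for site {site}: {row}")
            combined_rows.append([site] + list(row))
    return combined_names, combined_rows

header, table = combine_by_site(
    {
        "SITEB": [["19961001", "11"]],
        "SITEA": [["19961001", "12"], ["19961002", "14"]],
    },
    ["Date", "Temp"],
)
print(header)
for table_row in table:
    print(table_row)
```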

If you are collecting several different types of measurements at a site (e.g., leaf area index and above- and belowground biomass), place each type of measurement in a separate data set. For each data set, use similar data organization, parameter formats, and site names, so that users understand the interrelationships between data sets.

5. Perform Basic Quality Assurance

In addition to scientific quality assurance (QA), we suggest that you perform basic data QA on the data file:

Check file format by making sure the data are correctly delimited and line up in the proper columns.
Check file organization and descriptors to ensure that there are no missing values for key parameters (such as sample identifier, station, time, date, geographic coordinates). Sort the records by key data fields to highlight discrepancies.
Check the content of measured or derived values. Scan parameters for impossible values (e.g., pH of 74; negative values where negative values are not possible). Review printed copies of the data file(s) and generate time series plots to detect anomalous values.
Perform statistical summaries (frequency of parameter occurrence) and review results.
If location is a parameter (latitude/longitude) then use scatter plots or GIS software to map each location to see if there are any errors in coordinates.
Verify data transfers (from field notebooks, data loggers, or instruments). For data transfers done by hand, consider double data entry (entering data twice, comparing the two data sets, and reconciling any differences). Where possible compare summary statistics before and after transfers.
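Several of the checks above can be automated. The following Python sketch looks for missing key values and impossible values, and compares two independently keyed copies of the same data (double data entry). The function names, the sample records, and the pH range of 0 to 14 are illustrative assumptions; acceptable ranges depend on the measurement.

```python
def find_qa_problems(records, key_fields, valid_ranges, missing=-9999):
    """Return (row number, field, message) for each problem found."""
    problems = []
    for number, record in enumerate(records, start=1):
        # Key parameters (station, date, ...) must never be missing.
        for field in key_fields:
            if record.get(field) in (None, "", missing):
                problems.append((number, field, "missing key value"))
        # Measured values must fall inside the range given for the field.
        for field, (low, high) in valid_ranges.items():
            value = record.get(field)
            if value is None or value == missing:
                continue  # missing measurements are allowed; they are flagged elsewhere
            if not low <= value <= high:
                problems.append((number, field, f"value {value} outside {low} to {high}"))
    return problems

def compare_double_entry(first, second):
    """Return (row number, first copy, second copy) wherever two copies differ."""
    differences = []
    if len(first) != len(second):
        differences.append(("row count", len(first), len(second)))
    for number, (a, b) in enumerate(zip(first, second), start=1):
        if a != b:
            differences.append((number, a, b))
    return differences

records = [
    {"Station": "HOGI", "Date": "19961001", "pH": 7.4},
    {"Station": "", "Date": "19961002", "pH": 74},
]
print(find_qa_problems(records, ["Station", "Date"], {"pH": (0, 14)}))
print(compare_double_entry([["HOGI", "12"], ["HOGI", "14"]],
                           [["HOGI", "12"], ["HOGI", "41"]]))
```

Automated checks of this kind supplement, but do not replace, reviewing plots and printed copies of the data.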

6. Assign Descriptive Data Set Titles

We recommend that data set titles be as descriptive as possible. When giving titles to your data sets and associated documentation, please be aware that these data sets may be accessed many years in the future by people who will be unaware of the details of the project.

Data set titles should contain the type of data and other information such as the date range, the location, and the instruments used. If your data set is part of a larger field project, you may want to add that name, too (e.g., LBA or SAFARI 2000). In addition, we recommend that the length of the title be restricted to 80 characters (spaces included) to be compatible with other global change data collections. The data set title should be similar to the name(s) of the data file(s) in the data set (see Sect. 1).

Some bad titles:

"The Aerostar 100 Data Set",
"Respiration Data",
"Amazonian Respiration Data"

A great title:

“LBA Respiration Data for Broadleaf Evergreen Trees in Rondonia, Brazil, 1999-2000”

7. Provide Documentation

The documentation accompanying your data set should be written for a user 20 years into the future: what does that investigator need to know to use your data? Write the document for a user who is unfamiliar with your project, methods, or observations.

For documentation to remain readable 20 years from now, it must be in a stable, non-proprietary format. We recommend ASCII format for text. If figures, maps, equations, or pictures need to be included, use a non-proprietary document format such as html (hypertext markup language). Images, figures, and pictures may be included as individual gif (graphics interchange format) or jpg (Joint Photographic Experts Group) files. Stable proprietary formats such as rtf (rich text format) or pdf (portable document format) are a suitable last resort.

The documentation should be in a separate file that is identified in the data file. The name of the documentation file should be similar to the name of the data set file.

The data set documentation should provide the following information:

The name of the data set, which will be the title of the documentation (see Sect. 6)
The scientific reason why the data were collected
What data were collected
What instruments (e.g., rain gauge, including model and serial number) and sources (e.g., meteorological station) were used
Who collected the data and who to contact with questions (include e-mail and Web address if appropriate)
Who funded the investigation
The name(s) of the data file(s) in the data set (see Sect. 1)
How to cite the data set
Where and with what spatial resolution the data were collected. If codes are used for location, be sure to define the codes in the documentation (e.g., HOGI in Sect. 4)
When and how frequently the data were collected
How each parameter was measured or produced (methods), its units of measure, the format used for the parameters in the data set, the precision and accuracy if known, and the relationship to other data in the data set if appropriate (see Sect. 3)
What the environmental conditions were (e.g., cloud cover, atmospheric influences, etc.)
The data processing that was performed, including screening
Standards or calibrations that were used
Software (including version number) used to prepare the data set
Software (including version number) needed to read the data set
The quality assurance and quality control that have been applied (see Sect. 5)
Special codes used, including those for missing values (see Sect. 3) or for stations (see Sect. 4)
The date the data set was last modified
Summary statistics generated directly from the final file
Example file record
Pertinent field notes or other companion files; the names of the files should be similar to the documentation and data file names
Related or ancillary data sets
Known problems that limit the data’s use

Documentation can never be too complete.

Bibliography

Christensen, S.W., T.A. Boden, L.A. Hook, and M.-D. Cheng. 2000. NARSTO Data Management Handbook. ORNL/CDIAC 112/R2. Oak Ridge, TN. Available on-line at: http://cdiac.esd.ornl.gov/programs/NARSTO/narsto.html.

Kanciruk, P., R.J. Olson, and R.A. McCord. 1986. Quality Control in Research Databases: The US Environmental Protection Agency National Surface Water Survey Experience. In: W.K. Michener (ed.). Research Data Management in the Ecological Sciences. The Belle W. Baruch Library in Marine Science, No. 16, 193-207.

Michener, W. K., J. W. Brunt, J. Helly, T. B. Kirchner, and S. G. Stafford. 1997. Non-Geospatial Metadata for Ecology. Ecological Applications. 7:330-342.

Michener, W.K. and J.W. Brunt (eds.). 2000. Ecological Data: Design, Management and Processing, Methods in Ecology, Blackwell Science. 180p.

ORNL DAAC. 2000. Guidelines for Producing Ecological Data Sets For Distribution and Archive. Oak Ridge National Laboratory Distributed Active Archive Center, Oak Ridge, Tennessee. In review.

Porter, J.H. 1997. Data and Information Submission at the Virginia Coast LTER. Available on-line at: http://www.vcrlter.virginia.edu/data/submission.html

USGS. 2000. Metadata in plain language. Available on-line at: http://geology.usgs.gov/tools/metadata/tools/doc/ctc/

¹Oak Ridge National Laboratory is operated by UT-Battelle, LLC, for the U.S. Department of Energy under contract DE-AC05-00OR22725. This work was sponsored by the U.S. National Aeronautics and Space Administration, Earth Science Data and Information Systems Project.