Best Practices for Preparing Environmental Data Sets to Share and Archive

(Previously entitled "Best Practices for Preparing Ecological and Ground-Based Data Sets to Share and Archive," Cook et al., 2001)

Updated by L. A. Hook, T. W. Beaty, S. Santhana-Vannan, L. Baskaran, and R. B. Cook. June 2007.

Oak Ridge National Laboratory Distributed Active Archive Center, Oak Ridge, Tennessee, U.S.A.

Environmental Sciences Division, Oak Ridge National Laboratory [1]

At the request of several field researchers, investigators, GIS and image specialists, and data managers, the following guidelines have been prepared for the data management practices that data collectors should follow to improve the usability of their data sets. This guidance is provided for those who perform environmental measurements, although many of the practices may be useful for other data collection and archiving activities.

We assembled what we feel are the most important practices that researchers could implement to make their data sets ready to share with other researchers. These practices could be performed at any time during the preparation of the data set, but we suggest that researchers consider them before measurements are taken. The order of the practices is not necessarily sequential, as a researcher could provide draft data set metadata before any measurements are taken.

The seven best practices are:

  1. Assign Descriptive File Names
  2. Use Consistent and Stable File Formats for Tabular and Image Data
  3. Define the Contents of Your Data Files
  4. Use Consistent Data Organization
  5. Perform Basic Quality Assurance
  6. Assign Descriptive Data Set Titles
  7. Provide Data Set Documentation

 

1. Assign Descriptive File Names

File names should reflect the contents of the file and include enough information to uniquely identify the data file. File names may contain information such as project acronym, study title, location, investigator, year(s) of study, data type, version number, and file type. The file name should be provided in the documentation (described in Sect. 7) and in the first line of the header rows in the file itself.

Clear, descriptive, and unique file names may be important later when your data file is combined in a directory or FTP site with your own data files or with the data files of other investigators. Avoid using file names such as mydata.dat or 1998.dat.

File names should be constructed for easy management by various data systems. Names should contain only numbers, letters, dashes, and underscores -- no spaces or special characters. Also, in general, lower-case names are less software and platform dependent and are preferred. When choosing a file name, check for any database management limitations on the use of special characters and file name length. For practical reasons of legibility and usability, file names should not be more than 64 characters in length and if well constructed could be considerably less.
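The naming rules above can be checked automatically before files are submitted. The following is a minimal sketch, not an official validator: the function name `check_file_name` and the example file names are hypothetical, and the rule that a name is a stem followed by one or more dot-separated extensions is an assumption layered on the guidance in this section.

```python
import re

# Assumed reading of the guidance: a stem made of letters, digits, dashes,
# and underscores, followed by one or more extensions (e.g. ".csv", ".csv.gz").
_NAME_PATTERN = re.compile(r"[A-Za-z0-9_-]+(\.[A-Za-z0-9]+)+")
MAX_LENGTH = 64  # upper limit suggested in the text


def check_file_name(name):
    """Return a list of problems found in a proposed data file name."""
    problems = []
    if len(name) > MAX_LENGTH:
        problems.append("longer than 64 characters")
    if not _NAME_PATTERN.fullmatch(name):
        problems.append("contains spaces or special characters")
    if name != name.lower():
        # Lower case is only a preference in the text, so treat this as a note.
        problems.append("contains upper-case letters")
    return problems


# Hypothetical file names used only for illustration.
print(check_file_name("bigfoot_agro_2000_gpp_v1.csv"))
print(check_file_name("mydata 1998.dat"))
```

An empty list means the name passes all three checks; each string in a non-empty list names one rule that the file name breaks.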

You may want to use similar logic when designing directory structures and names. Also, the data set title (see Sect. 6) should be similar to the data file name(s).

Version Number:  Including a data file creation date or version number enables data users to quickly determine which data they are using if an update to the data set is released (e.g., *_v1.csv, *_r1.csv, or *_20070227.csv).

File Type or Extensions:  Use *.txt, *.csv generally for tabular data. Section 2 addresses formats and extensions for image data files.

File Compression: Use *.zip, *.gz, or *.tar file extensions, as appropriate for the compression software.  The individual files may be compressed for space conservation or several files may be aggregated and then compressed as one file of reduced size. When multiple files are compressed together, the same file naming guidelines apply to the compressed collection of files.

Example Data File Names:

 

2. Use Consistent and Stable File Formats for Tabular and Image Data

 

In choosing a file format, data collectors should select a consistent format that can be read well into the future and is independent of changes in applications.

Tabular Data:

Using ASCII file formats is the best way to ensure that measurement data are readable in the future.

At the top of the file, include several header rows.

Within the ASCII file, follow these guidelines.

In the data set documentation, specifically add the following data file information.

Image (Raster) Data:

Some field researchers may generate Image (Raster) data sets. Below are some guidelines/recommendations for archiving these types of data files.

Suggested Non-Proprietary File Formats: (Listed in order of our preference. See file extension reference, Appendix A.)

If you cannot use any of the above formats, another option is to use any non-proprietary public domain data format. Whatever file format you use, be sure to thoroughly document the format and follow the suggested guidelines.

Guidelines for documenting image data files:

Proprietary Software Data Formats:

 

Data that are provided in a proprietary software format must include documentation of the software specifications (i.e., Software Package, Version, Vendor, and native platform). The archive data center will use this information to convert to a non-proprietary format for the archive.

 

All file types that constitute the complete geographic data format documentation must be provided for the specific software package. For example:

 

Image (Vector) Data: 

Below are suggested vector file formats. These are mostly proprietary data formats; please be sure to document the Software Package, Version, Vendor, and native platform.

Also make sure that the vectors are properly geo-referenced and that the geometry type (Point, Line, Polygon, Multipoint, etc.) is specified.

 

File Extension Reference Table

A table of common file extensions and their generally accepted formats is provided in Appendix A.

 

On-line Resources:

 

3. Define the Contents of Your Data Files

In order for others to use your data, they must fully understand the contents of the data set, including the parameter names, units of measure, formats, and definitions of coded values. Provide the English language translation of any data values and descriptors that are in another language (e.g., coded fields, variable classes, and GIS coverage attributes).

Parameter Name:  The parameters reported in the data set need to have names that describe the contents. The documentation should contain a full description of the parameter. Use commonly accepted parameter names, for example, Temp for temperature, Precip for precipitation, and Lat and Long for latitude and longitude. See the online references in the Bibliography for additional examples. Also, be sure to use consistent capitalization (not temp, Temp, and TEMP) and use only letters, numerals, and underscores in the parameter name.

Units:  The units of reported parameters need to be explicitly stated in the data file and in the documentation. We recommend SI units but recognize that each discipline may have its own commonly used units of measure. The critical aspect here is that the units be defined in the documentation so that others understand what is reported.

Formats:  Within each data set, choose a format for each parameter, explain the format in the documentation, and use that format throughout the data set. Consistent formats are particularly important for dates, times, and spatial coordinates. For numeric parameters, if the number of decimal places should be preserved to indicate significant digits, then explicitly define the format such that users may take precautions to ensure that significant figures are not lost or gained during data transformations.

We recommend the following formats for common parameters:

Dates: yyyy-mm-dd or yyyymmdd, e.g., January 2, 1997 is 1997-01-02 or 19970102.
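As an illustration of the two recommended date layouts, the sketch below formats one date both ways using Python's standard library. The function name `format_date` is hypothetical.

```python
from datetime import date


def format_date(d, compact=False):
    """Format a date as yyyy-mm-dd, or as yyyymmdd when compact is True."""
    return d.strftime("%Y%m%d") if compact else d.isoformat()


d = date(1997, 1, 2)  # January 2, 1997, the example used in the text
print(format_date(d))
print(format_date(d, compact=True))
```

Whichever layout is chosen, the point of the guidance is to use it for every date in the data set.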

Time: Use 24-hour notation (13:30 hrs instead of 1:30 p.m. and 04:30 instead of 4:30 a.m.). Report in both local time and Coordinated Universal Time (UTC). Include the local time zone in a separate field. As appropriate, both the begin time and end time should be reported in both local time and UTC. Because UTC and local time may fall on different days, we suggest that a date be given for each time reported. Applicable date and time standards are listed in Appendix B.
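The reason for reporting a date with every time can be seen in a short sketch. It assumes a fixed local offset from UTC (here UTC-5, chosen only for illustration); the function name `local_and_utc` and the field names in the returned dictionary are hypothetical. Sites that observe daylight saving time would need the offset in force at the time of each measurement.

```python
from datetime import datetime, timedelta, timezone


def local_and_utc(local_naive, utc_offset_hours):
    """Return date/time fields in local time and UTC for one observation."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    local = local_naive.replace(tzinfo=tz)   # attach the assumed fixed offset
    utc = local.astimezone(timezone.utc)     # same instant expressed in UTC
    return {
        "date_local": local.strftime("%Y-%m-%d"),
        "time_local": local.strftime("%H:%M"),
        "utc_offset": local.strftime("%z"),
        "date_utc": utc.strftime("%Y-%m-%d"),
        "time_utc": utc.strftime("%H:%M"),
    }


# An evening measurement at UTC-5 falls on the next calendar day in UTC.
fields = local_and_utc(datetime(1997, 1, 2, 21, 30), -5)
print(fields)
```

In this example the local date is 1997-01-02 while the UTC date is 1997-01-03, which is why a single shared date column would be ambiguous.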

Spatial Coordinates: Spatial coordinates should be recorded in decimal degrees format to at least 4 (preferably 5 or 6) digits past the decimal point. Provide latitude and longitude with south latitude and west longitude recorded as negative values, e.g., 80° 30' 00" W longitude is -80.5000. Make sure that all location information in a file uses the same coordinate system, including coordinate type, datum, and spheroid. Document all three of these characteristics (e.g., Lat/Long decimal degrees, NAD83 (North American Datum of 1983), WGS84 (World Geodetic System of 1984)). Mixing coordinate systems [e.g., NAD83 and NAD27 (North American Datum of 1927)] will cause errors in any geographic analysis of the data. Applicable spatial coordinate standards are listed in Appendix C.
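The conversion from degrees, minutes, and seconds to signed decimal degrees is simple arithmetic: degrees + minutes/60 + seconds/3600, negated for south latitude and west longitude. A minimal sketch follows; the function names `dms_to_decimal` and `format_coordinate` are hypothetical.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to signed decimal degrees."""
    hemisphere = hemisphere.upper()
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError("hemisphere must be one of N, S, E, W")
    value = degrees + minutes / 60 + seconds / 3600
    # South latitude and west longitude are recorded as negative values.
    if hemisphere in ("S", "W"):
        value = -value
    return round(value, 6)


def format_coordinate(value, places=4):
    """Write a coordinate with a fixed number of digits past the decimal point."""
    return f"{value:.{places}f}"


# The example from the text: 80 degrees 30' 00" W longitude.
print(format_coordinate(dms_to_decimal(80, 30, 0, "W")))
```

Note that this only changes the notation. It does not convert between datums such as NAD27 and NAD83, which requires dedicated geodetic software.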

Elevation: Provide elevation in meters. Include detailed information on the vertical datum used (e.g., North American Vertical Datum 1988 (NAVD 1988) or Australian Height Datum (AHD)). Additional information on vertical datums is included in Appendix D.

 

Coded Fields: 

Coded fields, as opposed to free text fields, often have standardized lists of predefined values from which the data provider may choose.  Two good examples are U.S. state abbreviations and postal zip codes.  Data collectors may establish their own coded fields with defined values to be consistently used across several data files. The use of consistent sampling site designations is a good application. Coded fields are more efficient for storage and retrieval of data than free text fields.

Guidance for two specific coded fields commonly used in environmental data files:

Data Flag or Qualifying Values:  A separate field with specified values may be used to provide additional information about the measured data value including, for example, quality considerations, reasons for missing values, or indicating replicated samples. Codes should not be parameter specific but should be consistent across parameters and data files. Definitions of flag codes should be included in the accompanying data set documentation.

  Example documentation of Data Quality Flag values:

Flag Value | Description
V0 | Valid value
V1 | Valid value but comprised wholly or partially of below detection limit data
V2 | Valid estimated value
V3 | Valid interpolated value
V4 | Valid value despite failing to meet some QC or statistical criteria
V5 | Valid value but qualified because of possible contamination (e.g., pollution source, laboratory contamination source)
V6 | Valid value but qualified due to non-standard sampling conditions (e.g., instrument malfunction, sample handling)
V7 | Valid value but set equal to the detection limit (DL) because the measured value was below the DL
M1 | Missing value because no value is available
M2 | Missing value because invalidated by data originator
H1 | Historical data that have not been assessed or validated
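A data user might apply the flag values listed above as in the following sketch. The grouping of codes into "valid", "missing", and "unassessed" follows the wording of the flag descriptions, but the function name `classify_flag` and the category labels are assumptions made for illustration.

```python
# Flag codes copied from the example documentation above.
VALID_FLAGS = {"V0", "V1", "V2", "V3", "V4", "V5", "V6", "V7"}
MISSING_FLAGS = {"M1", "M2"}
HISTORICAL_FLAGS = {"H1"}


def classify_flag(flag):
    """Map a data quality flag to a broad category (labels are hypothetical)."""
    if flag in VALID_FLAGS:
        return "valid"
    if flag in MISSING_FLAGS:
        return "missing"
    if flag in HISTORICAL_FLAGS:
        return "unassessed"
    # An undefined code usually signals a typo or undocumented flag.
    raise ValueError(f"undefined flag: {flag}")


print(classify_flag("V3"))
print(classify_flag("M2"))
```

Rejecting undefined codes outright is a design choice: it surfaces flags that appear in the data file but not in the documentation.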

Units:  While data collectors can generally agree on the units for reporting measured parameters, the exact syntax of the units designation varies widely among programs, projects, scientific communities, and investigators (if standardized at all).  If a shorthand notation is reported in the data file, the complete units should be spelled out in the documentation so that others can understand and interpret your representation of subscripts, superscripts, area, time intervals, etc.

 

Missing Values:  Use a specified extreme value not likely to ever be confused with a measured value (e.g., -9999).  Consistently use the same notation for each missing value in the data file.
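The sketch below shows why a documented, consistent missing-value code matters to a data user: the code must be excluded before any statistic is computed. The function name `mean_ignoring_missing` and the sample values are hypothetical; -9999 is the example code from the text.

```python
MISSING = -9999.0  # missing-value code, as in the example in the text


def mean_ignoring_missing(values, missing=MISSING):
    """Average the values, skipping entries equal to the missing-value code."""
    valid = [v for v in values if v != missing]
    if not valid:
        return None  # no measured values at all
    return sum(valid) / len(valid)


readings = [1.0, -9999.0, 3.0]  # made-up values for illustration
print(mean_ignoring_missing(readings))
print(sum(readings) / len(readings))  # naive mean, distorted by the code
```

If the code were undocumented, or varied within the file, a user could not reliably separate it from measured values.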

 

Typical Parameter Documentation:

The following text describes the parameters in a data set; this type of description should be included in the data set documentation.

Data File Contents:  (kt_tree_data.csv)  The files are in comma-delimited ASCII format, with the first line listing the data set, author, and date. The data records follow and are described in the table below. A value of -9.99 indicates no data.

Column | Description | Units/Format
SITE | k=Kataba forest, p=Pandamatenga, m=Near Maun, e=HOORC/MPG Maun tower, o=Okwa river crossing, t=Tshane, skukuza=Skukuza Flux Tower | text
SPECIES | Scientific name up to 25 characters | text
DATE | Date of measurement | yyyymmdd
BA | Woody plant basal area | m2/ha
SEBA | Standard error of BA | m2/ha
DENSITY | Woody plant density (number of trees per hectare) | number/ha
SEDEN | Standard error of DENSITY (n=42 for KT, n=49 for Skukuza) | number/ha
STEMS | Number of stems per hectare (/ha) | number/ha
HEIGHT | Basal area-weighted average height | m
WOOD | Aboveground woody plant wood dry biomass | kg/ha
LEAF | Aboveground woody plant leaf dry biomass | kg/ha
LAI | Leaf Area Index calculated by allometry | m2/m2

[ Adapted from Scholes, R. J. 2005. SAFARI 2000 Woody Vegetation Characteristics of Kalahari and Skukuza Sites. Data set. Available on-line [http://daac.ornl.gov/] from Oak Ridge National Laboratory Distributed Active Archive Center, Oak Ridge, Tennessee, U.S.A. ]

 

Data File Contents: (NARSTO_EPA_SS_HOUSTON_FRASER_ORG_SPEC_24HR_V1.txt)

COLUMN NAME | NAME TYPE | CAS IDENTIFIER | UNITS | FORMAT TYPE | FORMAT FOR DISPLAY | MISSING CODE | OBSERVATION TYPE | SAMPLE PREPARATION | BLANK CORRECTION
Site ID: standard | Variable | None | None | Char | 12 | None | Supplementary data | Not applicable | Not applicable
Date start: local time | Variable | None | yyyy/mm/dd | Date | 10 | None | Supplementary data | Not applicable | Not applicable
Time start: local time | Variable | None | hh:mm | Time | 5 | None | Supplementary data | Not applicable | Not applicable
Date end: local time | Variable | None | yyyy/mm/dd | Date | 10 | None | Supplementary data | Not applicable | Not applicable
Time end: local time | Variable | None | hh:mm | Time | 5 | None | Supplementary data | Not applicable | Not applicable
Time zone: local | Variable | None | None | Char | 3 | None | Supplementary data | Not applicable | Not applicable
Date start: UTC | Variable | None | yyyy/mm/dd | Date | 10 | None | Supplementary data | Not applicable | Not applicable
Time start: UTC | Variable | None | hh:mm | Time | 5 | None | Supplementary data | Not applicable | Not applicable
Date end: UTC | Variable | None | yyyy/mm/dd | Date | 10 | None | Supplementary data | Not applicable | Not applicable
Time end: UTC | Variable | None | hh:mm | Time | 5 | None | Supplementary data | Not applicable | Not applicable
Fluoranthene | Variable | C206-44-0 | ng/m3 (nanogram per cubic meter) | Decimal | 8.2 | -999.99 | Particles | Organic extraction | Blank corrected
Fluoranthene | Flag | C206-44-0 | None | Char | 2 | None | Particles | Organic extraction | Blank corrected
Pyrene | Variable | C129-00-0 | ng/m3 (nanogram per cubic meter) | Decimal | 8.2 | -999.99 | Particles | Organic extraction | Blank corrected
Pyrene | Flag | C129-00-0 | None | Char | 2 | None | Particles | Organic extraction | Blank corrected
Benz[a]anthracene | Variable | C56-55-3 | ng/m3 (nanogram per cubic meter) | Decimal | 8.2 | -999.99 | Particles | Organic extraction | Blank corrected
Benz[a]anthracene | Flag | C56-55-3 | None | Char | 2 | None | Particles | Organic extraction | Blank corrected

[ Adapted from Fraser, Matthew. 2003. NARSTO EPA_SS_HOUSTON TEXAQS2000 PM2.5 Organic Speciation Data. Available on-line (http://eosweb.larc.nasa.gov/PRODOCS/narsto/table_narsto.html) at the Langley DAAC, Hampton, Virginia, U.S.A. ]

 

4. Use Consistent Data Organization

We recommend that you organize the data within a file in one of two ways. Whichever style you use, be sure to place each observation in a separate line (row). Most often each row in a file represents a complete record, and the columns represent all the parameters that make up the record. This arrangement is similar to a spreadsheet or matrix. For example:

 
SAFARI 2000 Plant and Soil C and N Isotopes, Southern Africa, 1995-2000
SITE,COUNTRY,LAT,LONG,DATE,START_DEPTH,END_DEPTH,CHARACTERISTICS,C,N,d13C,d15N
units,none,decimal degrees,decimal degrees,yyyy/mm/dd,cm,cm,none,percent,percent,per mil,per mil
USGS-1,Botswana,-21.62,27.37,1999/07/12,5,20,Hardveld,0.67,0.052,-17,8.9
USGS-2,Botswana,-21.07,27.42,1999/07/12,5,20,Hardveld,0.68,0.063,-18.3,8
USGS-3,Botswana,-20.72,26.83,1999/07/12,5,20,Hardveld,0.94,0.087,-17,6.8
USGS-4,Botswana,-20.52,26.41,1999/07/12,5,20,Hardveld,0.53,0.04,-19.9,5.5
USGS-5,Botswana,-20.55,26.15,1999/07/12,5,20,Lacustrine,2.11,0.162,-15.2,5.9
...
USGS-30,Botswana,-19.81,23.63,1999/07/18,5,20,Alluvium,0.67,0.063,-19.2,11.8
USGS-31,Botswana,-20.62,22.74,1999/07/18,5,20,Hardveld,0.23,0.014,-16.8,16.2
USGS-32,Botswana,-21.06,22.4,1999/07/18,5,20,Hardveld,0.39,0.028,-20.9,9.5
USGS-33,Botswana,-22.01,21.37,1999/07/19,5,20,Sandveld,0.19,0.01,-17.9,9.1
USGS-34,Botswana,-22.99,22.18,1999/07/19,5,20,Sandveld,0.16,0.006,-19.7,8.7
USGS-35,Botswana,-23.7,22.8,1999/07/19,5,20,Sandveld,0.37,0.019,-20.7,15.2
[ From:  Aranibar, J. N., and S. A. Macko. 2005. SAFARI 2000 Plant and Soil C and N Isotopes, Southern Africa, 1995-2000. 
Data set. Available on-line [http://daac.ornl.gov/] from Oak Ridge National Laboratory Distributed Active Archive Center, 
Oak Ridge, Tennessee, U.S.A. ]

 

If you use a coded value or abbreviation for a site or station, be sure to provide a definition, including spatial coordinates, in the documentation.

A second arrangement may be more efficient when most records do not have measurements for most parameters, that is, a very sparse matrix of data, with many missing values. In this arrangement, one column is used to define the parameter and another column is used for the value of the parameter. Other columns may be used for site, date, treatment, units of measure, etc. For example:

 

 
  Coast redwood NPP data from Humboldt Redwoods State Park, California, USA; Busing & Fujimori, June 2005 
  Old stand plot study at Bull Creek with bole diameter measurements at 1.7 m aboveground in 1972 and 2001 
Orig_sort_order | Parameter | Measurement_Type | Value | Units | Species | Sequoia_sp_grav | Equation
1 | Latitude | Site Characteristics | 40.35 | decimal degree | Not applicable | -999.9 | Not applicable
2 | Longitude | Site Characteristics | -123 | decimal degree | Not applicable | -999.9 | Not applicable
3 | Terrain | Site Characteristics | Alluvial flat | Not applicable | Not applicable | -999.9 | Not applicable
4 | Slope | Site Characteristics | 0 | degree | Not applicable | -999.9 | Not applicable
5 | Elevation (above mean sea level) | Site Characteristics | 80 | m (meter) | Not applicable | -999.9 | Not applicable
6 | Total site area | Site Characteristics | 1.44 | ha (hectare) | Not applicable | -999.9 | Not applicable
7 | Density | Density | 380 | stems/ha (stems per hectare) | All species | -999.9 | Not applicable
8 | Basal area | Area | 330 | m2/ha (square meter per hectare) | All species | -999.9 | Not applicable
9 | Basal area | Area | 329 | m2/ha (square meter per hectare) | Sequoia | -999.9 | Not applicable
...
123 | Total tree ANPP | ANPP | 581-697 | g/m2/yr (gram per square meter per year) | All species | 0.33 | eq. 2 estimates
124 | Total tree ANPP | ANPP | 669-802 | g/m2/yr (gram per square meter per year) | All species | 0.38 | eq. 2 estimates
               
Sequoia_sp_grav:  *Specific gravity, 0.33 mg/cm3, see WE Westman & RH Whittaker, 1975, J. Ecol. for details. 
Sequoia_sp_grav:  ^Specific gravity, 0.38 mg/cm3, from DW Green et al., 1999, USDA Forest Service FPL-GTR-113. 
Method:  **Calculations & allometric equations described by RT Busing & T Fujimori, 2005, Plant Ecol. 
Notes:  ***Range of values results from min. & max. estimation ratios of WE Westman & RH Whittaker, 1975, J. Ecol.
 
From:  Busing, R. T., and T. Fujimori. 2005. NPP Temperate Forest: Humboldt Redwoods State Park, California, U.S.A., 1972-2001. 
Data set. Available on-line [http://www.daac.ornl.gov] from Oak Ridge National Laboratory Distributed Active Archive Center, 
Oak Ridge, Tennessee, U.S.A.
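The two arrangements described in this section are interchangeable. As a rough sketch, the code below converts a record-per-row file (first arrangement) into parameter/value rows (second arrangement), dropping missing values along the way. The function name `wide_to_long`, the output column names, and the sample data are hypothetical.

```python
import csv
import io


def wide_to_long(csv_text, id_columns, missing="-9999"):
    """Turn one-record-per-row CSV text into one row per parameter value."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for record in reader:
        for name, value in record.items():
            # Identifier columns are repeated on every output row instead.
            if name in id_columns or value in (missing, ""):
                continue
            row = {col: record[col] for col in id_columns}
            row["Parameter"] = name
            row["Value"] = value
            rows.append(row)
    return rows


# Made-up sparse data: each site reports only one of the two parameters.
sample = "SITE,DATE,C,N\nA,1999-07-12,0.67,-9999\nB,1999-07-12,-9999,0.063\n"
for row in wide_to_long(sample, id_columns=["SITE", "DATE"]):
    print(row)
```

For a sparse matrix this form stores only the values that exist, at the cost of repeating the identifying columns on every row.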
 
 
Keep Similar Information Together
An important issue with data organization is the number of records in each file (file size). There are a number of factors that determine the optimal number of records in a file, and we don't have any hard and fast rules. In general, keep a set of similar measurements together (e.g., same investigator, methods, time basis, and instruments) in one data file. Please do not break up your data into many small files, e.g., by month or by site if you are working with several months or sites. Instead, make month or site a parameter and have all the data in one large file. Researchers who later use your relatively large data file won't have to process many small files individually. There is an upper limit to the size of files, though. Large files (on the order of several tens of thousands of records, or several tens of megabytes) do become unwieldy and may be too large for some applications. For example, Excel 2003 supports only 65,536 rows and 256 columns of data. (Excel 2007 raises these limits to 1,048,576 rows and 16,384 columns.)  Large tabular data files may need to be broken into logical smaller files.
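Where data already exist as many small per-site files, they can be merged by turning the site into a column, as suggested above. The sketch below works on CSV text held in memory so that it is self-contained; the function name `combine_files`, the `SITE` column name, and the sample values are hypothetical, and it assumes every input file has the same header.

```python
import csv
import io


def combine_files(files_by_site):
    """Merge per-site CSV texts into one CSV text with a leading SITE column."""
    out = io.StringIO()
    writer = None
    header = None
    for site, text in files_by_site.items():
        reader = csv.DictReader(io.StringIO(text))
        columns = ["SITE"] + list(reader.fieldnames)
        if writer is None:
            header = columns
            writer = csv.DictWriter(out, fieldnames=header, lineterminator="\n")
            writer.writeheader()
        elif columns != header:
            raise ValueError(f"header mismatch in file for site {site}")
        for record in reader:
            writer.writerow({"SITE": site, **record})
    return out.getvalue()


# Made-up per-site contents for illustration.
combined = combine_files({
    "k": "DATE,LAI\n20000301,1.2\n",
    "p": "DATE,LAI\n20000302,0.9\n",
})
print(combined)
```

Raising an error on a header mismatch is deliberate: silently merging files with different columns would hide exactly the inconsistency this section warns against.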

Organization by Data Type

If you are collecting many observations of several different types of measurements at a site (e.g., leaf area index and above- and belowground biomass), place each type of measurement in a separate data file. For each data file, use similar data organization, parameter formats, and common site names, so that users understand the interrelationships between data files.

Data types collected on different time bases (e.g., per hour, per day, per year) might be handled more efficiently in separate files.

Alternatively, if relatively few observations are made at a site for a suite of parameters, then all data could be placed in one file. Thorough data set documentation would be needed.

5. Perform Basic Quality Assurance

In addition to scientific quality assurance (QA), we suggest that you perform basic data QA on the data files. These checks complement the Tabular and Image file preparation guidance provided in Section 2.

Tabular Data

Image Vector and Raster Data

For GIS image/vector files, ensure that the projection parameters have been accurately given. Additional information such as data type, scale, corner coordinates, missing data value, size of image, number of bands, and endian type should be checked for accuracy.

A checklist with suggested reviews for spatial data file attributes and accompanying documentation is included in Appendix E.

 

6. Assign Descriptive Data Set Titles

We recommend that data set titles be as descriptive as possible. When giving titles to your data sets and associated documentation, please be aware that these data sets may be accessed many years in the future by people who will be unaware of the details of the project.

Data set titles should contain the type of data and other information such as the date range, the location, and the instruments used. If your data set is part of a larger field project, you may want to add that name, too (e.g., SAFARI 2000 or LBA-ECO). In addition, we recommend that the length of the title be restricted to 80 characters (spaces included) to be compatible with other clearinghouses of ecological and global change data collections. Names should contain only numbers, letters, dashes, underscores and spaces -- no special characters. The data set title should be similar to the name(s) of the data file(s) in the data set (see Sect. 1). A given data set might contain only one data file or many thousands of data files.

Some bad titles:

Some great titles:

7. Provide Data Set Documentation

Characteristics:

The documentation accompanying your data set should be written for a user 20 years into the future. Therefore you should consider what that investigator needs to know to use your data. Write the document for a user who is unfamiliar with your project, sites, methods, or observations.

To ensure that documentation can be read well into the future, it must be in a stable, non-proprietary format. If figures, maps, equations, or pictures need to be included, use a non-proprietary document format such as html (hypertext markup language). Images, figures, and pictures may be included as individual gif (graphics interchange format) or jpg (Joint Photographic Experts Group) files. Converting documents to a stable proprietary format, such as Adobe pdf (portable document format) files, is also a good choice.

The documentation should be in a separate file that is identified in the data file. The name of the documentation file should be similar to the name of the data set.

The data set documentation should provide the following information:

Documentation can never be too complete.

Bibliography

Kanciruk, P., R.J. Olson, and R.A. McCord. 1986. Quality Control in Research Databases: The US Environmental Protection Agency National Surface Water Survey Experience. In: W.K. Michener (ed.). Research Data Management in the Ecological Sciences. The Belle W. Baruch Library in Marine Science, No. 16, 193-207.

Michener, W. K., J. W. Brunt, J. Helly, T. B. Kirchner, and S. G. Stafford. 1997. Non-Geospatial Metadata for Ecology. Ecological Applications. 7:330-342.

Michener, W.K. and J.W. Brunt (ed.). 2000. Ecological Data: Design, Management and Processing, Methods in Ecology, Blackwell Science. 180p.

Michener, W. K. 2006. Meta-information concepts for ecological data management. Ecological Informatics. 1:3-7.

Cook, Robert B, Richard J. Olson, Paul Kanciruk, and Leslie A. Hook.  2001.  Best Practices for Preparing Ecological Data Sets to Share and Archive.  Bulletin of the Ecological Society of America, Vol. 82, No. 2, April 2001.

Ball, C. A., G. Sherlock, and A. Brazma. 2004. Funding high-throughput data sharing. Nature Biotechnology  22:1179-1183. doi:10.1038/nbt0904-1179.

USGS. 2000. Metadata in plain language. Available on-line at: http://geology.usgs.gov/tools/metadata/tools/doc/ctc/

Christensen, S. W. and L. A. Hook. 2007. NARSTO Data and Metadata Reporting Guidance. Provides reference tables of chemical, physical, and metadata variable names for atmospheric measurements. Available on-line at: http://cdiac.ornl.gov/programs/NARSTO/

U.S. EPA. 2007. Environmental Protection Agency Substance Registry System (SRS). SRS provides information on substances and organisms and how they are represented in the EPA information systems. Available on-line at: http://www.epa.gov/srs/

Olsen, L.M., G. Major, K. Shein, J. Scialdone, R. Vogel, S. Leicester, H. Weir, S. Ritz, T. Stevens, M. Meaux, C. Solomon, R. Bilodeau, M. Holland, T. Northcutt, and R. A. Restrepo. 2007. NASA/Global Change Master Directory (GCMD) Earth Science Keywords. Version 6.0.0.0.0. Available on-line at: http://gcmd.gsfc.nasa.gov/Resources/valids/archives/keyword_list.html

 

[1] Oak Ridge National Laboratory is operated by UT-Battelle, LLC, for the U.S. Department of Energy under contract DE-AC05-00OR22725. This work was sponsored by the U.S. National Aeronautics and Space Administration, Earth Science Data and Information Systems Project.

 

 

Appendix A

Suggested image and GIS data file formats suitable for long-term archiving.

File Extension Reference Table

File Extension | File Format Description
.asc | ASCII Text
.avl | ESRI ArcView 3.x legend file
.bil | band interleaved by line raster image file
.bip | band interleaved by pixel raster image file
.blw | world file for .bil image file
.bmp | Windows OS/2 Bitmap Graphics
.bpw | world file for .bip/.bmp image file
.bqw | world file for .bsq image file
.bsq | band sequential raster image file
.csv | comma-separated values
.dbf | ARCVIEW shape file dbase tabular data file
.doc | Microsoft Word document
.e00 | ESRI Arc/Info export file
.evf | ENVI vector file
.flt | Binary Floating Point file - Similar to ASCII grid file with Float data values
.hdf | HDF is a physical file format for storing scientific data. It features a collection of tools for writing, manipulating, viewing, and analyzing data across diverse computing platforms. HDF-EOS supports three geospatial data types (grid, point, and swath), providing uniform access to diverse data types in a geospatial context. The HDF-EOS software library allows a user to query or subset the contents of a file by earth coordinates and time (if there is a spatial dimension in the data). Tools that process standard HDF files will also read HDF-EOS files; however, standard HDF library calls cannot access geolocation data, time data, and product metadata as easily as with HDF-EOS library calls.  http://www.hdfeos.org/index.php
.hdr | ENVI software image file data format documentation
.img | Raster Image (Many format types)
.gif | Graphics Interchange Format
.gz | file compressed using the Gnu GZIP algorithm. Compressed files will typically have the .gz appended to the original file extension.
.jpg (.jpeg) | Joint Photographic Experts Group raster image
.nc | NetCDF (network Common Data Form) [ http://www.unidata.ucar.edu/software/netcdf/ ]
.png | Portable Network Graphic raster image ( http://www.libpng.org/pub/png/ )
.ppt | Microsoft PowerPoint presentation
.pdf | Adobe Acrobat document
.prj | ARCVIEW shape file projection information, which is a text file that you can read.
.rdc | idrisi software image file data format documentation
.rrd | IMAGINE software image file data format documentation
.rst | idrisi software image file data format documentation
.sbn | ARCVIEW shape file spatial index for read-write shapefiles
.sbx | ARCVIEW shape file spatial index for read-write shapefiles.
.shp | ARCVIEW shape file feature geometry
.shx | ARCVIEW shape file lookup index
.tfw | TIF world file of projection information
.tif (.tiff) | Tagged Image File Format raster image
.tiff | GeoTIFF -- Geographic tagged image file format  [ http://www.remotesensing.org/geotiff/geotiff.html ]
.txt | Text file

 

 

Appendix B

 

Applicable date and time standards suitable for long-term archiving of environmental data.

Applicable Date and Time Standards

 

The ISO 8601 international standard date notation is YYYY-MM-DD:

where YYYY is the year in the usual Gregorian calendar, MM is the month of the year between 01 (January) and 12 (December), and DD is the day of the month between 01 and 31.

For example, the fourth day of February in the year 1995 is written in the standard notation as 1995-02-04

The hyphens can be omitted if compactness of the representation is more important than human readability, for example as in 19950204

If only the month or even only the year is of interest:  1995-02 or 1995

ISO 8601 uses the 24-hour clock system that is used by most of the world.

The basic format is [hh][mm][ss] and the extended format is [hh]:[mm]:[ss]. [hh] refers to a zero-padded hour between 00 and 24, where 24 is only used to notate the midnight at the end of a calendar date. [mm] refers to a minute between 00 and 59. [ss] refers to a second between 00 and 59. So a time might appear as "13:47:30" or "134730".

Fractions may also be used with any of the three time elements. These are indicated by using the decimal point. A fraction may only refer to the most precise component of a time representation – that is, to denote "14 hours, 30 and one half minutes", do not include a seconds figure. Represent it as "14:30.5" or "1430.5".

Midnight is a special case and can be referred to as both "00:00" and "24:00". A time of  "00:00" is used at the beginning of the day, and is the most frequently used notation.
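The basic and extended time formats described above can both be read with Python's standard library, as in the following sketch. The function name `parse_iso_time` is hypothetical. Note that `datetime.strptime` accepts hours 00 to 23 only, so the "24:00" end-of-day notation and fractional minutes such as "14:30.5" are not handled here and would need separate treatment.

```python
from datetime import datetime


def parse_iso_time(text):
    """Parse hh:mm:ss (extended format) or hhmmss (basic format)."""
    fmt = "%H:%M:%S" if ":" in text else "%H%M%S"
    return datetime.strptime(text, fmt).time()


# Both notations from the text describe the same time of day.
extended = parse_iso_time("13:47:30")
basic = parse_iso_time("134730")
print(extended, basic, extended == basic)
```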

References

ISO 8601:2004, Data elements and interchange formats—Information interchange—Representation of dates and times. 

ISO publications are available from the International Organization for Standardization, <http://www.iso.ch/iso/en/prods-services/ISOstore/store.html>.

Summarized in Wikipedia:  http://en.wikipedia.org/wiki/ISO_8601

Summarized in Wikipedia: http://en.wikipedia.org/wiki/Coordinated_Universal_Time

 

 

Appendix C

Applicable spatial coordinate standards suitable for long-term archiving of image and GIS data.

Applicable Spatial Coordinate Standards

Coordinates derived from the Global Positioning System (GPS) may use one of the following additional reference datums:

ETRS89 (European Terrestrial Reference System 1989)
WGS84 (World Geodetic System 1984)
WGS84 (G730) (World Geodetic System 1984, upgrade G730)
WGS84 (G873) (World Geodetic System 1984, upgrade G873)

Applicable Standards

FGDC Spatial Data Transfer Standard (SDTS), Part 6: Point Profile, FGDC-STD-002.6. [http://www.fgdc.gov/standards/projects/FGDC-standards-projects/index_html]

ISO DIS 6709, Standard Representation of Geographic Point Location by Coordinates.

ISO publications are available from the International Organization for Standardization, <http://www.iso.ch/iso/en/prods-services/ISOstore/store.html>.

Summarized in Wikipedia: http://en.wikipedia.org/wiki/ISO_6709

 

 

Appendix D

Additional information for reporting elevations.

Additional Information on Vertical Datum

Vertical Datum
Vertical datums are a considerable challenge for cartographers in the marine world. Ultimately all datasets should refer all depths to WGS84 Datum (or equivalent) to create a seamless database. This is relatively straightforward for land data as geoidal models can be used to derive the separation between local land datum and a global reference surface. However, Chart Datum, to which all soundings are referred, is not a coherent surface. It is certainly not easy to model. (www.hydrographicsociety.org/PDF/Journal-113-Article2.pdf)

The National Geodetic Survey (NGS) develops and maintains the current national geodetic vertical datum, NAVD 88. In addition, NGS provides the relationships between past and current geodetic vertical datums, e.g., NGVD 29 and NAVD 88. However, another part of the NGS parent organization, the National Ocean Service (NOS), is the Center for Operational Oceanographic Products and Services (CO-OPS). CO-OPS publishes tidal bench mark information and the relationship between NAVD 88 and various water level/tidal datums (e.g., Mean Lower Low Water, Mean High Water, Mean Tide Level, etc.). (http://www.ngs.noaa.gov/faq.shtml)

 

 

Appendix E

Following is a checklist for the quality assurance of image vector and raster data with suggested reviews for data file attributes and accompanying documentation.

For GIS image/vector files, ensure that the projection parameters have been accurately given. Additional information such as data type, scale, corner coordinates, missing data value, size of image, number of bands, and endian type (byte order) should also be checked for accuracy.

Checklist for Image Vector and Raster Data File Characteristics:

  • File size
    • For binary data, does n(rows) * n(cols) * n(bands) * (bytes per pixel) equal the file size?
  • Data format
    • Is the data in the format indicated by the file extension and documentation?
    • Is the data readable in standard GIS/Image processing software (e.g., ENVI, Erdas IMAGINE, or ArcGIS)?
    • Data format specific issues:
      • Is the header information accurate for HDF files?
      • Are ASCII file values expressed as plain numbers (e.g., 0.000222, not scientific notation)?
    • For multi-band images, is the number of bands in the data equal to that specified in the documentation?
    • Is additional documentation required for the data format (i.e., is the file not self-documenting)?
  • Georeferencing Information
    • Projection
      • Is projection information provided?
      • If provided, is it correct? (check with other accurately projected data)
    • Spatial extent
      • Does the data extent match the documentation?
      • Does the data extent match the number of pixels and resolution of the data?
    • Spatial resolution
      • Is the resolution specified?
      • Are the resolution units (meters, feet, decimal degrees, etc.) correct?
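
Two of the checks in the list above are simple arithmetic and can be automated. The following is a minimal sketch in Python, assuming a flat binary file with no embedded header and a single axis measured in the same units as the resolution. The function names are hypothetical and are not part of any GIS package.

```python
import math
import os


def expected_size_bytes(n_rows, n_cols, n_bands, bytes_per_pixel):
    """Size in bytes of a headerless binary raster."""
    return n_rows * n_cols * n_bands * bytes_per_pixel


def file_size_matches(path, n_rows, n_cols, n_bands, bytes_per_pixel):
    """True if the file on disk has exactly the expected size."""
    expected = expected_size_bytes(n_rows, n_cols, n_bands, bytes_per_pixel)
    return os.path.getsize(path) == expected


def extent_matches(min_coord, max_coord, n_pixels, resolution):
    """True if (max - min) equals n_pixels * resolution along one axis."""
    return math.isclose(max_coord - min_coord, n_pixels * resolution,
                        rel_tol=1e-9)


# A 360-row by 720-column, single-band, 16-bit (2-byte) grid
print(expected_size_bytes(360, 720, 1, 2))       # 518400
# 720 pixels at 0.5 degree resolution should span 360 degrees of longitude
print(extent_matches(-180.0, 180.0, 720, 0.5))   # True
```

A mismatch in file size usually points to an undocumented header, a wrong data type, or a wrong band count, so it is worth resolving before any other check is run.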

Checklist for Image Vector and Raster Data File Content:

  • Data Values
    • Does the data range match that given in the documentation?
    • Is the data read from upper left to lower right, from lower left to upper right, or in some other order?
  • Temporal resolution
    • Is the data provided for the time frame specified in the documentation?
    • Are the temporal units correct?
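
The data range check can be sketched in the same spirit. The example below is a minimal illustration in Python, assuming the pixel values have already been read into a sequence and that the documentation states a minimum, a maximum, and a missing data value. The function name `within_documented_range` and the sample values are hypothetical.

```python
def within_documented_range(values, documented_min, documented_max,
                            missing_value=None):
    """True if every non-missing value lies inside the documented range."""
    # Exclude the missing data flag before comparing against the range
    valid = [v for v in values if v != missing_value]
    if not valid:
        return False  # nothing but missing values: flag for manual review
    return documented_min <= min(valid) and max(valid) <= documented_max


sample = [0.1, 0.5, -9999, 0.9]  # hypothetical pixel values
print(within_documented_range(sample, 0.0, 1.0, missing_value=-9999))  # True
print(within_documented_range(sample, 0.0, 1.0))                       # False
```

The second call fails because the missing data value was not excluded, which is a common reason for an apparent mismatch between a file and its documented range.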