DATA STRUCTURE AND FORMATS

System files consist of data files and accompanying metadata files, which contain necessary coded information to properly access, interpret, display, or analyze the data files. The GED file formats are compatible with the Idrisi Geographic Information System (GIS). Structure and format requirements of other systems will vary, necessitating conversion of both data and metadata files. Documentation here (or in the Idrisi for Windows User's Guide) describes the structure conventions used in the GED and the specific file formats.

Global Coordinates

Raster Grids

Nested Grids and Cell Registration

Raster Metadata Files

Vector Data Files

Vector Metadata Files

Attribute and Values Files

Attribute and Values Metadata Files

Color Palette Files

Reference System Parameter Files

Numerical Types

Database Structure and Formats

GLOBAL COORDINATES

The global directory called GLGEO in the database consists of a collection of raster (cell) and vector (point, line, or polygon) datasets (with file extensions ".IMG" and ".VEC," respectively), each of which represents a global digital map (spatial distribution). All data are provided in a Cartesian orthonormal geodetic (latitude/longitude) mapping, also called "Plate Carrée" (and sometimes "geographic" or "unprojected").

The global grids and vector datasets in this database all have a common origin and geographic window. The western and eastern edges of the grid and vector datasets fall on the International Date Line, or 180 degrees East/West longitude (+/- 180 degrees). The Greenwich (Prime) meridian thus bisects the window in the east-west direction. Similarly, the northern and southern edges of the grid and vector datasets are located at the poles, or 90 degrees North latitude (+90) and 90 degrees South latitude (-90); and the equator bisects the map in the north-south direction. The boundaries for this global window are thus:

Maximum longitude (X) : +180 degrees (East)
Minimum longitude (X) : -180 degrees (West)
Maximum latitude (Y)  :  +90 degrees (North)
Minimum latitude (Y)  :  -90 degrees (South)

Such information appears in the header (.DOC and .DVC) files stored in the META directory for use by the GIS software (described below). Note that the geographic limits indicated in these header files refer to the bounding rectangle of the image/map, not to the centroids of raster cells (as required in some GIS window specifications, e.g., GRASS), nor, in the case of vector data, to the actual geographic limits of the data.

By convention, the software provided with the database counts row and column coordinates starting with zero (0). Using this convention, the corner pixels of a ten minute dataset (such as the Global Vegetation Index) are:

                   (column,row)
North West corner: (0,0)
North East corner: (2159,0)
South West corner: (0,1079)
South East corner: (2159,1079)

Similarly, the corner pixels in a one degree grid are:

                   (column,row)
North West corner: (0,0)
North East corner: (359,0)
South West corner: (0,179)
South East corner: (359,179)
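The corner indices above follow directly from the global window and the cell size. The following is a minimal Python sketch of that arithmetic, not part of the software provided with the database. The function name and the choice to express cell size in arc-seconds are assumptions made for illustration, as is the handling of points that lie exactly on the eastern or southern edge.

```python
import math

def cell_index(lon, lat, cell_arcsec):
    """Return the zero-based (column, row) of the cell containing (lon, lat).

    Assumes the global window described above: X from -180 to +180 degrees,
    Y from -90 to +90 degrees, with row 0 along the northern edge.
    cell_arcsec is the cell size in arc-seconds (600 for a 10-minute grid,
    3600 for a 1-degree grid).
    """
    ncols = 360 * 3600 // cell_arcsec
    nrows = 180 * 3600 // cell_arcsec
    col = math.floor((lon + 180.0) * 3600 / cell_arcsec)
    row = math.floor((90.0 - lat) * 3600 / cell_arcsec)
    # Points exactly on the eastern or southern edge are assigned to the
    # last column or row (an assumption of this sketch).
    col = min(max(col, 0), ncols - 1)
    row = min(max(row, 0), nrows - 1)
    return col, row

print(cell_index(-180.0, 90.0, 600))   # north-west corner of a 10-minute grid
print(cell_index(180.0, -90.0, 3600))  # south-east corner of a 1-degree grid
```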

Raster Grid (image/map) Data Files

Raster grid data are provided in a basic "image" structure, with sequential values in the file corresponding to a row-wise structure, beginning with the cell at the upper-left (North-West) corner of the geographic area. In the case of data represented as Cartesian (latitude/longitude) grids in the GLGEO directory, this corresponds to the cell with its north-west corner on the International Date Line, at the North pole. Rows are filled with sequential values from the data file up to the maximum row size (specified in the metadata file, described below), repeating this pattern for each row from top (North) to bottom (South). An artifact of this projection is that the extreme North and South rows are artificially expanded longitudinally. In reality, they represent pie-shaped cells in a circular area around the pole with a radius of one pixel height. At the equator, the latitude/longitude projection results in cells that are more nearly rectangular in their true ground distance. Such uneven representation across the globe is a strong argument for using other tessellations (spatial sampling schemes); to date, however, most global data have been produced in latitude/longitude coordinates. In future versions of the database, other projections may be employed, beginning with the polar regions.

The number of values in a raster data file equals the number of rows multiplied by the number of columns. All raster data are stored as binary files, as one- or two-byte integers or as 4-byte IEEE floating point values. The byte count for each file can be calculated by multiplying the number of values (rows x columns) by the number of bytes per value (1, 2, or 4). The sequencing of values in the "image" is shown in the following diagram (for a five-column image):

`0`	`1`	`2`	`3`	`4`
`5`	`6`	`7`	`8`	`9`
`10`	`11`	`12`	`13`	`14`
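The byte-count rule and the row-wise sequencing can be expressed in a few lines. The sketch below is illustrative Python only; the function names are hypothetical, and the type names "byte", "integer", and "real" are taken from the "data type" entry of the raster metadata files.

```python
# Bytes per stored value for each data type named in the metadata files.
BYTES_PER_VALUE = {"byte": 1, "integer": 2, "real": 4}

def file_size(rows, columns, data_type):
    """Expected size in bytes of a raster data file."""
    return rows * columns * BYTES_PER_VALUE[data_type]

def byte_offset(column, row, columns, data_type):
    """Offset in bytes of the cell at zero-based (column, row).

    Values are stored row by row, starting at the north-west corner.
    """
    return (row * columns + column) * BYTES_PER_VALUE[data_type]

# A 10-minute global grid stored as bytes: 1080 rows x 2160 columns.
print(file_size(1080, 2160, "byte"))
# In the five-column diagram, the cell at column 2, row 1 holds value 7.
print(byte_offset(2, 1, 5, "byte"))
```

Because the data files carry no header or end-of-file mark, comparing the actual file size with this expected size is a simple check that the row, column, and data type entries have been read correctly.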

Nested Grids and Cell Registration

A convention of commonly registered "nested" grids has been adopted whereby various grid sizes may be used, all of which are integer multiples or divisions of each other. The possible grid cell sizes in this scheme are 2 degrees, 1 degree (60 minutes), 1/2 degree (30 minutes), 1/6 degree (10 minutes), 1/12 degree (5 minutes), 1/60 degree (1 minute), and 1/120 degree (30 seconds). This system captures most commonly available grids and allows intercomparison of datasets at any nested scale without altering the numerical or classed values of the data or introducing geographic aliasing effects from "nearest neighbor" resampling. The analyst must understand, of course, that the comparison is between data of different scales, and that special considerations must be made in interpretation. Nevertheless, statistical comparisons between, for example, soil or vegetation classes at 0.5 degree and similar classifications at 1 degree are useful for quality assessment and for developing derivation or integration methods before the data are used in analysis or modeling. Similarly, comparisons between classed data and finer-grid satellite data can be extremely useful for validation and quality control/assessment.

Data on other "non-nested" grids, as discussed earlier, present a special problem because they require re-sampling before they can be geographically compared to data on different grids. While there are many methods for this, they all introduce statistical issues that would best be evaluated by the original investigators, after careful analysis. Examples of datasets which presented integration problems are provided. For example, methods may vary for comparing data on a 2-minute grid with data on a 5-minute grid; for comparing data centered on 1/2 degree latitude and longitude meridians with similar thematic data, identically scaled but edge-matched on 1/2 degree meridians; or for comparing data in rectangular 2.5 x 2 degree cells with regular grids at 1 degree spacing.

Different assumptions must be made in the process depending on the nature of the data. The necessary re-sampling for such comparisons will always involve greater uncertainty than re-producing the differently scaled or registered grid from original data.

It is clear that the problem of resampling for comparability is best solved by working with the principal investigators to re-produce the data on one of the recommended grids. This would ensure the ability to carefully document and defend resampling and interpolation methods. Nevertheless, this is not always practical, and user feedback may be important for establishing priorities. Aggregating these data to coarser scales may reduce concern about such constraints in a study that requires only coarser information and has developed adequate aggregation methods, but the need for quality assessment still requires intercomparison with other data at the full sampling resolution. This is a central issue in establishing "nested grids," and it is important to know whether the proposed grid convention can adequately deal with the problem. These issues may encourage data developers to support the nested grid structure, to facilitate intercomparison for quality assessment and validation purposes.

In cases where an original dataset does not extend to full global coverage, or has missing values, its grid was filled or padded with appropriate "no data" values to complete the global grid. Padding to a larger window than the original dataset is theoretically unnecessary for software systems that have full georeferencing (e.g., GRASS and Idrisi, although Idrisi 4.0 does not yet fully utilize its georeference information). Nevertheless, padding allows the specification of more explicit "no data" flags, preventing ambiguity and simplifying overlay in non-georeferenced systems.

When re-gridding from fractional grid dimensions was required to represent the data within the nested-grid convention, the decision to aggregate or expand to one of the common grids was determined by knowledge of spatial variability. Simple area-weighted averaging (using the percentage of overlap of output cells with the originals) was used in some cases, if the spatial variability was known to be on the order of a pixel. This method is spatially unbiased but has the effect of "smoothing" the data. For example, the NGDC Monthly Generalized GVI was produced by aggregating in this way to 10-minute cells from approximately 8.6-minute cells of the source "plate carrée" grids. An alternative method, given similar spatial uncertainty, is to re-sample by a "nearest-neighbor" method, which has the advantage of preserving the original data values, but at the cost of some aliasing and spatial distortion (thus implying that spatial accuracy is limited by the cell size). In other cases, where the statistical validity of such averaging cannot be assessed, a more conservative approach may be taken of re-sampling the data to a finer grid in the nested structure, to preserve the original variability or actual data values within the new grid. This allows intercomparison of the original and interpolated values with other data in the nested-grid structure, and may be a reversible process; the cost in this case is storage space for the replicated data. In all other cases, where no geo-referencing problems existed, the grid representation in the database reflects the original geographic sampling.

The cell size in a given data file is determined by the global window and the number of rows and columns (not the "resolution," which refers to the resolution of the original data sampling, not necessarily the current grid). Since cell sizes can vary between raster data layers according to the nested-grid convention (even multiples or divisions of cell size), all overlay and intercomparison operations must expand or contract the maps accordingly. At present, there are few systems which do this automatically (e.g., GRASS). In all other cases, the user must expand or contract the maps in a separate operation prior to overlay or intercomparison (see implementation section for specific operations in Idrisi and GRASS). The nested-grid convention allows this to be accomplished by simple pixel replication, without interpolation. While allowing direct intercomparison, this method does not change the original spatial definition of the data, which must always be taken into consideration. (Decimation -- i.e., choosing every nth pixel -- or aggregation -- e.g., averaging or reclassifying -- to coarser grids must be done more carefully to ensure valid geographic and numerical definitions.)
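Pixel replication between nested grids can be sketched as follows. This is an illustrative Python example rather than the operation as implemented in Idrisi or GRASS; the function name is hypothetical, and the grid is represented as a plain list of rows.

```python
def replicate(grid, factor):
    """Expand a grid (a list of rows) by an integer factor.

    Each cell becomes a factor x factor block holding the same value, so
    no new values are introduced and the original data are unchanged.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    out = []
    for row in grid:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out

# Expanding a 1-degree grid to the 30-minute grid uses a factor of 2.
coarse = [[1, 2],
          [3, 4]]
for row in replicate(coarse, 2):
    print(row)
```

The replicated grid can be overlaid cell for cell on a finer grid, but each block still represents a single original sample.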

Raster Metadata Files

Each data file is accompanied by an ASCII metadata file containing the header information the software needs to interpret the corresponding data file correctly. The following example shows a metadata file, with an explanation to the right of each entry.

SAMPLE.DOC

file title : Sample data file header	Name of the spatial data element
data type : byte	Byte, integer, or real
file type : binary	File storage method (ASCII or binary)
columns : 2160	Number of vertical columns in grid
rows : 1080	Number of horizontal rows in grid
ref. system : lat/long	Georeference system (projection)
ref. units : deg	Georeference basic unit of measure
unit dist. : 1.0000000	Georeference unit multiplier
min. X : -180.0000000	West-most geographic coordinate
max. X : 180.0000000	East-most geographic coordinate
min. Y : -90.0000000	South-most geographic coordinate
max. Y : 90.0000000	North-most geographic coordinate
pos'n error : 0.1666667	Spatial location uncertainty
resolution : 0.1666667	Sampling interval of ORIGINAL data
min. value : 0	Minimum data value in the dataset
max. value : 32	Maximum data value in the dataset
value units : scalars or classes	Data values units of measurement
value error : estimate	Estimated error of data values
flag value : 255	Special value to flag non-data
flag def'n : no data	Definition of flag value
legend cats : 3	Number of legend categories
category 0 : Water	Legend Category 0
category 1 : Podzol Soils	Legend Category 1
category 2 : Brown Podsolic Soils	etc.
Lineage : see documentation	Lineage line (as many as needed)
Comment : notes as needed	Comment line (as many as needed)
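A header of this kind can be read with a short parser. The Python sketch below is not the reader used by Idrisi. It assumes that each line of an actual .DOC file holds only an entry name and a value separated by a colon (the explanations shown above being documentation only), and its function names are hypothetical. The header text in the example uses the global window limits given under Global Coordinates.

```python
def parse_doc(text):
    """Parse 'name : value' header lines into a list of (name, value) pairs.

    A list is used rather than a dictionary because entries such as
    'comment' and 'lineage' may occur more than once.
    """
    pairs = []
    for line in text.splitlines():
        if ":" not in line:
            continue
        name, value = line.split(":", 1)
        pairs.append((name.strip().lower(), value.strip()))
    return pairs

def cell_size(header):
    """Cell width and height in reference units, from window and grid size."""
    fields = dict(header)  # repeated names keep only their last value here
    width = (float(fields["max. x"]) - float(fields["min. x"])) / int(fields["columns"])
    height = (float(fields["max. y"]) - float(fields["min. y"])) / int(fields["rows"])
    return width, height

sample = (
    "columns : 2160\r\n"
    "rows : 1080\r\n"
    "min. X : -180.0000000\r\n"
    "max. X : 180.0000000\r\n"
    "min. Y : -90.0000000\r\n"
    "max. Y : 90.0000000\r\n"
)
print(cell_size(parse_doc(sample)))  # about 0.1667 degrees, i.e. 10 minutes
```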

Vector Data Files

Vector data are provided in either ASCII or binary file format. The ASCII format is:

{ID#} {#Points}
xxx.xxxxxx yy.yyyyyy
xxx.xxxxxx yy.yyyyyy
{ID#} {#Points}
xxx.xxxxxx yy.yyyyyy
xxx.xxxxxx yy.yyyyyy
[etc. ]
0 0

The first line of each feature is a label line for a point, line, or polygon. In the label line, the first number (ID#) is a feature identification number, and the second number (#Points) is the number of coordinate pairs that follow to define the feature. Each line following the label line holds one coordinate pair of the feature. The first number (xxx.xxxxxx) is the "X" coordinate (longitude) and the second (yy.yyyyyy) is the "Y" coordinate (latitude). Each feature (point, line, or polygon) begins with a new label line, followed by as many coordinate pairs as define the feature. The last line of each vector file consists of two zeros, separated by one or more spaces; this terminates the listing of features. Coordinates are in the same units as the bounding rectangle specified in the vector metadata (.DVC) file, with positive values indicating East (X) and North (Y), and negative values indicating West (X) and South (Y).

In this format, point data files have only one coordinate pair defining each feature (point). Line data files can hold up to 750 coordinate pairs for each feature (this limit may have been increased; check your Idrisi manual) and an unlimited number of features. Polygon data files are similar to line data files except that each feature is closed (i.e., the starting point and ending point are identical). The units and reference system for coordinate values used in the vector files are defined in their corresponding metadata files.
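The ASCII layout described above can be read as a stream of whitespace-separated numbers. The following Python sketch is one way to do this under that assumption; the function name is hypothetical and no validation of the 750-pair limit or of polygon closure is attempted.

```python
def parse_vector(text):
    """Parse ASCII vector text into a list of (feature_id, points).

    points is a list of (x, y) tuples: longitude first, latitude second.
    Reading stops at the terminating '0 0' line.
    """
    tokens = text.split()
    features = []
    i = 0
    while i + 1 < len(tokens):
        feature_id = int(tokens[i])
        count = int(tokens[i + 1])
        i += 2
        if feature_id == 0 and count == 0:
            break  # terminator line: end of the feature list
        points = []
        for _ in range(count):
            points.append((float(tokens[i]), float(tokens[i + 1])))
            i += 2
        features.append((feature_id, points))
    return features

sample = "1 2\n-105.500000 40.000000\n-105.000000 40.500000\n2 1\n10.000000 -20.000000\n0 0\n"
for feature_id, points in parse_vector(sample):
    print(feature_id, points)
```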

Vector Metadata Files

file title : Sample vector file	Name of the spatial data element
id type : integer	Data type for feature labels (integer)
file type : binary	File storage method
object type : line	Feature (point, line or polygon)
ref. system : lat/long	Georeference system (projection)
ref. units : deg	Georeference basic unit of measure
unit dist. : 0.0166667	Georeference unit multiplier
min. X : -10800.0000000	Minimum x coordinate of study area
max. X : 10800.0000000	Maximum x coordinate of study area
min. Y : -5400.0000000	Minimum y coordinate of study area
max. Y : 5400.0000000	Maximum y coordinate of study area
pos'n error : unknown	Spatial uncertainty
resolution : 0.01666667	Mean point spacing (sample resolution)
comment : notes as needed	Comment line (as many as needed)

SPECIAL NOTES:

(1) The bounding rectangle corresponds to the global window as described above (not the actual range of values in the file), in the spatial units specified.

(2) The vector data and metadata files are compatible with the Idrisi version 4.0 GIS and Idrisi for Windows 2.0, which are fully georeferenced.

Attribute and Values Files

Attribute files contain tabular data used for re-coding an existing spatial data element or attaching additional labels to coded features. They are used most commonly with vector datasets that define a set of boundary features for which there are numerous kinds of data (e.g., political boundaries with varying statistics for each polygon, or streams with varying measurements for each coded portion of the stream). Such data would be described as having only one spatial "distribution" but many attributes or data "variables". There is no technical reason that attribute files cannot also be attached to raster data files, where each data value has a unique translation in the attribute table. Attribute files, whether attached to vector or raster data elements of a dataset, are stored either as Attribute Database Table files (in a Database Management System format, or "DBMS") or as Attribute Values files. Database files, which can be used by Idrisi's Database Workshop in Idrisi for Windows 2.0, are stored in Microsoft Access 2.0 format. The Attribute Values files may have two or more columns with the following structure:

1 570 Morrison etc...
2 860 Dakota Ridge
3 120 Blue Mountain
etc.

These "flat files" may contain variable width columns of attribute data up to a total record length of 255 characters. Each attribute column may contain integer values, real (floating point) values, or character strings.
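A flat file of this kind might be read as shown in the Python sketch below. It assumes that columns are separated by whitespace and that only the last column may itself contain spaces (as in the site names above); neither assumption is stated in the format description, and files with fixed-width columns would need a different approach. The function name is hypothetical, and the number of attribute fields would normally come from the accompanying metadata file.

```python
def parse_values(text, fields):
    """Split each record into an identifier plus `fields` attribute columns.

    All columns are returned as strings; converting them to integer or
    real values is left to the caller, who knows each field's data type.
    """
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # At most `fields` splits, so the last column keeps its spaces.
        records.append(line.split(None, fields))
    return records

sample = "1 570 Morrison\r\n2 860 Dakota Ridge\r\n3 120 Blue Mountain\r\n"
for record in parse_values(sample, 2):
    print(record)
```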

Note: Attributes are stored as Attribute Database Tables in MS Access 2.0 format (using a .MDB file extension); however, each table is also provided in flat ASCII format in an identically named .TXT file. The .DVL metadata files (described below) refer to the Database files (.MDBs).

Attribute and Values Metadata Files

Metadata files for attributes (either for Attribute Values files or Attribute Database Tables) use a .DVL file extension. The following is an example .DVL file for a 2-column attribute values file:

file title : Example Attribute Doc.	Name of attribute data element
file type : ascii	File storage method (ASCII only)
records : 3	Number of values or codes in spatial file
fields : 2	Number of attribute fields
field 0 : identifiers	Feature ID field (matches spatial file)
data type : integer	Type of field (integer, real, string)
format : 0	Field format code (unused)
field 1 : biomass	Field definition
data type : integer	Type of field (integer, real, string)
format : 0	Field format code (unused)
min. value : 100	Minimum value in column (null=0)
max. value : 2532	Maximum value in column (null=0)
value units : gms./sq.m.	Units for field values
value error : unknown	Class or value uncertainty
flag value : 255	Special flags in field data
flag def'n : no data	Flag definition
field 2 : site names	Field definition
data type : string	Type of field (integer, real, string)
format : 0	Field format code (unused)
min. value : 0	Minimum value in column (null=0)
max. value : 0	Maximum value in column (null=0)
value units : classes	Units for field values
value error : unknown	Class or value uncertainty
flag value : none	Special flags in field data
flag def'n : none	Flag definition
[add'l fields up to 20 ]	[repeat previous 9 lines for each field
[ " ]	[up to a record length of 255 characters]
legend cats : 3	Number of legend categories
category 0 : class def. zero	Category 0 (alternative legend)
category 1 : class def. one	Category 1 (alternative legend)
category 2 : class def. two	Category 2 (alternative legend)
comment : notes as needed	Comment line (as many as needed)

Color Palette Files

Color palette files (.PAL and .SMP) contain color code assignment metadata which can be used when displaying an image or vector file. The .SMP files are in binary format and are used by Idrisi's Display module. For the user's convenience, a similar .PAL file in ASCII format is included. Red, green, and blue color intensities are specified by these files according to numerical code, to a precision of 1 in 64.

0        0       0       0      {solid black}
1       23      43      63      
2       63      63      63      {full white}
etc.
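Many display libraries expect 8-bit color intensities (0 to 255). The Python sketch below rescales the palette intensities, assuming they run from 0 to 63 as the example rows suggest (63 in all three channels being full white). The function names are hypothetical, and any text after the fourth number on a line is ignored.

```python
def to_8bit(level):
    """Rescale a palette intensity from the 0-63 range to 0-255."""
    if not 0 <= level <= 63:
        raise ValueError("palette intensity must be between 0 and 63")
    return round(level * 255 / 63)

def parse_palette_line(line):
    """Return (color index, (red, green, blue)) with 8-bit intensities."""
    index, red, green, blue = (int(token) for token in line.split()[:4])
    return index, (to_8bit(red), to_8bit(green), to_8bit(blue))

print(parse_palette_line("0        0       0       0"))
print(parse_palette_line("2       63      63      63"))
```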

Reference System Parameter Files

These files (.REF) record metadata about the specific geographic reference system used for the dataset. They include data on its projection, ellipsoid, datum, and numbering conventions. These files can be used by the Idrisi PROJECT module.

Numerical Types

Raster data files are stored using either byte binary, integer binary, or real (floating point) binary numbers. Byte data are coded in 8 binary bits, thus allowing numbers from 0 to 255 (decimal) to be stored. Integer data are coded as two 8-bit bytes (16 bits), in "little-endian" format, i.e., with the most significant byte on the right. Since one bit is used for the algebraic sign, numbers from -32,768 to +32,767 (decimal) can be stored as integer data. The order of the two bytes used for integer data conforms to conventions used by IBM (DOS) and DEC. Some systems (e.g., GRASS/UNIX) require data in "big-endian" format, with the most significant byte on the left (i.e., the order of the bytes must be reversed, or "byte-swapped," to be correctly interpreted), and representation of negative numbers may also vary between systems (see SYSTEM COMPATIBILITY). Binary real data types (floating point), because of compatibility issues, are used in the database only where it is impossible to represent the full range of values in the original dataset with integers. An IEEE standard 4-byte real data type is used (this may not be readable on all machines, but is common in DOS). The numerical type of each data file is indicated in its associated metadata file, which is stored in ASCII format. Vector data files (and their associated metadata files) are all stored in ASCII format.
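The byte order of two-byte integer data can be handled explicitly when reading a file. The Python sketch below uses the standard struct module and is illustrative only: the function names are hypothetical, and the input is assumed to be the raw bytes of a two-byte integer data file (or part of one).

```python
import struct

def decode_int16(raw, byteorder="<"):
    """Decode signed 16-bit integers from raw bytes.

    byteorder is "<" for little-endian (as stored in the database)
    or ">" for big-endian.
    """
    count = len(raw) // 2
    return list(struct.unpack(f"{byteorder}{count}h", raw[:count * 2]))

def byte_swap16(raw):
    """Swap the two bytes of every 16-bit value.

    This converts little-endian data to big-endian, or back again.
    """
    if len(raw) % 2:
        raise ValueError("expected an even number of bytes")
    swapped = bytearray(len(raw))
    swapped[0::2] = raw[1::2]
    swapped[1::2] = raw[0::2]
    return bytes(swapped)

little = b"\x01\x00\xff\xff\x00\x80"   # 1, -1, -32768 in little-endian order
print(decode_int16(little))
print(decode_int16(byte_swap16(little), ">"))
```

Naming the byte order in the read call, rather than relying on the machine's native order, gives the same values on any system.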

The ASCII format on the Version 1 CD-ROMs uses the standard DOS convention for terminating a line with two bytes, a carriage return (hexadecimal 0D) followed by a line feed (hexadecimal 0A). In DOS, ASCII files are commonly ended with a DOS end-of-file mark (hexadecimal 1A), but this EOF mark is not required by most software, including that provided with the online version, and is therefore omitted to avoid incompatibilities with other systems. The binary data files (.IMG) also do not have an EOF mark. File size is thus an exact byte count of the data.
