Database Structure and Formats
The global grids and vector datasets in this database all have a common origin and geographic window. The western and eastern edges of the grid and vector datasets fall on the International Date Line, or 180 degrees East/West longitude (+/- 180 degrees). The Greenwich (Prime) meridian thus bisects the window in the east-west direction. Similarly, the northern and southern edges of the grid and vector datasets are located at the poles, or 90 degrees North latitude (+90) and 90 degrees South latitude (-90); and the equator bisects the map in the north-south direction. The boundaries for this global window are thus:
Maximum longitude (X) : +180 degrees (East)
Minimum longitude (X) : -180 degrees (West)
Maximum latitude (Y) : +90 degrees (North)
Minimum latitude (Y) : -90 degrees (South)
Such information appears in the header (.DOC and .DVC) files stored in the META directory for use by the GIS software (described below). Note that the geographic limits indicated in these header files refer to the bounding rectangle of the image/map, not to the centroids of raster cells (as required in some GIS window specifications, e.g., GRASS); nor, in the case of vector data, do they refer to the actual geographic limits of the data.
By convention, the software provided with the database counts row and column coordinates starting with zero (0). Using this convention, the corner pixels of a ten minute dataset (such as the Global Vegetation Index) are:
(column,row)
North West corner: (0,0)
North East corner: (2159,0)
South West corner: (0,1079)
South East corner: (2159,1079)
Similarly, the corner pixels in a one degree grid are:
(column,row)
North West corner: (0,0)
North East corner: (359,0)
South West corner: (0,179)
South East corner: (359,179)
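The zero-based convention can be illustrated with a short Python sketch. The function names are hypothetical and are not part of the software supplied with the database; the cell lookup assumes the global window given earlier, with row 0 at the northern edge as the corner listings imply.

```python
def corner_pixels(columns, rows):
    """Return the zero-based (column, row) indices of the four corner pixels."""
    return {
        "NW": (0, 0),
        "NE": (columns - 1, 0),
        "SW": (0, rows - 1),
        "SE": (columns - 1, rows - 1),
    }


def cell_of(lon, lat, columns, rows):
    """Return the (column, row) of the cell containing a lon/lat point.

    Assumes the global window (-180..+180, -90..+90) with row 0 at the
    northern edge. Points on the eastern or southern edge of the window
    are clamped into the last column or row.
    """
    cell_width = 360.0 / columns
    cell_height = 180.0 / rows
    col = min(int((lon + 180.0) / cell_width), columns - 1)
    row = min(int((90.0 - lat) / cell_height), rows - 1)
    return col, row


print(corner_pixels(2160, 1080))    # ten minute grid
print(cell_of(0.0, 0.0, 360, 180))  # one degree grid
```

A point that lies exactly on a cell boundary is assigned here to the cell on its east or south side; other software may resolve such ties differently.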
The number of values in a raster data file equals the number of rows multiplied by the number of columns. All raster data are stored as binary files, as one- or two-byte integers or four-byte IEEE floating-point values. The byte count for each file can be calculated by multiplying the number of values (rows x columns) by the number of bytes per value (1, 2, or 4). The sequencing of values in the "image" is shown in the following diagram (for a five column image):
[Diagram not reproduced: sequence of values in a five column image.]
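The byte-count rule can be checked with a few lines of Python. This is a minimal sketch: the function name is hypothetical, and the mapping of the header data types "byte", "integer", and "real" to 1, 2, and 4 bytes is an assumption based on the storage sizes listed above.

```python
# Assumed mapping from header "data type" names to bytes per value.
BYTES_PER_VALUE = {"byte": 1, "integer": 2, "real": 4}


def expected_file_size(columns, rows, data_type):
    """Return the expected size in bytes of a raster data file."""
    return columns * rows * BYTES_PER_VALUE[data_type]


# A ten minute global grid stored as one byte per value.
print(expected_file_size(2160, 1080, "byte"))
# A one degree global grid stored as 4-byte reals.
print(expected_file_size(360, 180, "real"))
```

Comparing this figure with the actual size of a file on disk is a quick way to confirm that the stated rows, columns, and data type agree with the file.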
Data on other "non-nested" grids, as discussed earlier, present a special problem because they require re-sampling before they can be geographically compared to data on different grids. While there are many methods for this, they all introduce statistical issues that would best be evaluated by original investigators, after careful analysis. Examples of datasets which presented integration problems are provided. For example, methods may vary for comparing data on a 2-minute grid with data on a 5-minute grid; for comparing data centered on 1/2 degree latitude and longitude meridians with similar thematic data, identically scaled but edge-matched on 1/2 degree meridians; or for comparing data in rectangular 2.5 x 2 degree cells with regular grids at 1 degree spacing.
Different assumptions must be made in the process depending on the nature of the data. The necessary re-sampling for such comparisons will always involve greater uncertainty than re-producing the differently scaled or registered grid from original data.
It is clear that the problem of resampling for comparability is best solved by working with the principal investigators to re-produce the data on one of the recommended grids. This would ensure the ability to carefully document and defend resampling and interpolation methods. Nevertheless, this is not always practical, and user feedback may be important for establishing priorities. Aggregating these data to coarser scales may reduce concern about such constraints in a study that requires only coarser information and has developed adequate aggregation methods, but the need for quality assessment still requires intercomparison with other data at the full sampling resolution. This is a central issue in establishing "nested-grids," and it is important to know if the proposed grid convention can adequately deal with the problem. These issues may encourage data developers to support the nested grid structure, to facilitate intercomparison for quality assessment and validation purposes.
In cases where an original dataset does not extend to full global coverage, or has missing values, its grid was filled or padded with appropriate "no data" values to complete the global grid. Padding to a larger window than the original dataset is theoretically unnecessary for software systems that have full georeferencing (e.g., GRASS and Idrisi, although Idrisi 4.0 does not yet fully utilize its georeference information). Nevertheless, padding allows the specification of more explicit "no data" flags, preventing ambiguity and simplifying overlay in non-georeferenced systems.
When re-gridding from fractional grid dimensions was required to represent the data within the nested-grid convention, the decision to aggregate or expand to one of the common grids was determined by knowledge of spatial variability. Simple area-weighted averaging (using the percentage of overlap of output cells with the originals) was used in some cases, if the spatial variability was known to be on the order of a pixel. This method is spatially un-biased but has the effect of "smoothing" the data. For example, the NGDC Monthly Generalized GVI was produced by aggregating in this way to 10-minute cells from approximately 8.6-minute cells of the source "plate carrée" grids. An alternative method, given similar spatial uncertainty, is to re-sample by a "nearest-neighbor" method, which has the advantage of preserving the original data values, but at the cost of some aliasing and spatial distortion (thus implying that spatial accuracy is limited by the cell size). In other cases, where the statistical validity of such averaging cannot be assessed, a more conservative approach may be taken of re-sampling the data to a finer grid in the nested structure, to preserve the original variability or actual data values within the new grid. This allows intercomparison of the original and interpolated values with other data in the nested-grid structure, and may be a reversible process; the cost in this case is storage space for the replicated data. In all other cases, where no geo-referencing problems existed, the grid representation in the database reflects the original geographic sampling.
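The idea behind area-weighted averaging can be sketched in one dimension. The Python below is only an illustration under simplifying assumptions (equal-width source and output cells covering the same extent, no "no data" handling); it is not the procedure actually used to produce any dataset in the database, and the function name is hypothetical.

```python
def area_weighted_regrid_1d(values, n_out):
    """Average equal-width source cells onto n_out equal-width output cells.

    Both grids are assumed to span the same extent. Each output value is
    the mean of the overlapping source values, weighted by the fraction
    of each source cell that falls inside the output cell.
    """
    n_in = len(values)
    out = []
    for j in range(n_out):
        # Edges of output cell j, expressed in source-cell units.
        lo = j * n_in / n_out
        hi = (j + 1) * n_in / n_out
        total = 0.0
        weight = 0.0
        i = int(lo)
        while i < n_in and i < hi:
            overlap = min(hi, i + 1) - max(lo, i)
            if overlap > 0:
                total += values[i] * overlap
                weight += overlap
            i += 1
        out.append(total / weight)
    return out


# Three source cells averaged onto two output cells.
print(area_weighted_regrid_1d([0, 6, 12], 2))
```

In the three-to-two example the middle source cell contributes half of its width to each output cell, which is the "smoothing" effect described above.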
The cell size in a given data file is determined by the global window and the number of rows and columns (not the "resolution," which refers to the resolution of the original data sampling, not necessarily the current grid). Since cell sizes can vary between raster data layers according to the nested-grid convention (even multiples or divisions of cell size), all overlay and intercomparison operations must expand or contract the maps accordingly. At present, there are few systems which do this automatically (e.g., GRASS). In all other cases, the user must expand or contract the maps in a separate operation prior to overlay or intercomparison (see implementation section for specific operations in Idrisi and GRASS). The nested-grid convention allows this to be accomplished by simple pixel replication, without interpolation. While allowing direct intercomparison, this method does not change the original spatial definition of the data, which must always be taken into consideration. (Decimation -- i.e., choosing every nth pixel -- or aggregation -- e.g., averaging or reclassifying -- to coarser grids must be done more carefully to ensure valid geographic and numerical definitions.)
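Expansion by pixel replication is simple enough to show directly. The following Python sketch uses plain nested lists and hypothetical function names; it assumes an integer expansion factor, as the nested-grid convention provides, and is not the Idrisi or GRASS implementation.

```python
def replicate(grid, factor):
    """Expand a grid (a list of rows) by an integer factor.

    Every cell becomes a factor x factor block holding the same value,
    so no new values are interpolated.
    """
    out = []
    for row in grid:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def decimate(grid, factor):
    """Keep every factor-th pixel in both directions."""
    return [row[::factor] for row in grid[::factor]]


coarse = [[1, 2],
          [3, 4]]
fine = replicate(coarse, 2)
for row in fine:
    print(row)
```

Decimating the replicated grid by the same factor returns the original values, which is why replication leaves the spatial definition of the data unchanged. Decimating a grid that was not produced by replication discards information and needs the extra care noted above.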
SAMPLE.DOC
file title : Sample data file header | Name of the spatial data element |
data type : byte | Byte, integer, or real |
file type : binary | File storage method (ASCII or binary) |
columns : 2160 | Number of vertical columns in grid |
rows : 1080 | Number of horizontal rows in grid |
ref. system : lat/long | Georeference system (projection) |
ref. units : deg | Georeference basic unit of measure |
unit dist. : 1.0000000 | Georeference unit multiplier |
min. X : -180.0000000 | West-most geographic coordinate |
max. X : 180.0000000 | East-most geographic coordinate |
min. Y : -90.0000000 | South-most geographic coordinate |
max. Y : 90.0000000 | North-most geographic coordinate |
pos'n error : 0.1666667 | Spatial location uncertainty |
resolution : 0.1666667 | Sampling interval of ORIGINAL data |
min. value : 0 | Minimum data value in the dataset |
max. value : 32 | Maximum data value in the dataset |
value units : scalars or classes | Data values units of measurement |
value error : estimate | Estimated error of data values |
flag value : 255 | Special value to flag non-data |
flag def'n : no data | Definition of flag value |
legend cats : 3 | Number of legend categories |
category 0 : Water | Legend Category 0 |
category 1 : Podzol Soils | Legend Category 1 |
category 2 : Brown Podsolic Soils | etc. |
Lineage : see documentation | Lineage line (as many as needed) |
Comment : notes as needed | Comment line (as many as needed) |
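A header of this kind can be read with a few lines of Python. This is a sketch under stated assumptions: each line of the actual file is taken to hold only "name : value" (the right-hand descriptions in the listing above being annotations), the name is split from the value at the first colon, and the function name is hypothetical. The sample values follow the global window described earlier.

```python
def parse_header(text):
    """Parse 'name : value' header lines into a list of (name, value) pairs.

    A list is returned rather than a dict because some names, such as
    comment or lineage lines, may occur more than once.
    """
    entries = []
    for line in text.splitlines():
        if ":" not in line:
            continue
        name, value = line.split(":", 1)
        entries.append((name.strip(), value.strip()))
    return entries


# Illustrative header text using DOS (CR/LF) line endings.
sample = "\r\n".join([
    "file title  : Sample data file header",
    "data type   : byte",
    "columns     : 2160",
    "rows        : 1080",
    "min. X      : -180.0000000",
    "max. X      : 180.0000000",
])

header = dict(parse_header(sample))
cell_width = (float(header["max. X"]) - float(header["min. X"])) / int(header["columns"])
print(header["data type"], cell_width)
```

The last lines show how the cell size follows from the window and the number of columns, independently of the "resolution" entry.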
{ID#} {#Points}
xxx.xxxxxx yy.yyyyyy
xxx.xxxxxx yy.yyyyyy
{ID#} {#Points}
xxx.xxxxxx yy.yyyyyy
xxx.xxxxxx yy.yyyyyy
[etc. ]
0 0
The first line is a point, segment, or polygon label. In the label line, the first number (ID#) is a feature identification number, and the second number (#Points) is the number of coordinate pairs that follow to define the feature. Each line following the label line holds one of the coordinate pairs that make up the feature. The first number (xxx.xxxxxx) is the "X" coordinate (longitude) and the second (yy.yyyyyy) is the "Y" coordinate (latitude). Each feature (point, line, or polygon) begins with a new label line, followed by its coordinate pairs for as many points as define the feature. The last line of each vector file consists of two zeros, separated by one or more spaces (this terminates the listing of features). Coordinates are in the same units as the bounding rectangle specified in the vector metadata (.DVC) file, with positive values indicating East (X) and North (Y), and negative values indicating West (X) and South (Y). In this format, point data files have only one coordinate pair defining each feature (point). Line data files can hold up to 750 coordinate pairs for each feature (this limit may have been increased; check your Idrisi manual) and an unlimited number of features. Polygon data files are similar to line data files except that each feature is closed (i.e., the starting point and ending point are identical). The units and reference system for coordinate values used in the vector files are defined in their corresponding metadata files.
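The listing structure can be read with a short Python routine. This is a sketch only: it assumes the file is available as whitespace-separated text laid out exactly as in the listing above, the function name is hypothetical, and the coordinates in the sample are invented for illustration.

```python
def parse_vector(text):
    """Parse a vector listing into a list of (feature_id, [(x, y), ...]).

    Reads label lines of the form 'ID #Points' followed by that many
    coordinate pairs, and stops at the terminating '0 0' line.
    """
    tokens = text.split()
    features = []
    pos = 0
    while pos + 1 < len(tokens):
        feature_id = int(tokens[pos])
        n_points = int(tokens[pos + 1])
        pos += 2
        if feature_id == 0 and n_points == 0:
            break  # terminator line
        coords = []
        for _ in range(n_points):
            coords.append((float(tokens[pos]), float(tokens[pos + 1])))
            pos += 2
        features.append((feature_id, coords))
    return features


sample = """1 2
-105.250000 40.000000
-105.000000 40.250000
2 1
10.500000 -20.750000
0 0
"""

for feature_id, coords in parse_vector(sample):
    print(feature_id, coords)
```

A polygon file could be checked in the same way by confirming that the first and last coordinate pairs of each feature are identical.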
file title : Sample vector file | Name of the spatial data element |
id type : integer | Data type for feature labels (integer) |
file type : binary | File storage method |
object type : line | Feature (point, line or polygon) |
ref. system : lat/long | Georeference system (projection) |
ref. units : deg | Georeference basic unit of measure |
unit dist. : 0.0166667 | Georeference unit multiplier |
min. X : -10800.0000000 | Minimum x coordinate of study area |
max. X : 10800.0000000 | Maximum x coordinate of study area |
min. Y : -5400.0000000 | Minimum y coordinate of study area |
max. Y : 5400.0000000 | Maximum y coordinate of study area |
pos'n error : unknown | Spatial uncertainty |
resolution : 0.01666667 | Mean point spacing (sample resolution) |
comment : notes as needed | Comment line (as many as needed) |
SPECIAL NOTES:
(1) The bounding rectangle corresponds to the global window as described above (not the actual range of values in the file), in the spatial units specified.
(2) The vector data and metadata files are compatible with the Idrisi version 4.0 GIS and Idrisi for Windows 2.0, which are fully georeferenced.
1 570 Morrison etc...
2 860 Dakota Ridge
3 120 Blue Mountain
etc.
These "flat files" may contain variable width columns of attribute data up to a total record length of 255 characters. Each attribute column may contain integer values, real (floating point) values, or character strings.
Note: Attributes are stored as Attribute Database Tables in MS Access 2.0 format (using a .MDB file extension); however, each table is also provided in flat ASCII format in an identically named .TXT file. The .DVL metadata files (described below) refer to the database files (.MDBs).
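A flat attribute file shaped like the example above can be read as follows. This Python sketch assumes a particular three-column layout (integer feature ID, integer value, then a name that may contain spaces) with whitespace between columns; real files may use other column layouts, and the function name is hypothetical.

```python
def parse_attributes(text):
    """Parse lines of 'ID value name' into (id, value, name) tuples.

    Only the first two columns are split on whitespace, so the name
    column may itself contain spaces. Lines with fewer than three
    columns are skipped.
    """
    records = []
    for line in text.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue
        records.append((int(parts[0]), int(parts[1]), parts[2].strip()))
    return records


sample = "1 570 Morrison\r\n2 860 Dakota Ridge\r\n3 120 Blue Mountain\r\n"
for record in parse_attributes(sample):
    print(record)
```

The feature ID in the first column is what links each record to the matching feature in the spatial file.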
file title : Example Attribute Doc. | Name of attribute data element |
file type : ascii | File storage method (ASCII only) |
records : 3 | Number of values or codes in spatial file |
fields : 2 | Number of attribute fields |
field 0 : identifiers | Feature ID field (matches spatial file) |
data type : integer | Type of field (integer, real, string) |
format : 0 | Field format code (unused) |
field 1 : biomass | Field definition |
data type : integer | Type of field (integer, real, string) |
format : 0 | Field format code (unused) |
min. value : 100 | Minimum value in column (null=0) |
max. value : 2532 | Maximum value in column (null=0) |
value units : gms./sq.m. | Units for field values |
value error : unknown | Class or value uncertainty |
flag value : 255 | Special flags in field data |
flag def'n : no data | Flag definition |
field 2 : site names | Field definition |
data type : string | Type of field (integer, real, string) |
format : 0 | Field format code (unused) |
min. value : 0 | Minimum value in column (null=0) |
max. value : 0 | Maximum value in column (null=0) |
value units : classes | Units for field values |
value error : unknown | Class or value uncertainty |
flag value : none | Special flags in field data |
flag def'n : none | Flag definition |
[add'l fields up to 20 ] | [repeat previous 9 lines for each field] |
[ " ] | [up to a record length of 255 characters] |
legend cats : 3 | Number of legend categories |
category 0 : class def. zero | Category 0 (alternative legend) |
category 1 : class def. one | Category 1 (alternative legend) |
category 2 : class def. two | Category 2 (alternative legend) |
comment : notes as needed | Comment line (as many as needed) |
0 0 0 0 {solid black}
1 23 43 63
2 63 63 63 {full white}
etc.
The ASCII format on the Version 1 CD-ROMs uses the standard DOS convention
for terminating a line with two bytes, a carriage return (hexadecimal 0D)
followed by a line feed (hexadecimal 0A). In DOS, ASCII files are commonly
ended with a DOS end-of-file mark (hexadecimal 1A), but this EOF mark is
not required by most software, including that provided with the online
version, and is thus not used to avoid incompatibilities with other systems.
The binary data files (.IMG) also do not have an EOF mark. File size is
thus an exact byte count of the data.