GED User's Guide
Dataset Selection

DATA STRUCTURE AND FORMATS

System files consist of data files and accompanying metadata files, which contain necessary coded information to properly access, interpret, display, or analyze the data files. The GED file formats are compatible with the Idrisi Geographic Information System (GIS). Structure and format requirements of other systems will vary, necessitating conversion of both data and metadata files. Documentation here (or in the Idrisi for Windows User's Guide) describes the structure conventions used in the GED and the specific file formats.
  • Global Coordinates
  • Raster Grids
  • Nested Grids and Cell Registration
  • Raster Metadata Files
  • Vector Data Files
  • Vector Metadata Files
  • Attribute and Values Files
  • Attribute and Values Metadata Files
  • Color Palette Files
  • Reference System Parameter Files
  • Numerical Types

  • Database Structure and Formats

    GLOBAL COORDINATES

    The global directory called GLGEO in the database consists of a collection of raster (cell) and vector (point, line, or polygon) datasets (with file extensions ".IMG" and ".VEC," respectively), each of which represents a global digital map (spatial distribution). All data are provided in a Cartesian orthonormal geodetic (latitude/longitude) mapping, also referred to as "Plate Carrée" (and sometimes as "geographic" or "unprojected").

    The global grids and vector datasets in this database all have a common origin and geographic window. The western and eastern edges of the grid and vector datasets fall on the International Date Line, or 180 degrees East/West longitude (+/- 180 degrees). The Greenwich (Prime) meridian thus bisects the window in the east-west direction. Similarly, the northern and southern edges of the grid and vector datasets are located at the poles, or 90 degrees North latitude (+90) and 90 degrees South latitude (-90); and the equator bisects the map in the north-south direction. The boundaries for this global window are thus:

    Maximum longitude (X) : +180 degrees (East)
    Minimum longitude (X) : -180 degrees (West)
    Maximum latitude (Y)  :  +90 degrees (North)
    Minimum latitude (Y)  :  -90 degrees (South)

    Such information appears in the header (.DOC and .DVC) files stored in the META directory for use by the GIS software (described below). Note that the geographic limits indicated in these header files refer to the bounding rectangle of the image/map, not to the centroids of the raster cells (as required in some GIS window specifications, e.g., GRASS); nor, in the case of vector data, do they refer to the actual geographic extent of the data.

    By convention, the software provided with the database counts row and column coordinates starting with zero (0). Using this convention, the corner pixels of a ten minute dataset (such as the Global Vegetation Index) are:

                       (column,row)
    North West corner: (0,0)
    North East corner: (2159,0)
    South West corner: (0,1079)
    South East corner: (2159,1079)

    Similarly, the corner pixels in a one degree grid are:

                       (column,row)
    North West corner: (0,0)
    North East corner: (359,0)
    South West corner: (0,179)
    South East corner: (359,179)
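
    The zero-based corner coordinates above follow directly from the global window and the cell size. The following is a minimal sketch of that arithmetic, not part of the software provided with the database; the function names and the choice of expressing the cell size in minutes are assumptions made for illustration.

```python
import math


def latlon_to_colrow(lon, lat, cell_minutes):
    """Return the zero-based (column, row) of the cell containing (lon, lat).

    Assumes the global window described above: -180..+180 degrees longitude,
    -90..+90 degrees latitude, with row 0 at the northern edge.
    """
    ncols = round(360 * 60 / cell_minutes)
    nrows = round(180 * 60 / cell_minutes)
    col = math.floor((lon + 180.0) * 60 / cell_minutes)
    row = math.floor((90.0 - lat) * 60 / cell_minutes)
    # Points exactly on the eastern or southern edge fall in the last cell.
    col = min(max(col, 0), ncols - 1)
    row = min(max(row, 0), nrows - 1)
    return col, row


def colrow_to_center(col, row, cell_minutes):
    """Return the (lon, lat) of a cell's centroid, in degrees."""
    cell_deg = cell_minutes / 60
    return -180.0 + (col + 0.5) * cell_deg, 90.0 - (row + 0.5) * cell_deg


print(latlon_to_colrow(179.99, -89.99, 10))  # south-east corner, ten-minute grid
print(colrow_to_center(0, 0, 60))            # centroid of the north-west one-degree cell
```

    Note that the centroid of a corner cell lies half a cell inside the bounding rectangle, which is why systems that specify windows by cell centers report different limits for the same grid.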

    Raster Grid (image/map) Data Files

    Raster grid data are provided in a basic "image" structure: sequential values in the file are stored row by row, beginning with the cell at the upper-left (north-west) corner of the geographic area. For data represented as Cartesian (latitude/longitude) grids in the GLGEO directory, this is the cell whose north-west corner lies on the International Date Line at the North Pole. Rows are filled with sequential values from the data file up to the row length (the number of columns, specified in the metadata file described below), and this pattern repeats for each row from top (north) to bottom (south). An artifact of this projection is that the extreme northern and southern rows are artificially expanded longitudinally. In reality, they represent pie-shaped cells in a circular area around the pole with a radius of one pixel height. At the equator, the latitude/longitude projection yields cells that are more nearly rectangular in true ground distance. Such uneven representation across the globe is a strong argument for using other tessellations (spatial sampling schemes); to date, however, most global data have been produced in latitude/longitude coordinates. In future versions of the database, other projections may be employed, beginning with the polar regions.

    The number of values in a raster data file equals the number of rows multiplied by the number of columns. All raster data are stored as binary files, using one- or two-byte integer data types or the 4-byte IEEE floating point data type. The byte count for each file can be calculated by multiplying the number of values (rows x columns) by the number of bytes per value (1, 2, or 4). The sequencing of values in the "image" is shown in the following diagram (for a five column image):

     0   1   2   3   4
     5   6   7   8   9
    10  11  12  13  14
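
    The row-wise sequencing and the byte count described above can be written as two small formulas. This is an illustrative sketch only; the function names are hypothetical and do not come from the database software.

```python
def value_offset(column, row, columns, bytes_per_value):
    """Byte offset of the cell at zero-based (column, row) in a row-wise file."""
    return (row * columns + column) * bytes_per_value


def expected_file_size(columns, rows, bytes_per_value):
    """Exact size in bytes of a raster file with no header and no EOF mark."""
    return columns * rows * bytes_per_value


# In the five column diagram, the cell at column 2, row 1 is value number 7.
print(value_offset(2, 1, columns=5, bytes_per_value=1))
# A ten-minute global grid (2160 x 1080) stored as bytes.
print(expected_file_size(2160, 1080, 1))
```

    Comparing the expected size with the actual size of a file is a quick check that the rows, columns, and data type read from the metadata match the data.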
     

    Nested Grids and Cell Registration

    A convention of commonly registered "nested" grids has been adopted whereby various grid sizes may be used, all of which are integer multiples or divisions of each other. The possible grid cell sizes in this scheme are 2 degrees, 1 degree (60 minutes), 1/2 degree (30 minutes), 1/6 degree (10 minutes), 1/12 degree (5 minutes), 1/60 degree (1 minute), and 1/120 degree (30 seconds). This system captures most commonly available grids, and allows intercomparison of datasets at any nested scale, without altering the numerical or classed values of the data or introducing geographic aliasing effects from "nearest neighbor" resampling. The analyst must understand, of course, that the comparison is between data of different scales, and that special considerations must be made in interpretation. Nevertheless, statistical comparisons between, for example, soil or vegetation classes at 0.5 degree and similar classifications at 1 degree are useful for quality assessment purposes and for developing derivation or integration methods prior to using data in analysis or modeling. Similarly, comparisons between classed data and finer grid satellite data can be extremely useful for validation and quality control/assessment.

    Data on other "non-nested" grids, as discussed earlier, present a special problem because they require re-sampling before they can be geographically compared to data on different grids. While there are many methods for this, they all introduce statistical issues that would best be evaluated by the original investigators, after careful analysis. Examples of datasets which presented integration problems are provided. For example, methods may vary for comparing data on a 2-minute grid with data on a 5-minute grid; for comparing data centered on 1/2 degree latitude and longitude meridians with similar thematic data, identically scaled but edge-matched on 1/2 degree meridians; or for comparing data in rectangular 2.5 x 2 degree cells with regular grids at 1 degree spacing.

    Different assumptions must be made in the process depending on the nature of the data. The necessary re-sampling for such comparisons will always involve greater uncertainty than re-producing the differently scaled or registered grid from original data.

    It is clear that the problem of resampling for comparability is best solved by working with the principal investigators to re-produce the data on one of the recommended grids. This would ensure the ability to carefully document and defend resampling and interpolation methods. Nevertheless, this is not always practical, and user feedback may be important for establishing priorities. Aggregating these data to coarser scales may reduce the concern in a given study that requires only coarser information and has developed adequate aggregation methods; even so, quality assessment requires intercomparison with other data at the full sampling resolution. This is a central issue in establishing "nested-grids," and it is important to know if the proposed grid convention can adequately deal with the problem. These issues may encourage data developers to support the nested grid structure, to facilitate intercomparison for quality assessment and validation purposes.

    In cases where an original dataset does not extend to full global coverage, or has missing values, its grid was filled or padded with appropriate "no data" values to complete the global grid. Padding to a larger window than the original dataset is theoretically unnecessary for software systems that have full georeferencing (e.g., GRASS and Idrisi, although Idrisi 4.0 does not yet fully utilize its georeference information). Nevertheless, padding allows the specification of more explicit "no data" flags, preventing ambiguity and simplifying overlay in non-georeferenced systems.

    When re-gridding from fractional grid dimensions was required to represent the data within the nested-grid convention, the decision to aggregate or expand to one of the common grids was determined by knowledge of spatial variability. Simple area-weighted averaging (using the percentage of overlap of output cells with the originals) was used in some cases, if the spatial variability was known to be on the order of a pixel. This method is spatially unbiased but has the effect of "smoothing" the data. For example, the NGDC Monthly Generalized GVI was produced by aggregating in this way to 10-minute cells from approximately 8.6-minute cells of the source "plate carrée" grids. An alternative method, given similar spatial uncertainty, is to re-sample by a "nearest-neighbor" method, which has the advantage of preserving the original data values, but at the cost of some aliasing and spatial distortion (thus implying that spatial accuracy is limited by the cell size). In other cases, where the statistical validity of such averaging cannot be assessed, a more conservative approach may be taken of re-sampling the data to a finer grid in the nested structure, to preserve the original variability or actual data values within the new grid. This allows intercomparison of the original and interpolated values with other data in the nested-grid structure, and may be a reversible process; the cost in this case is storage space for the replicated data. In all other cases, where no geo-referencing problems existed, the grid representation in the database reflects the original geographic sampling.
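
    To make the idea of area-weighted averaging concrete, the following one-dimensional sketch regrids a single row of equal-width cells onto a different number of equal-width cells, weighting each input value by its fractional overlap with the output cell. It is an illustration under simplifying assumptions (one dimension, no "no data" flags, equal-area cells), not the procedure actually used to produce any dataset in the database; the function name is hypothetical.

```python
def area_weighted_regrid(values, n_out):
    """Regrid a row of equal-width cells onto n_out equal-width cells.

    Both grids are assumed to cover exactly the same extent.
    """
    n_in = len(values)
    out = []
    for j in range(n_out):
        # Output cell j spans [lo, hi), measured in input-cell widths.
        lo = j * n_in / n_out
        hi = (j + 1) * n_in / n_out
        total = 0.0
        i = int(lo)
        while i < n_in and i < hi:
            overlap = min(hi, i + 1) - max(lo, i)  # shared width with input cell i
            total += values[i] * overlap
            i += 1
        out.append(total / (hi - lo))
    return out


print(area_weighted_regrid([1, 2, 3, 4], 2))  # each output cell averages two inputs
print(area_weighted_regrid([0, 6, 0], 2))     # the middle cell is split between outputs
```

    The second call shows the "smoothing" effect mentioned above: a single high value is spread across both output cells.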

    The cell size in a given data file is determined by the global window and the number of rows and columns (not the "resolution," which refers to the resolution of the original data sampling, not necessarily the current grid). Since cell sizes can vary between raster data layers according to the nested-grid convention (even multiples or divisions of cell size), all overlay and intercomparison operations must expand or contract the maps accordingly. At present, there are few systems which do this automatically (e.g., GRASS). In all other cases, the user must expand or contract the maps in a separate operation prior to overlay or intercomparison (see implementation section for specific operations in Idrisi and GRASS). The nested-grid convention allows this to be accomplished by simple pixel replication, without interpolation. While allowing direct intercomparison, this method does not change the original spatial definition of the data, which must always be taken into consideration. (Decimation -- i.e., choosing every nth pixel -- or aggregation -- e.g., averaging or reclassifying -- to coarser grids must be done more carefully to ensure valid geographic and numerical definitions.)
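
    As an illustration of the two points above, the sketch below derives a cell size from the window and the column count, and expands a grid to a finer nested grid by pixel replication. It is not the Idrisi or GRASS operation itself; the function names are hypothetical and the grid is held as a plain list of rows.

```python
def cell_size(min_coord, max_coord, count):
    """Cell size implied by the window limits and the number of columns (or rows)."""
    return (max_coord - min_coord) / count


def replicate(grid, factor):
    """Expand a grid (a list of rows) by an integer factor, copying each value."""
    out = []
    for row in grid:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


print(cell_size(-180.0, 180.0, 2160))  # ten-minute grid, in degrees

coarse = [[1, 2],
          [3, 4]]
for row in replicate(coarse, 2):       # e.g., 1 degree cells to 1/2 degree cells
    print(row)
```

    Because every value is copied unchanged, the expanded grid can be overlaid cell by cell on a finer grid, but it still carries only the spatial definition of the coarser original.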


    Raster Metadata Files

    Each data file is accompanied by an ASCII metadata file containing the header information that the software needs to interpret the corresponding data file correctly. The following example shows a metadata file, with an explanation of each field on the right.

    SAMPLE.DOC

    file title  : Sample data file header   Name of the spatial data element
    data type   : byte                      Byte, integer, or real
    file type   : binary                    File storage method (ASCII or binary)
    columns     : 2160                      Number of vertical columns in grid
    rows        : 1080                      Number of horizontal rows in grid
    ref. system : lat/long                  Georeference system (projection)
    ref. units  : deg                       Georeference basic unit of measure
    unit dist.  : 1.0000000                 Georeference unit multiplier
    min. X      : -180.0000000              West-most geographic coordinate
    max. X      : 180.0000000               East-most geographic coordinate
    min. Y      : -90.0000000               South-most geographic coordinate
    max. Y      : 90.0000000                North-most geographic coordinate
    pos'n error : 0.1666667                 Spatial location uncertainty
    resolution  : 0.1666667                 Sampling interval of ORIGINAL data
    min. value  : 0                         Minimum data value in the dataset
    max. value  : 32                        Maximum data value in the dataset
    value units : scalars or classes        Data values units of measurement
    value error : estimate                  Estimated error of data values
    flag value  : 255                       Special value to flag non-data
    flag def'n  : no data                   Definition of flag value
    legend cats : 3                         Number of legend categories
    category 0  : Water                     Legend Category 0
    category 1  : Podzol Soils              Legend Category 1
    category 2  : Brown Podsolic Soils      etc.
    Lineage     : see documentation         Lineage line (as many as needed)
    Comment     : notes as needed           Comment line (as many as needed)
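
    For systems other than Idrisi, a metadata file of this kind can be read with a few lines of code. The sketch below assumes that an actual file holds only the field name and its value on each line, separated by a colon (the explanations on the right of the sample are documentation, not file content). The function name, the excerpt, and the two comment lines in it are hypothetical.

```python
def parse_metadata(text):
    """Return (field, value) pairs from 'field : value' lines, in file order."""
    pairs = []
    for line in text.splitlines():
        field, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank lines or lines without a separator
        pairs.append((field.strip(), value.strip()))
    return pairs


# Hypothetical excerpt of a .DOC file, using values from the sample header.
sample = """\
file title  : Sample data file header
data type   : byte
columns     : 2160
rows        : 1080
min. X      : -180.0000000
max. X      : 180.0000000
Comment     : first note
Comment     : second note
"""

pairs = parse_metadata(sample)
meta = dict(pairs)  # fine for unique fields; repeated fields keep only the last value
comments = [value for field, value in pairs if field == "Comment"]

# The cell size follows from the window and the column count, not from "resolution".
cell_size = (float(meta["max. X"]) - float(meta["min. X"])) / int(meta["columns"])
print(meta["data type"], cell_size, comments)
```

    Pairs are kept in a list because lineage and comment lines may occur as many times as needed, which a plain dictionary would collapse.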


    Vector Data Files

    Vector data are provided in either ASCII or binary file format. The ASCII format is:
    {ID#} {#Points}
    xxx.xxxxxx yy.yyyyyy
    xxx.xxxxxx yy.yyyyyy
    {ID#} {#Points}
    xxx.xxxxxx yy.yyyyyy
    xxx.xxxxxx yy.yyyyyy
    [etc. ]
    0 0
    
    
    The first line is a point, segment, or polygon label. In the label line, the first number (ID#) is a feature identification number, and the second number (#Points) is the number of coordinate pairs that follow to define the feature. Each line following the label line holds one coordinate pair of the feature: the first number (xxx.xxxxxx) is the "X" coordinate (longitude) and the second (yy.yyyyyy) is the "Y" coordinate (latitude). Each feature (point, line, or polygon) begins with a new label line, followed by as many coordinate pairs as define the feature. The last line of each vector file consists of two zeros separated by one or more spaces; this terminates the listing of features. Coordinates are in the same units as the bounding rectangle specified in the vector metadata (.DVC) file, with positive values indicating East (X) and North (Y), and negative values indicating West (X) and South (Y). In this format, point data files have only one coordinate pair defining each feature (point). Line data files can handle up to 750 coordinate pairs per feature (this limit may have been increased; check your Idrisi manual) and an unlimited number of features. Polygon data files are similar to line data files except that each feature is closed (i.e., the starting point and ending point are identical). The units and reference system for coordinate values used in the vector files are defined in their corresponding metadata files.
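
    The ASCII layout described above is simple enough to read directly. The following sketch parses it into a list of features, stopping at the terminating "0 0" line. It assumes integer feature IDs and whitespace-separated numbers; the function name and the sample coordinates are invented for illustration.

```python
def parse_vector(text):
    """Parse the ASCII vector layout into a list of (feature_id, [(x, y), ...])."""
    tokens = text.split()
    features = []
    pos = 0
    while pos + 1 < len(tokens):
        feature_id = int(tokens[pos])
        n_points = int(tokens[pos + 1])
        pos += 2
        if feature_id == 0 and n_points == 0:
            break  # the "0 0" line terminates the listing of features
        points = []
        for _ in range(n_points):
            # X (longitude) comes first, then Y (latitude).
            points.append((float(tokens[pos]), float(tokens[pos + 1])))
            pos += 2
        features.append((feature_id, points))
    return features


# Hypothetical file content: one two-point line and one closed three-point polygon.
sample = """\
101 2
-105.250000 40.010000
-105.200000 40.050000
102 3
10.000000 -5.000000
10.500000 -5.000000
10.000000 -5.000000
0 0
"""

for feature_id, points in parse_vector(sample):
    print(feature_id, len(points), points[0] == points[-1])
```

    The final column of the printout shows whether a feature is closed, which is how a polygon can be told apart from an open line in this layout.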

    Vector Metadata Files

    file title  : Sample vector file   Name of the spatial data element
    id type     : integer              Data type for feature labels (integer)
    file type   : binary               File storage method
    object type : line                 Feature (point, line or polygon)
    ref. system : lat/long             Georeference system (projection)
    ref. units  : deg                  Georeference basic unit of measure
    unit dist.  : 0.0166667            Georeference unit multiplier
    min. X      : -10800.0000000       Minimum x coordinate of study area
    max. X      : 10800.0000000        Maximum x coordinate of study area
    min. Y      : -5400.0000000        Minimum y coordinate of study area
    max. Y      : 5400.0000000         Maximum y coordinate of study area
    pos'n error : unknown              Spatial uncertainty
    resolution  : 0.01666667           Mean point spacing (sample resolution)
    comment     : notes as needed      Comment line (as many as needed)

    SPECIAL NOTES:

    (1) The bounding rectangle corresponds to the global window as described above (not the actual range of values in the file), in the spatial units specified.

    (2) The vector data and metadata files are compatible with the Idrisi version 4.0 GIS and Idrisi for Windows 2.0, which are fully georeferenced.


    Attribute and Values Files

    Attribute files contain tabular data used for re-coding an existing spatial data element or attaching additional labels to coded features. They are used most commonly with vector datasets that define a set of boundary features for which there are numerous kinds of data (e.g., political boundaries with varying statistics for each polygon, or streams with varying measurements for each coded portion of the stream). Such data would be described as having only one spatial "distribution" but many attributes or data "variables". There is no technical reason that attribute files cannot also be attached to raster data files, where each data value has a unique translation in the attribute table. Attribute files, whether attached to vector or raster data elements of a dataset, are stored either as Attribute Database Table files (in a Database Management System, or "DBMS," format) or as Attribute Values files. Database files, which can be used by the Database Workshop in Idrisi for Windows 2.0, are stored in Microsoft Access 2.0 format. Attribute Values files have two or more columns with the following structure:
    1 570 Morrison etc...
    2 860 Dakota Ridge
    3 120 Blue Mountain
    etc.
    These "flat files" may contain variable width columns of attribute data up to a total record length of 255 characters. Each attribute column may contain integer values, real (floating point) values, or character strings.

    Note: Attributes are stored as Attribute Database Tables in MS Access 2.0 format (using a .MDB file extension); however, each table is also provided in flat ASCII format in an identically named .TXT file. The .DVL metadata files (described below) refer to the database files (.MDB).


    Attribute and Values Metadata Files

    Metadata files for attributes (either for Attribute Values files or Attribute Database Tables) use a .DVL file extension. The following is an example .DVL file for a 2-column attribute values file:
    file title  : Example Attribute Doc.   Name of attribute data element
    file type   : ascii                    File storage method (ASCII only)
    records     : 3                        Number of values or codes in spatial file
    fields      : 2                        Number of attribute fields
    field 0     : identifiers              Feature ID field (matches spatial file)
    data type   : integer                  Type of field (integer, real, string)
    format      : 0                        Field format code (unused)
    field 1     : biomass                  Field definition
    data type   : integer                  Type of field (integer, real, string)
    format      : 0                        Field format code (unused)
    min. value  : 100                      Minimum value in column (null=0)
    max. value  : 2532                     Maximum value in column (null=0)
    value units : gms./sq.m.               Units for field values
    value error : unknown                  Class or value uncertainty
    flag value  : 255                      Special flags in field data
    flag def'n  : no data                  Flag definition
    field 2     : site names               Field definition
    data type   : string                   Type of field (integer, real, string)
    format      : 0                        Field format code (unused)
    min. value  : 0                        Minimum value in column (null=0)
    max. value  : 0                        Maximum value in column (null=0)
    value units : classes                  Units for field values
    value error : unknown                  Class or value uncertainty
    flag value  : none                     Special flags in field data
    flag def'n  : none                     Flag definition
    [add'l fields up to 20]                [repeat previous 9 lines for each field,
                                            up to a record length of 255 characters]
    legend cats : 3                        Number of legend categories
    category 0  : class def. zero          Category 0 (alternative legend)
    category 1  : class def. one           Category 1 (alternative legend)
    category 2  : class def. two           Category 2 (alternative legend)
    comment     : notes as needed          Comment line (as many as needed)

    Color Palette Files

    Color palette files (.PAL and .SMP) contain color code assignment metadata which can be used when displaying an image or vector file. The .SMP files are in binary format and are used by Idrisi's Display module. For the user's convenience, a similar .PAL file in ASCII format is included. These files specify red, green, and blue color intensities for each numerical code, to a precision of 1 in 64 (intensity values from 0 to 63), as in the following example:
    0        0       0       0      {solid black}
    1       23      43      63      
    2       63      63      63      {full white}
    etc.
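
    Many display systems expect color intensities from 0 to 255 rather than 0 to 63. The sketch below reads one line of an ASCII palette laid out as in the example (code, red, green, blue) and rescales the intensities. It assumes whitespace-separated integers and treats anything from a "{" onward as an annotation, which may not appear in actual .PAL files; the function names are hypothetical.

```python
def scale_intensity(value):
    """Rescale a 0-63 intensity to the 0-255 range, rounding to the nearest integer."""
    return round(value * 255 / 63)


def parse_palette_line(line):
    """Return (code, (red, green, blue)) with intensities rescaled to 0-255."""
    body = line.split("{", 1)[0]  # drop a trailing {annotation}, if present
    code, red, green, blue = (int(token) for token in body.split()[:4])
    return code, (scale_intensity(red), scale_intensity(green), scale_intensity(blue))


for line in ["0  0  0  0  {solid black}", "1  23  43  63", "2  63  63  63  {full white}"]:
    print(parse_palette_line(line))
```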

    Reference System Parameter Files

    These files (.REF) record metadata about the specific geographic reference system used for the dataset. They include data on the projection, ellipsoid, datum, and numbering conventions. These files can be used by the Idrisi PROJECT module.

    Numerical Types

    Raster data files are stored using either byte binary, integer binary, or real (floating point) binary numbers. Byte data are coded in 8 binary bits, thus allowing numbers from 0 to 255 (decimal) to be stored. Integer data are coded as two 8-bit bytes (16 bits), in "little-endian" format, i.e., with the least significant byte stored first and the most significant byte on the right. Since one bit is used for the algebraic sign, numbers from -32,768 to +32,767 (decimal) can be stored as integer data. The order of the two bytes used for integer data conforms to conventions used by IBM (DOS) and DEC. Some systems (e.g., GRASS/UNIX) require data in "big-endian" format, with the most significant byte on the left (i.e., the order of the bytes must be reversed, or "byte-swapped," to be correctly interpreted), and representation of negative numbers may also vary between systems (see SYSTEM COMPATIBILITY). Binary real data types (floating point), because of compatibility issues, are used in the database only where it is impossible to represent the full range of values in the original dataset with integers. An IEEE standard 4-byte real data type is used (this may not be readable on all machines, but is common in DOS). The numerical type for each data file is indicated in its associated metadata file, which is stored in ASCII format. Vector data files (and their associated metadata files) are all stored in ASCII format.
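
    The byte order issue can be handled explicitly when reading integer data on any system. The sketch below uses Python's standard struct module to decode 16-bit signed integers with a stated byte order, and shows a byte swap for software that expects big-endian data. The function names are illustrative, and two's-complement representation of negative numbers is assumed.

```python
import struct


def decode_int16(raw, byte_order="<"):
    """Decode signed 16-bit integers; '<' is little-endian, '>' is big-endian."""
    count = len(raw) // 2
    return list(struct.unpack(f"{byte_order}{count}h", raw[:count * 2]))


def byte_swap16(raw):
    """Exchange the two bytes of every 16-bit value (expects an even byte count)."""
    swapped = bytearray(len(raw))
    swapped[0::2] = raw[1::2]
    swapped[1::2] = raw[0::2]
    return bytes(swapped)


raw = b"\x01\x00\xff\xff\x00\x80"              # three values in little-endian order
print(decode_int16(raw))                       # read as stored in the database
print(decode_int16(byte_swap16(raw), ">"))     # same values after byte-swapping
```

    Stating the byte order in the format string, rather than relying on the machine's native order, gives the same result on any platform.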

    The ASCII format on the Version 1 CD-ROMs uses the standard DOS convention for terminating a line with two bytes, a carriage return (hexadecimal 0D) followed by a line feed (hexadecimal 0A). In DOS, ASCII files are commonly ended with a DOS end-of-file mark (hexadecimal 1A), but this EOF mark is not required by most software, including that provided with the online version, and is thus not used to avoid incompatibilities with other systems. The binary data files (.IMG) also do not have an EOF mark. File size is thus an exact byte count of the data.

