J. Pollack*, C. Gokey**, D. Kendig**,
L. Olsen*
* Goddard Space Flight Center, Greenbelt,
MD
** SSAI, Greenbelt, MD
One of the main challenges in maintaining a large directory
of Earth science metadata lies in preserving the quality of
the information in the records, while continually working
to expand the population of the catalog. NASA's Global Change
Master Directory (GCMD) is one such catalog, serving the Earth
science community as a data locator. With over 9800 data set
descriptions contributed to date, and additional descriptions
created and modified on a daily basis, maintaining the integrity
and consistency of the records is crucial to ensuring the
quality of the metadata and user confidence in the system.
Rules that govern the formation and structure of a document define its syntax. But what is meant by "valid" data or metadata? In today's world of technology, we often think of validation as pertaining only to the syntax of the data. However, to ensure a meaningful representation of the data, we must also look at the meaning or interpretation of those data - the semantics.
In the English
language both syntax and semantics play a role in the understanding
and communication of ideas, and therefore the transfer of
information. For example, the syntax of the language specifies
that a noun must have a verb acting upon it, and adjectives
can be used to clarify that noun. These words and other elements
of sentence structure are combined to form grammatical sentences.
The following sentence adheres to these syntactical rules:
"In
his red coat, the boy walked to the school."
No rules
have been broken in the syntax of the sentence and its meaning
is clearly understood. Now, let's maintain proper syntax,
but change some of the words:
"On
her fuzzy popsicle, the horse swam through a balloon."
We've kept the proper order of prepositions, nouns, verbs, and adjectives, yet now the sentence seems illogical. What has changed if we are still maintaining perfect syntax? This sentence illustrates the importance of semantics. Although the words are correct in terms of their placement and type, the actual words used are not appropriate in this context. Perfect syntax alone does not guarantee a meaningful sentence. Similarly, for metadata, perfect syntax does not guarantee a meaningful description of a data set. It is for this reason that the Global Change Master Directory uses a metadata management system to validate not only the syntax of records, but the semantics as well, in order to achieve high-quality metadata.
The advantages of achieving metadata that conform to validity
checks for syntax as well as semantics are many. The most
obvious benefit is "quality" metadata. By this,
we mean metadata that exhibit consistency across all records
and are held to some level of standardization. So, for example, all descriptions are required to have a minimum number of mandatory fields, spatial and temporal coverages are expressed in the same format, the terminology used to describe a scientific concept in one record is the same as that used to describe that concept in another record, and so on.
By achieving
this ideal of "quality" metadata, we are then able
to reap subsequent benefits in the form of improved searching
capabilities and increased user confidence. The GCMD offers many interfaces with which users can search our metadata repository. The majority of searches are conducted using the keyword search, and both it and the free-text interface are strengthened by the consistency and standardization found within the records. Keyword searches, which guarantee no zero-hit queries, would not be possible without the controlled vocabularies maintained by the GCMD. These keyword searches allow the user to click on a keyword of interest and reveal all of the data set descriptions in the database that match that keyword. With a standardized vocabulary, a user can be sure they are retrieving all documents related to "United States of America" when choosing that option, and that they do not also have to check other variations of that location, such as "America" or "USA". Similarly,
with a free-text search, a user can be assured of comparable
content among the records. Therefore, when doing a search
on "glaciers", the fact that all data set descriptions
covering this topic are keyed with the controlled hierarchical
terminology of "Earth Science > Cryosphere > Snow/Ice
> Glaciers" ensures that all records that satisfy
the query will be returned.
A similar
feature that is directly related to the use of controlled
vocabulary and our strict enforcement of validation is cross-referencing
of keywords found within the body of the metadata document.
When a user selects a title for display, all controlled keywords
indexed in the document are hyperlinked. Behind this link
is a list of all document titles in the GCMD that are indexed
with this keyword. So, for example, if the user initially
selects a data set description from the list and it contains
the location keyword of "Cameroon", this word is
hyperlinked to a list of all other data set descriptions in
the system that are keyed with this location. This allows
users to quickly find additional related data sets based on
a number of topics of interest, such as location, scientific
discipline, scientific project conducting the research, etc.
Quality-controlling our metadata also results in the benefit of increased user confidence in the GCMD system. GCMD staff makes a great effort to ensure that the information contained within the records is current and accurate. For example, several fields contain URLs for accessing
project homepages, downloading data, etc. These URLs are checked
on a monthly basis for broken or misdirected links. Similarly,
personnel information is frequently updated to ensure that phone
numbers and addresses are current. It is extremely important
to keep this information up to date, because it serves as the
user's connection to the actual data set. This sort of continual
maintenance, while costly in terms of time and resources, is
vital to promoting user trust in the system, and in the end,
well worth the investment.
Having metadata
that are viewed as high quality information also opens doors
to information and knowledge sharing, and interoperability
with other organizations. Data that exhibit this quality inherently
lend themselves to more focused, narrow searches. Because such narrow searches are possible and we maintain an open API, we are attempting to make our software available for others to build clients that can use this quality metadata and then go one step further. For example, the Distributed
Oceanographic Data System (DODS) is able to use the information
contained in our data set descriptions and then take the user
directly to the data.
A controlled vocabulary is a classified list of terms, used to
describe and index documents based on subject matter in a consistent
and predictable manner. Furthermore, it is "controlled"
because both the terms used to represent subjects, and the process by
which the terms are assigned to a particular document are controlled
or performed by a person [1]. Because natural language can be highly
ambiguous, simply matching descriptions to the terms in a given
search can result in a set of documents that are not closely or
significantly related. However, searching on terms included in the
controlled vocabulary will result in the increased retrieval of
relevant documents. Thus, a full text search will produce higher
recall, while a controlled vocabulary search will result in higher
precision [1].
Recall = number of relevant documents retrieved / number of relevant documents

Precision = number of relevant documents retrieved / number of documents retrieved
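As a worked illustration of these two measures (a hypothetical sketch; the counts and names below are ours, not GCMD figures), consider a query that retrieves 50 documents, 20 of which are relevant, out of 25 relevant documents in the catalog:

    /** Hypothetical illustration of recall and precision for a single query. */
    public class RetrievalMetrics {
        /** recall = relevant documents retrieved / relevant documents in the collection */
        static double recall(int relevantRetrieved, int relevantInCollection) {
            return relevantInCollection == 0 ? 0.0 : (double) relevantRetrieved / relevantInCollection;
        }

        /** precision = relevant documents retrieved / total documents retrieved */
        static double precision(int relevantRetrieved, int totalRetrieved) {
            return totalRetrieved == 0 ? 0.0 : (double) relevantRetrieved / totalRetrieved;
        }

        public static void main(String[] args) {
            System.out.println("Recall = " + recall(20, 25));       // 0.8
            System.out.println("Precision = " + precision(20, 50)); // 0.4
        }
    }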
The GCMD
actively maintains controlled keyword lists for use with fields
found within the DIF (Directory Interchange Format) document.
These lists consist of: Earth science parameters, locations,
platforms (sources), instruments (sensors), data centers,
campaigns (projects), URL content type, chronostratigraphic
unit, and IDN (International Directory Network) node. Personnel records are similarly maintained, ensuring that multiple entries do not exist for one person and that contact information is current. The resources required to maintain these lists are considerable; however, so are the benefits.
Currently, the most popular
search interface we offer is the keyword search, with users
conducting on average 1.37 keyword searches for every one free-text
search. This interface presents an intelligent layout of the
hierarchical science parameter keyword list, with each keyword being
a link. With a simple click of the link, the user is then presented
with a list of titles for all documents containing that keyword.
This guarantees that the user will not conduct any queries that return zero hits, and it eases the burden of searching by presenting the user with the available search terms.
Similarly, we offer keyword searches by location, data center,
source, sensor, and project, with refinement of the initial search by
any of the keyword types. This type of search interface would not be
possible without the use of controlled vocabularies.
Another
way the GCMD validates its content for syntax as well as semantics is
by requiring the use of standard formats for spatial and temporal
coverages, with only those values formatted in a specific manner
passing our internal validation. Spatial coverage consists of four
values defining the northernmost, southernmost, easternmost, and
westernmost points covered by the data set. Values must be in whole
degrees longitude or latitude, with either the +/- or N,S,E,W used to
indicate direction from the Equator or Prime Meridian. Values not
complying with these rules are considered invalid and the containing
document is not loaded into the system.
Similarly, temporal coverage must
conform to a specific format, namely YYYY-MM-DD. The field may
contain both a start and stop date; however, it can consist solely of
a start date if data collection is on-going.
Having the spatial and temporal
coverages expressed in a well-defined, expected format is beneficial
in many respects. Firstly, it allows us to write additional
validation checks into our software to ensure the semantics are
correct. For example, our MD (Master Directory) load software
inspects the spatial coverage to make sure that the southernmost
location is in fact more southerly than the northernmost location.
If this is not the case, an error is thrown to alert the user.
Similarly, with the temporal coverage, the software checks that the stop date is indeed later in time than the start date and warns the user if this is not the case. This leads to the discovery and
resolution of errors in the data set description prior to the
document being committed to the system. Enforcement of these
specific formats also allows us to use the information contained in
the fields. For example, with all records having a common format for
spatial and temporal coverages, we can offer search interfaces that
include the ability to specify a geographic coverage and time range
of interest. If, however, we allowed users to populate these fields
with any format they chose, this type of querying would not be
possible.
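The following is a minimal sketch of how such checks might be coded (our own simplified illustration, not the actual MD load software; the class and method names, regular expressions, and messages are assumptions, and range checking is omitted):

    import java.util.regex.Pattern;

    /** Simplified sketch of spatial/temporal coverage checks (not the actual MD load code). */
    public class CoverageChecks {
        // Whole degrees with a sign or a hemisphere letter, e.g. "-45", "45S", "120", "120W".
        private static final Pattern LAT = Pattern.compile("[+-]?\\d{1,2}|\\d{1,2}[NSns]");
        private static final Pattern LON = Pattern.compile("[+-]?\\d{1,3}|\\d{1,3}[EWew]");
        private static final Pattern DATE = Pattern.compile("\\d{4}-\\d{2}-\\d{2}"); // YYYY-MM-DD

        /** Syntactic check: reject values that do not follow the required formats. */
        static boolean validSyntax(String north, String south, String east, String west,
                                   String start, String stop) {
            return LAT.matcher(north).matches() && LAT.matcher(south).matches()
                && LON.matcher(east).matches() && LON.matcher(west).matches()
                && DATE.matcher(start).matches() && (stop == null || DATE.matcher(stop).matches());
        }

        /** Semantic check: the southernmost point may not lie north of the northernmost point. */
        static void checkNorthSouth(int north, int south) {
            if (south > north) {
                throw new IllegalArgumentException("southernmost value lies north of northernmost value");
            }
        }

        /** Semantic check: the stop date must not precede the start date; equal dates only warn.
            ISO YYYY-MM-DD strings compare correctly as plain strings. */
        static void checkDates(String start, String stop) {
            if (stop.compareTo(start) < 0) {
                throw new IllegalArgumentException("stop date precedes start date");
            } else if (stop.equals(start)) {
                System.err.println("Warning: start date equals stop date");
            }
        }
    }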
Prior to the advent of XML, the structure of metadata was
primarily the responsibility of the organization, with each
defining its own structure for describing metadata. As a result,
if there was a need to do anything meaningful with the metadata, it was necessary to write specific software (a parser) to access the elements of the metadata. Furthermore, to validate
the consistency of the metadata, it was necessary to write
additional specific software, translating into more code to
maintain.
XML has provided a higher level
of abstraction for describing any metadata. It does this through the
use of a Document Type Definition (DTD), which is used to express the
rules and relationships between elements in an XML document. The DTD
defines the structure of the XML document and describes the hierarchy
of how elements are nested. Tools have been created that provide the
capability to check an XML document against its DTD to ensure its
validity.
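As a rough sketch of how such a check can be done in Java with a validating parser (our illustration; the file name and the error handling are assumptions), a parse that enforces the DTD declared in the document might look like this:

    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.xml.sax.ErrorHandler;
    import org.xml.sax.SAXParseException;

    /** Sketch: parse an XML document and validate it against the DTD it declares. */
    public class DtdValidation {
        public static void main(String[] args) throws Exception {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // enforce the DTD referenced by the document's DOCTYPE
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { System.err.println("Warning: " + e.getMessage()); }
                public void error(SAXParseException e) { System.err.println("Invalid: " + e.getMessage()); }
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            // "sample_dif.xml" is a placeholder name for a DIF record carrying a DOCTYPE declaration.
            Document doc = builder.parse("sample_dif.xml");
            System.out.println("Root element: " + doc.getDocumentElement().getNodeName());
        }
    }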
For Earth science metadata the
DTD becomes extremely useful because it is possible to define the
makeup of an XML document. The DTD can subsequently be provided to
an outside organization, clearly identifying what fields are required
and the required structure of the document.
With the recent passing of the
XML Schema recommendation by the W3C, it is likely that XML Schema
will come to play a much more significant role in the validation of
XML documents. XML Schema overcomes some of the limitations of DTDs
such as not being written in XML syntax and offering very little
support for namespaces, as well as delivering much more robust
validation capabilities in the areas of document content and data
types. At the time of the initial design phase for MD8, Schema was
immature and changing, causing us to rely solely on DTD for
validation. However, with Schema now finalized and many
organizations creating stable parsers, we do plan to investigate how
we can incorporate it into our system, either in conjunction with or
as a replacement for DTD.
Yet, the DTD or Schema alone does
nothing but define the structure of an XML document. The tools are
what give that structure life. The GCMD primarily uses two tools, both written as part of the Apache project. The first of these tools
is Xerces, an XML Parser, and the second tool is an XSL stylesheet
processor called Xalan.
Xerces supports the XML 1.0 recommendation and contains advanced parser functionality for DOM, SAX, and XML Schema. The parser enables documents to be inspected on an element-by-element basis, by implementing either a tree-based or event-based interface. Using those interfaces, the software has access to the elements themselves and is free to rearrange them, query their contents, etc. We use both the SAX and DOM APIs in our application to parse documents and traverse trees when processing documents.
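For instance, a minimal SAX handler (our own sketch; "Parameters" and the file name are placeholders, not necessarily the actual DIF element names) that pulls the contents of a keyword field out of a record might look like this:

    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    /** Sketch of event-based (SAX) extraction of field contents from a DIF record. */
    public class KeywordExtractor extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private boolean inParameters;

        public void startElement(String uri, String local, String qName, Attributes atts) {
            // "Parameters" is a placeholder element name.
            inParameters = "Parameters".equals(qName);
            text.setLength(0);
        }

        public void characters(char[] ch, int start, int length) {
            if (inParameters) text.append(ch, start, length);
        }

        public void endElement(String uri, String local, String qName) {
            if (inParameters) {
                System.out.println("Keyword field: " + text.toString().trim());
                inParameters = false;
            }
        }

        public static void main(String[] args) throws Exception {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse("sample_dif.xml", new KeywordExtractor()); // placeholder file name
        }
    }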
For processing XML documents and
converting them into other flavors of XML or HTML, we use the Java
implementation of the Xalan processor. Xalan fully implements the
Extensible Stylesheet Language (XSL) Version 1.0 W3C Candidate
recommendation. We have created various XSL stylesheets using the
XSL Transformation language to convert DIF XML to FGDC (Federal
Geographic Data Committee) XML, colon-delimited DIF, and HTML. Along
with the use of controlled vocabularies and XSL, the Xalan processor
allows us to have cross-linking within our DIF.
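A minimal sketch of such a transformation through the standard JAXP transformation API (our illustration; the stylesheet and file names are placeholders):

    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;

    /** Sketch: apply an XSL stylesheet to a DIF record to produce HTML. */
    public class DifToHtml {
        public static void main(String[] args) throws Exception {
            TransformerFactory factory = TransformerFactory.newInstance();
            // "dif_to_html.xsl" and "sample_dif.xml" are placeholder names.
            Transformer transformer = factory.newTransformer(new StreamSource("dif_to_html.xsl"));
            transformer.transform(new StreamSource("sample_dif.xml"),
                                  new StreamResult("sample_dif.html"));
        }
    }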
One of the reasons the MD8 system was written in Java was to take advantage of the built-in features and packages that are widely available for handling XML and the parsing and validation of XML objects. The SAX parser and the Xerces and Xalan packages have Java implementations and are therefore easy to incorporate without investing a lot of time and effort in developing document validators. However, some validation steps specific to the DIF fields cannot be handled by any generic product, and it is these types of validation checks that had to be customized as a part of the application.
After the metadata are validated,
both syntactically and semantically, the document must be saved as a
persistent object. The DIF object and its aggregate objects such as
personnel, parameter valids, data center, etc., are stored in a
relational database. While the GCMD is currently using an Oracle
RDBMS, the business logic should not depend on the type of RDBMS
used. To insulate the application layer from the underlying
database, Java DataBase Connectivity (JDBC) is used. This layer of
abstraction allows for various database repositories to be used with
the application, thus making it much more versatile. Developers at GCMD have tested this and successfully swapped from an Oracle to a Postgres database with minimal effort. This is important because
not all nodes within the CEOS International Directory of Nodes (IDN)
use the same database vendor.
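A minimal sketch of this kind of insulation (our own illustration; the property names, connection URLs, and table layout are assumptions, not the GCMD schema):

    import java.io.FileInputStream;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.Properties;

    /** Sketch: keep the JDBC driver and URL in configuration so the RDBMS can be swapped. */
    public class DifStore {
        public static void main(String[] args) throws Exception {
            Properties config = new Properties();
            config.load(new FileInputStream("database.properties")); // placeholder config file
            // e.g. jdbc.url=jdbc:postgresql://localhost/gcmd  or  jdbc:oracle:thin:@host:1521:gcmd
            Class.forName(config.getProperty("jdbc.driver"));
            try (Connection conn = DriverManager.getConnection(
                    config.getProperty("jdbc.url"),
                    config.getProperty("jdbc.user"),
                    config.getProperty("jdbc.password"))) {
                // "dif" and its columns are placeholders for the actual relational layout.
                PreparedStatement stmt = conn.prepareStatement(
                    "INSERT INTO dif (entry_id, title) VALUES (?, ?)");
                stmt.setString(1, "EXAMPLE_ENTRY_ID");
                stmt.setString(2, "Example data set title");
                stmt.executeUpdate();
            }
        }
    }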
One objective of the IDN is to
share metadata content among national space agencies and other Earth
science organizations. After the metadata are validated and stored
in a database, the record is broadcast to other participating MD8 IDN
nodes. The objects are serialized and passed to the other sites
using Java's RMI as the object broker. RMI is used because of its lightweight appeal and ease of use within Java. CORBA was deemed too heavyweight for our case because the objects being passed are relatively small and simple. Furthermore, the applications on both sides of the call are Java based. This freed us from the burden of dealing with IDL (Interface Definition Language) and writing object descriptors.
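A bare-bones sketch of what such an exchange can look like with RMI (our illustration; the interface, class, and registry names are hypothetical, not the actual MD8 interfaces):

    import java.io.Serializable;
    import java.rmi.Naming;
    import java.rmi.Remote;
    import java.rmi.RemoteException;

    /** Sketch: a serializable DIF record passed to a remote IDN node via RMI. */
    class DifRecord implements Serializable {
        String entryId;
        String title;
    }

    /** Hypothetical remote interface exposed by a participating node. */
    interface DifReceiver extends Remote {
        void receive(DifRecord record) throws RemoteException;
    }

    public class BroadcastClient {
        public static void main(String[] args) throws Exception {
            // "rmi://idn-node.example.org/DifReceiver" is a placeholder registry URL.
            DifReceiver node = (DifReceiver) Naming.lookup("rmi://idn-node.example.org/DifReceiver");
            DifRecord record = new DifRecord();
            record.entryId = "EXAMPLE_ENTRY_ID";
            record.title = "Example data set title";
            node.receive(record); // the record is serialized and passed to the remote node
        }
    }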
While the technologies previously
mentioned certainly play a role in the goal of achieving
syntactically and semantically valid metadata, they alone cannot
accomplish the task. They are successful in performing very generic
validation only, such as those steps that can be performed on any
given document with a specified DTD. In order to conduct validation
specific to the DIF document, we were required to write our own
application that utilizes and works in conjunction with these tools
to validate the information at a more detailed level.
In looking at the life cycle of
the DIF, we are referring to step four, in which the DIF object is
promoted to a valid DIF object. At this point in the cycle, the
document has already been checked for valid elements as specified by
the DTD; however, the semantics of the document have yet to be
examined. It is here that our software steps in to take the
validation to another level, and test that the contents of the
various fields are appropriate in the context of the GCMD system and
the DIF itself.
The validation steps performed by
the GCMD software include checks of the controlled keywords,
personnel, spatial, and temporal coverages. The software
systematically steps through the document, examining the contents of
each one of the fields as it is encountered. For each of the keyword
fields, it extracts the contents of the field and compares this
against all valids of that type (science parameter, source, sensor,
etc.) currently listed in the database. If the valid in the document
cannot be located in the database, the user is notified and prompted
for some action. The operations GUI presents the user with the valid as included in the document, along with three options for the course of action. At this point the user may 1) skip this valid, 2) accept
the valid as listed in the document and create a new entry in the
database, or 3) examine the current contents of the database and
select one of these entries in place of that used in the document.
This forces the user to take a second look at the document's contents
before committing anything to the database, which in turn helps to
prevent imprecise information from finding its way into the database.
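A simplified sketch of this keyword check (our own illustration; the table and column names and the prompt text are assumptions):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    /** Sketch: compare a keyword from the document against the controlled valids in the database. */
    public class ValidCheck {
        /** Returns true if the keyword exists among the valids of the given type. */
        static boolean isKnownValid(Connection conn, String type, String keyword) throws SQLException {
            // "valids" and its columns are placeholders for the actual schema.
            PreparedStatement stmt = conn.prepareStatement(
                "SELECT 1 FROM valids WHERE valid_type = ? AND term = ?");
            stmt.setString(1, type);
            stmt.setString(2, keyword);
            try (ResultSet rs = stmt.executeQuery()) {
                return rs.next();
            }
        }

        /** If the keyword is unknown, the operator chooses to skip, add, or substitute it. */
        static void review(Connection conn, String type, String keyword) throws SQLException {
            if (!isKnownValid(conn, type, keyword)) {
                System.out.println("Unrecognized " + type + " valid: \"" + keyword + "\"");
                System.out.println("1) skip  2) add as a new valid  3) choose an existing valid");
                // ... the operations GUI would collect the operator's choice here ...
            }
        }
    }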
Personnel are handled in a very similar manner to keywords. The
person listed in the document is checked against all personnel
contained in the database for a match. If a perfect match is not found, the user is presented with the record of the person in the
document, along with all close matches of that person found in the
database. Close matches represent any person having a matching last
name, telephone or fax number, email, or address. The user is able
to view the complete record for each of these close matches, and can
then determine whether the person in the document should be one of
those already existing in the database, or if the person should in
fact be created as a new person.
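A condensed sketch of the close-match test (our illustration; the Person fields shown are assumptions):

    /** Sketch: a "close match" shares a last name, phone, fax, email, or address. */
    class Person {
        String lastName, phone, fax, email, address; // placeholder fields

        boolean closeMatch(Person other) {
            return equalsIgnoreCaseNonNull(lastName, other.lastName)
                || equalsIgnoreCaseNonNull(phone, other.phone)
                || equalsIgnoreCaseNonNull(fax, other.fax)
                || equalsIgnoreCaseNonNull(email, other.email)
                || equalsIgnoreCaseNonNull(address, other.address);
        }

        private static boolean equalsIgnoreCaseNonNull(String a, String b) {
            return a != null && b != null && a.equalsIgnoreCase(b);
        }
    }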
Spatial and temporal coverages do
not use keywords, but instead use standard formats, which can then be
tested for appropriate values. For example, if the spatial coverage
specifies that the westernmost point is 165E and the easternmost
point is 165W, the software will warn the user that the coverage
crosses the International Date Line. While the coordinates may be correct, it is also possible that the user inadvertently transposed the easternmost point for the westernmost, and this check would catch such an error. Similar checks exist for crossing the Prime Meridian, as
well as values indicating the coverage is a transect or point. Tests
performed on the temporal coverage include checking values that would
indicate a start date after a stop date (an error), or a start date
equal to a stop date (a warning).
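As a simplified illustration of these warnings (our own sketch; the signed-degree convention and messages are assumptions):

    /** Sketch: inspect a spatial coverage in signed whole degrees and warn about likely entry errors. */
    public class CoverageWarnings {
        static void inspect(int north, int south, int east, int west) {
            if (west > east) {
                // e.g. west = 165 (165E) and east = -165 (165W): either the coverage crosses
                // the International Date Line or the two longitudes were transposed.
                System.err.println("Warning: coverage crosses the International Date Line"
                    + " (or the easternmost and westernmost values are transposed)");
            }
            if (north == south && east == west) {
                System.err.println("Note: coverage describes a single point");
            } else if (north == south || east == west) {
                System.err.println("Note: coverage describes a transect");
            }
        }
    }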
With the advent of XML and Java technologies, the validation
of data, and in this case metadata, has been made significantly
easier. However, the focus of available validation techniques
lies in the syntax, leaving the semantics of the data untested.
To test for semantically valid data, it is necessary to employ
additional technologies such as controlled vocabulary and
standard formats. At the GCMD, we have developed a metadata
management system that utilizes all of these methods in conjunction
with one another to achieve the goal of incorporating only
quality metadata into our system. Of course, such a process is not infallible, so the possibility of erroneous data does exist; however, our system has been devised to minimize this potential as much as possible.
[1] Rowley, Jennifer. "The controlled versus natural indexing language debate revisited: a perspective on information retrieval practice and research", Journal of Information Science, 20 (2), pp. 108-119, 1994.