J. Pollack*, C. Gokey**, D. Kendig**,
L. Olsen*
* Goddard Space Flight Center, Greenbelt,
MD
** SSAI, Greenbelt, MD
One of the main challenges in maintaining a large directory
of Earth science metadata lies in preserving the quality of
the information in the records, while continually working
to expand the population of the catalog. NASA's Global Change
Master Directory (GCMD) is one such catalog, serving the Earth
science community as a data locator. With over 9800 data set
descriptions contributed to date, and additional descriptions
created and modified on a daily basis, maintaining the integrity
and consistency of the records is crucial to ensuring the
quality of the metadata and user confidence in the system.
Rules that govern the formation and structure of a document define its syntax. But what is meant by "valid" data or metadata? In today's world of technology, we often think of validation as pertaining only to the syntax of the data. However, to ensure a meaningful representation of the data, we must also look at the meaning or interpretation of those data - the semantics.
In the English
language both syntax and semantics play a role in the understanding
and communication of ideas, and therefore the transfer of
information. For example, the syntax of the language specifies
that a noun must have a verb acting upon it, and adjectives
can be used to clarify that noun. These words and other elements
of sentence structure are combined to form grammatical sentences.
The following sentence adheres to these syntactical rules:
"In
his red coat, the boy walked to the school."
No rules
have been broken in the syntax of the sentence and its meaning
is clearly understood. Now, let's maintain proper syntax,
but change some of the words:
"On
her fuzzy popsicle, the horse swam through a balloon."
We've kept the proper order of prepositions, nouns, verbs, and adjectives, yet now the sentence seems illogical. What has changed if we are still maintaining perfect syntax? This sentence illustrates the importance of semantics. Although the words are correct in terms of their placement and type, the actual words used are not appropriate in this context. Perfect syntax alone does not guarantee a meaningful sentence. Similarly, for metadata, perfect syntax does not guarantee a meaningful description of a data set. It is for this reason that the Global Change Master Directory uses a metadata management system to validate not only the syntax of records, but the semantics as well, in order to achieve high-quality metadata.
The advantages of achieving metadata that conform to validity
checks for syntax as well as semantics are many. The most
obvious benefit is "quality" metadata. By this,
we mean metadata that exhibit consistency across all records
and are held to some level of standardization. So, for example, all descriptions are required to have a minimum number of mandatory fields, spatial and temporal coverages are expressed in the same format, the terminology used to describe a scientific concept in one record is the same as that used to describe that concept in another record, and so on.
By achieving
this ideal of "quality" metadata, we are then able
to reap subsequent benefits in the form of improved searching
capabilities and increased user confidence. The GCMD offers many interfaces with which users can search our metadata repository. The majority of searches are conducted using the keyword search, and both it and the free-text interface are strengthened by the consistency and standardization found within the records. Keyword searches, which guarantee no zero-hit queries, would not be possible without the controlled vocabularies maintained by the GCMD. These keyword searches allow the user to click on a keyword of interest and reveal all of the data set descriptions in the database that match that keyword. With a standardized vocabulary, a user can be sure they are retrieving all documents related to "United States of America" when choosing that option, and that they do not also have to check other variations of that location, such as "America" or "USA". Similarly,
with a free-text search, a user can be assured of comparable
content among the records. Therefore, when doing a search
on "glaciers", the fact that all data set descriptions
covering this topic are keyed with the controlled hierarchical
terminology of "Earth Science > Cryosphere > Snow/Ice
> Glaciers" ensures that all records that satisfy
the query will be returned.
A similar
feature that is directly related to the use of controlled
vocabulary and our strict enforcement of validation is cross-referencing
of keywords found within the body of the metadata document.
When a user selects a title for display, all controlled keywords
indexed in the document are hyperlinked. Behind this link
is a list of all document titles in the GCMD that are indexed
with this keyword. So, for example, if the user initially
selects a data set description from the list and it contains
the location keyword of "Cameroon", this word is
hyperlinked to a list of all other data set descriptions in
the system that are keyed with this location. This allows
users to quickly find additional related data sets based on
a number of topics of interest, such as location, scientific
discipline, scientific project conducting the research, etc.
Quality-controlling our metadata also results in the benefit of increased user confidence in the GCMD system. GCMD staff makes a great effort to ensure that the information contained within the records is current and accurate. For example, several fields contain URLs for accessing
project homepages, downloading data, etc. These URLs are checked
on a monthly basis for broken or misdirected links. Similarly,
personnel information is frequently updated to ensure that phone
numbers and addresses are current. It is extremely important
to keep this information up to date, because it serves as the
user's connection to the actual data set. This sort of continual
maintenance, while costly in terms of time and resources, is
vital to promoting user trust in the system, and in the end,
well worth the investment.
Having metadata
that are viewed as high quality information also opens doors
to information and knowledge sharing, and interoperability
with other organizations. Data that exhibit this quality inherently
lend themselves to more focused, narrow searches. Because such narrow searches are possible and we maintain an open API, we are attempting to make our software available for others to build clients that can use this quality metadata and then go one step further. For example, the Distributed
Oceanographic Data System (DODS) is able to use the information
contained in our data set descriptions and then take the user
directly to the data.
A controlled vocabulary is a classified list of terms, used to
describe and index documents based on subject matter in a consistent
and predictable manner. Furthermore, it is "controlled"
because both the terms used to represent subjects, and the process by
which the terms are assigned to a particular document are controlled
or performed by a person [1]. Because natural language can be highly
ambiguous, simply matching descriptions to the terms in a given
search can result in a set of documents that are not closely or
significantly related. However, searching on terms included in the
controlled vocabulary will result in the increased retrieval of
relevant documents. Thus, a full text search will produce higher
recall, while a controlled vocabulary search will result in higher
precision [1].
Recall = number of relevant documents retrieved / number of relevant documents

Precision = number of relevant documents retrieved / number of documents retrieved
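As a worked illustration of these two measures (a hypothetical sketch; the counts and names below are ours, not GCMD figures), consider a query that retrieves 50 documents, 20 of which are relevant, out of 25 relevant documents in the catalog:

    /** Hypothetical illustration of recall and precision for a single query. */
    public class RetrievalMetrics {
        /** recall = relevant documents retrieved / relevant documents in the collection */
        static double recall(int relevantRetrieved, int relevantInCollection) {
            return relevantInCollection == 0 ? 0.0 : (double) relevantRetrieved / relevantInCollection;
        }

        /** precision = relevant documents retrieved / total documents retrieved */
        static double precision(int relevantRetrieved, int totalRetrieved) {
            return totalRetrieved == 0 ? 0.0 : (double) relevantRetrieved / totalRetrieved;
        }

        public static void main(String[] args) {
            System.out.println("Recall = " + recall(20, 25));       // 0.8
            System.out.println("Precision = " + precision(20, 50)); // 0.4
        }
    }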
The GCMD
actively maintains controlled keyword lists for use with fields
found within the DIF (Directory Interchange Format) document.
These lists consist of: Earth science parameters, locations,
platforms (sources), instruments (sensors), data centers,
campaigns (projects), URL content type, chronostratigraphic
unit, and IDN (International Directory Network) node. Personnel records are similarly maintained, ensuring that multiple entries do not exist for one person and that contact information is current. The resources required to maintain these lists are considerable; however, so are the benefits.
Currently, the most popular
search interface we offer is the keyword search, with users
conducting on average 1.37 keyword searches for every one free-text
search. This interface presents an intelligent layout of the
hierarchical science parameter keyword list, with each keyword being
a link. With a simple click of the link, the user is then presented
with a list of titles for all documents containing that keyword.
This guarantees that the user will not conduct any queries that return zero hits, and it eases the burden of searching by presenting the user with the available search terms.
Similarly, we offer keyword searches by location, data center,
source, sensor, and project, with refinement of the initial search by
any of the keyword types. This type of search interface would not be
possible without the use of controlled vocabularies.
Another
way the GCMD validates its content for syntax as well as semantics is
by requiring the use of standard formats for spatial and temporal
coverages, with only those values formatted in a specific manner
passing our internal validation. Spatial coverage consists of four
values defining the northernmost, southernmost, easternmost, and
westernmost points covered by the data set. Values must be in whole
degrees longitude or latitude, with either the +/- or N,S,E,W used to
indicate direction from the Equator or Prime Meridian. Values not
complying with these rules are considered invalid and the containing
document is not loaded into the system.
Similarly, temporal coverage must
conform to a specific format, namely YYYY-MM-DD. The field may
contain both a start and stop date; however, it can consist solely of
a start date if data collection is on-going.
Having the spatial and temporal
coverages expressed in a well-defined, expected format is beneficial
in many respects. Firstly, it allows us to write additional
validation checks into our software to ensure the semantics are
correct. For example, our MD (Master Directory) load software
inspects the spatial coverage to make sure that the southernmost
location is in fact more southerly than the northernmost location.
If this is not the case, an error is thrown to alert the user.
Similarly, with the temporal coverage, the software checks that the stop date is indeed later in time than the start date and warns the user if this is not the case. This leads to the discovery and
resolution of errors in the data set description prior to the
document being committed to the system. Enforcement of these
specific formats also allows us to use the information contained in
the fields. For example, with all records having a common format for
spatial and temporal coverages, we can offer search interfaces that
include the ability to specify a geographic coverage and time range
of interest. If, however, we allowed users to populate these fields
with any format they chose, this type of querying would not be
possible.
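The following is a minimal sketch of how such checks might be coded (our own simplified illustration, not the actual MD load software; the class and method names, regular expressions, and messages are assumptions, and range checking is omitted):

    import java.util.regex.Pattern;

    /** Simplified sketch of spatial/temporal coverage checks (not the actual MD load code). */
    public class CoverageChecks {
        // Whole degrees with a sign or a hemisphere letter, e.g. "-45", "45S", "120", "120W".
        private static final Pattern LAT = Pattern.compile("[+-]?\\d{1,2}|\\d{1,2}[NSns]");
        private static final Pattern LON = Pattern.compile("[+-]?\\d{1,3}|\\d{1,3}[EWew]");
        private static final Pattern DATE = Pattern.compile("\\d{4}-\\d{2}-\\d{2}"); // YYYY-MM-DD

        /** Syntactic check: reject values that do not follow the required formats. */
        static boolean validSyntax(String north, String south, String east, String west,
                                   String start, String stop) {
            return LAT.matcher(north).matches() && LAT.matcher(south).matches()
                && LON.matcher(east).matches() && LON.matcher(west).matches()
                && DATE.matcher(start).matches() && (stop == null || DATE.matcher(stop).matches());
        }

        /** Semantic check: the southernmost point may not lie north of the northernmost point. */
        static void checkNorthSouth(int north, int south) {
            if (south > north) {
                throw new IllegalArgumentException("southernmost value lies north of northernmost value");
            }
        }

        /** Semantic check: the stop date must not precede the start date; equal dates only warn.
            ISO YYYY-MM-DD strings compare correctly as plain strings. */
        static void checkDates(String start, String stop) {
            if (stop.compareTo(start) < 0) {
                throw new IllegalArgumentException("stop date precedes start date");
            } else if (stop.equals(start)) {
                System.err.println("Warning: start date equals stop date");
            }
        }
    }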
Prior to the advent of XML, the structure of metadata was
primarily the responsibility of the organization, with each
defining its own structure for describing metadata. As a result,
if there was a need to do anything meaningful with the metadata, it was necessary to write specific software (a parser) to access the elements of the metadata. Furthermore, to validate
the consistency of the metadata, it was necessary to write
additional specific software, translating into more code to
maintain.
XML has provided a higher level
of abstraction for describing any metadata. It does this through the
use of a Document Type Definition (DTD), which is used to express the
rules and relationships between elements in an XML document. The DTD
defines the structure of the XML document and describes the hierarchy
of how elements are nested. Tools have been created that provide the
capability to check an XML document against its DTD to ensure its
validity.
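As a rough sketch of how such a check can be done in Java with a validating parser (our illustration; the file name and the error handling are assumptions), a parse that enforces the DTD declared in the document might look like this:

    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.xml.sax.ErrorHandler;
    import org.xml.sax.SAXParseException;

    /** Sketch: parse an XML document and validate it against the DTD it declares. */
    public class DtdValidation {
        public static void main(String[] args) throws Exception {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // enforce the DTD referenced by the document's DOCTYPE
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { System.err.println("Warning: " + e.getMessage()); }
                public void error(SAXParseException e) { System.err.println("Invalid: " + e.getMessage()); }
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            // "sample_dif.xml" is a placeholder name for a DIF record carrying a DOCTYPE declaration.
            Document doc = builder.parse("sample_dif.xml");
            System.out.println("Root element: " + doc.getDocumentElement().getNodeName());
        }
    }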
For Earth science metadata the
DTD becomes extremely useful because it is possible to define the
makeup of an XML document. The DTD can subsequently be provided to
an outside organization, clearly identifying what fields are required
and the required structure of the document.
With the recent passing of the
XML Schema recommendation by the W3C, it is likely that XML Schema
will come to play a much more significant role in the validation of
XML documents. XML Schema overcomes some of the limitations of DTDs
such as not being written in XML syntax and offering very little
support for namespaces, as well as delivering much more robust
validation capabilities in the areas of document content and data
types. At the time of the initial design phase for MD8, Schema was
immature and changing, causing us to rely solely on DTD for
validation. However, with Schema now finalized and many
organizations creating stable parsers, we do plan to investigate how
we can incorporate it into our system, either in conjunction with or
as a replacement for DTD.
Yet, the DTD or Schema alone does
nothing but define the structure of an XML document. The tools are
what give that structure life. The GCMD primarily uses two tools, both written as part of the Apache project. The first of these tools
is Xerces, an XML Parser, and the second tool is an XSL stylesheet
processor called Xalan.
Xerces supports the XML 1.0 recommendation and contains advanced parser functionality for DOM, SAX, and XML Schema. The parser enables documents to be inspected on an element-by-element basis, by implementing either a tree-based or event-based interface. Using those interfaces, the software has access to the elements themselves and is free to rearrange them, query their contents, etc. We use both the SAX and DOM APIs in our application to parse documents and traverse trees when processing documents.
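For instance, a minimal SAX handler (our own sketch; "Parameters" and the file name are placeholders, not necessarily the actual DIF element names) that pulls the contents of a keyword field out of a record might look like this:

    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    /** Sketch of event-based (SAX) extraction of field contents from a DIF record. */
    public class KeywordExtractor extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private boolean inParameters;

        public void startElement(String uri, String local, String qName, Attributes atts) {
            // "Parameters" is a placeholder element name.
            inParameters = "Parameters".equals(qName);
            text.setLength(0);
        }

        public void characters(char[] ch, int start, int length) {
            if (inParameters) text.append(ch, start, length);
        }

        public void endElement(String uri, String local, String qName) {
            if (inParameters) {
                System.out.println("Keyword field: " + text.toString().trim());
                inParameters = false;
            }
        }

        public static void main(String[] args) throws Exception {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse("sample_dif.xml", new KeywordExtractor()); // placeholder file name
        }
    }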
For processing XML documents and
converting them into other flavors of XML or HTML, we use the Java
implementation of the Xalan processor. Xalan fully implements the
Extensible Stylesheet Language (XSL) Version 1.0 W3C Candidate
recommendation. We have created various XSL stylesheets using the
XSL Transformation language to convert DIF XML to FGDC (Federal
Geographic Data Committee) XML, colon-delimited DIF, and HTML. Along
with the use of controlled vocabularies and XSL, the Xalan processor
allows us to have cross-linking within our DIF.
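A minimal sketch of such a transformation through the standard JAXP transformation API (our illustration; the stylesheet and file names are placeholders):

    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;

    /** Sketch: apply an XSL stylesheet to a DIF record to produce HTML. */
    public class DifToHtml {
        public static void main(String[] args) throws Exception {
            TransformerFactory factory = TransformerFactory.newInstance();
            // "dif_to_html.xsl" and "sample_dif.xml" are placeholder names.
            Transformer transformer = factory.newTransformer(new StreamSource("dif_to_html.xsl"));
            transformer.transform(new StreamSource("sample_dif.xml"),
                                  new StreamResult("sample_dif.html"));
        }
    }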
One of the reasons the MD8 system was written in Java was to take advantage of the built-in features and packages that are widely available for handling XML and the parsing and validation of XML objects. The SAX parser and the Xerces and Xalan packages have Java implementations and are therefore easy to incorporate without investing a lot of time and effort in developing document validators. However, some validation steps specific to the DIF fields cannot be handled by any generic product, and it is these types of validation checks that had to be customized as a part of the application.
After the metadata are validated,
both syntactically and semantically, the document must be saved as a
persistent object. The DIF object and its aggregate objects such as
personnel, parameter valids, data center, etc., are stored in a
relational database. While the GCMD is currently using an Oracle
RDBMS, the business logic should not depend on the type of RDBMS
used. To insulate the application layer from the underlying
database, Java DataBase Connectivity (JDBC) is used. This layer of
abstraction allows for various database repositories to be used with
the application, thus making it much more versatile. Developers at GCMD have tested this and successfully swapped from an Oracle to a Postgres database with minimal effort. This is important because
not all nodes within the CEOS International Directory of Nodes (IDN)
use the same database vendor.
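A minimal sketch of this kind of insulation (our own illustration; the property names, connection URLs, and table layout are assumptions, not the GCMD schema):

    import java.io.FileInputStream;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.Properties;

    /** Sketch: keep the JDBC driver and URL in configuration so the RDBMS can be swapped. */
    public class DifStore {
        public static void main(String[] args) throws Exception {
            Properties config = new Properties();
            config.load(new FileInputStream("database.properties")); // placeholder config file
            // e.g. jdbc.url=jdbc:postgresql://localhost/gcmd  or  jdbc:oracle:thin:@host:1521:gcmd
            Class.forName(config.getProperty("jdbc.driver"));
            try (Connection conn = DriverManager.getConnection(
                    config.getProperty("jdbc.url"),
                    config.getProperty("jdbc.user"),
                    config.getProperty("jdbc.password"))) {
                // "dif" and its columns are placeholders for the actual relational layout.
                PreparedStatement stmt = conn.prepareStatement(
                    "INSERT INTO dif (entry_id, title) VALUES (?, ?)");
                stmt.setString(1, "EXAMPLE_ENTRY_ID");
                stmt.setString(2, "Example data set title");
                stmt.executeUpdate();
            }
        }
    }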
One objective of the IDN is to
share metadata content among national space agencies and other Earth
science organizations. After the metadata are validated and stored
in a database, the record is broadcast to other participating MD8 IDN
nodes. The objects are serialized and passed to the other sites
using Java's RMI as the object broker. RMI is used because of its lightweight appeal and ease of use within Java. CORBA was deemed too heavyweight for our case because the objects being passed are relatively small and simple. Furthermore, the applications on both sides of the call are Java based. This freed us from the burden of dealing with IDL (Interface Definition Language) and writing object descriptors.
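A bare-bones sketch of what such an exchange can look like with RMI (our illustration; the interface, class, and registry names are hypothetical, not the actual MD8 interfaces):

    import java.io.Serializable;
    import java.rmi.Naming;
    import java.rmi.Remote;
    import java.rmi.RemoteException;

    /** Sketch: a serializable DIF record passed to a remote IDN node via RMI. */
    class DifRecord implements Serializable {
        String entryId;
        String title;
    }

    /** Hypothetical remote interface exposed by a participating node. */
    interface DifReceiver extends Remote {
        void receive(DifRecord record) throws RemoteException;
    }

    public class BroadcastClient {
        public static void main(String[] args) throws Exception {
            // "rmi://idn-node.example.org/DifReceiver" is a placeholder registry URL.
            DifReceiver node = (DifReceiver) Naming.lookup("rmi://idn-node.example.org/DifReceiver");
            DifRecord record = new DifRecord();
            record.entryId = "EXAMPLE_ENTRY_ID";
            record.title = "Example data set title";
            node.receive(record); // the record is serialized and passed to the remote node
        }
    }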
While the technologies previously
mentioned certainly play a role in the goal of achieving
syntactically and semantically valid metadata, they alone cannot
accomplish the task. They are successful in performing very generic
validation only, such as those steps that can be performed on any
given document with a specified DTD. In order to conduct validation
specific to the DIF document, we were required to write our own
application that utilizes and works in conjunction with these tools
to validate the information at a more detailed level.
In looking at the life cycle of
the DIF, we are referring to step four, in which the DIF object is
promoted to a valid DIF object. At this point in the cycle, the
document has already been checked for valid elements as specified by
the DTD; however, the semantics of the document have yet to be
examined. It is here that our software steps in to take the
validation to another level, and test that the contents of the
various fields are appropriate in the context of the GCMD system and
the DIF itself.
The validation steps performed by
the GCMD software include checks of the controlled keywords,
personnel, spatial, and temporal coverages. The software
systematically steps through the document, examining the contents of
each one of the fields as it is encountered. For each of the keyword
fields, it extracts the contents of the field and compares this
against all valids of that type (science parameter, source, sensor,
etc.) currently listed in the database. If the valid in the document
cannot be located in the database, the user is notified and prompted
for some action. The operations GUI presents the user with the valid as included in the document, along with three options for the course of action. At this point the user may 1) skip this valid, 2) accept
the valid as listed in the document and create a new entry in the
database, or 3) examine the current contents of the database and
select one of these entries in place of that used in the document.
This forces the user to take a second look at the document's contents
before committing anything to the database, which in turn helps to
prevent imprecise information from finding its way into the database.
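A simplified sketch of this keyword check (our own illustration; the table and column names and the prompt text are assumptions):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    /** Sketch: compare a keyword from the document against the controlled valids in the database. */
    public class ValidCheck {
        /** Returns true if the keyword exists among the valids of the given type. */
        static boolean isKnownValid(Connection conn, String type, String keyword) throws SQLException {
            // "valids" and its columns are placeholders for the actual schema.
            PreparedStatement stmt = conn.prepareStatement(
                "SELECT 1 FROM valids WHERE valid_type = ? AND term = ?");
            stmt.setString(1, type);
            stmt.setString(2, keyword);
            try (ResultSet rs = stmt.executeQuery()) {
                return rs.next();
            }
        }

        /** If the keyword is unknown, the operator chooses to skip, add, or substitute it. */
        static void review(Connection conn, String type, String keyword) throws SQLException {
            if (!isKnownValid(conn, type, keyword)) {
                System.out.println("Unrecognized " + type + " valid: \"" + keyword + "\"");
                System.out.println("1) skip  2) add as a new valid  3) choose an existing valid");
                // ... the operations GUI would collect the operator's choice here ...
            }
        }
    }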
Personnel are handled in a very similar manner to keywords. The
person listed in the document is checked against all personnel
contained in the database for a match. If a perfect match is not found, the user is presented with the record of the person in the
document, along with all close matches of that person found in the
database. Close matches represent any person having a matching last
name, telephone or fax number, email, or address. The user is able
to view the complete record for each of these close matches, and can
then determine whether the person in the document should be one of
those already existing in the database, or if the person should in
fact be created as a new person.
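A condensed sketch of the close-match test (our illustration; the Person fields shown are assumptions):

    /** Sketch: a "close match" shares a last name, phone, fax, email, or address. */
    class Person {
        String lastName, phone, fax, email, address; // placeholder fields

        boolean closeMatch(Person other) {
            return equalsIgnoreCaseNonNull(lastName, other.lastName)
                || equalsIgnoreCaseNonNull(phone, other.phone)
                || equalsIgnoreCaseNonNull(fax, other.fax)
                || equalsIgnoreCaseNonNull(email, other.email)
                || equalsIgnoreCaseNonNull(address, other.address);
        }

        private static boolean equalsIgnoreCaseNonNull(String a, String b) {
            return a != null && b != null && a.equalsIgnoreCase(b);
        }
    }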
Spatial and temporal coverages do
not use keywords, but instead use standard formats, which can then be
tested for appropriate values. For example, if the spatial coverage
specifies that the westernmost point is 165E and the easternmost
point is 165W, the software will warn the user that the coverage
crosses the International Date Line. While the coordinates may be correct, it is also possible that the user inadvertently transposed the easternmost point for the westernmost, and this check would catch such an error. Similar checks exist for crossing the Prime Meridian, as
well as values indicating the coverage is a transect or point. Tests
performed on the temporal coverage include checking values that would
indicate a start date after a stop date (an error), or a start date
equal to a stop date (a warning).
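As a simplified illustration of these warnings (our own sketch; the signed-degree convention and messages are assumptions):

    /** Sketch: inspect a spatial coverage in signed whole degrees and warn about likely entry errors. */
    public class CoverageWarnings {
        static void inspect(int north, int south, int east, int west) {
            if (west > east) {
                // e.g. west = 165 (165E) and east = -165 (165W): either the coverage crosses
                // the International Date Line or the two longitudes were transposed.
                System.err.println("Warning: coverage crosses the International Date Line"
                    + " (or the easternmost and westernmost values are transposed)");
            }
            if (north == south && east == west) {
                System.err.println("Note: coverage describes a single point");
            } else if (north == south || east == west) {
                System.err.println("Note: coverage describes a transect");
            }
        }
    }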
With the advent of XML and Java technologies, the validation
of data, and in this case metadata, has been made significantly
easier. However, the focus of available validation techniques
lies in the syntax, leaving the semantics of the data untested.
To test for semantically valid data, it is necessary to employ
additional technologies such as controlled vocabulary and
standard formats. At the GCMD, we have developed a metadata
management system that utilizes all of these methods in conjunction
with one another to achieve the goal of incorporating only
quality metadata into our system. Of course, such a process is not infallible, so the possibility of erroneous data does exist; however, our system has been devised to minimize this potential as much as possible.
[1] Rowley, Jennifer. "The controlled versus natural indexing language debate revisited: a perspective on information retrieval practice and research", Journal of Information Science, 20 (2), pp. 108-119, 1994.