AMIA Annu Symp Proc. 2003; 2003: 279–283.
PMCID: PMC1480320
Enhancing Quality of Retrieval Through Concept Edit History
Francis W. Hartel, Ph.D.,1 Gilberto Fragoso, Ph.D.,1 Kim L. Ong, Ph.D.,2 and Robert Dionne, M.S.3
1 Center for Bioinformatics, National Cancer Institute, Bethesda MD
2 Northrop Grumman Mission Systems, Reston VA
3 Apelon Inc, Ridgefield CT
Abstract
The NCI Thesaurus™ is a public domain description logic-based terminology produced by the National Cancer Institute. The NCI Thesaurus™ is used to support storage and retrieval of scientific, clinical and research administration data. The content of the NCI Thesaurus™ evolves rapidly. We have developed a representation of concept change over time and have implemented software to capture concept change in our multi-editor concurrent vocabulary development environment. We are now implementing software to extend our vocabulary server, public APIs and our file-based distributions of the Thesaurus to provide access to the concept-level history information.
BACKGROUND

This paper describes the concept-level history tracking mechanism used in the NCI Thesaurus™ and the software used to implement and publish the Thesaurus history data. The NCI Thesaurus™ is a description logic-based vocabulary produced by the National Cancer Institute (NCI). The Thesaurus has been designed to satisfy Cimino’s desiderata1. NCI uses the Apelon Terminology Development Environment© (TDE) and the Apelon Workflow© products to create and maintain the Thesaurus, and the Apelon Distributed Terminology Server© (DTS) to host the vocabulary. NCI makes the NCI Thesaurus™ freely available to the public. Public application programming interfaces and distribution files of the Thesaurus are available through the NCI caCORE© (see http://ncicb.nci.nih.gov/core).

The NCI Thesaurus™ is intended to encompass all the terminology used by the NCI in the course of its operations. The NCI Thesaurus™ is domain specialized. It is deep and complex compared to most broad clinical vocabularies. It embodies rich semantic interrelationships between the nodes of its taxonomies. The semantic relationships asserted in the Thesaurus are intended to facilitate translational research2 and to support the bioinformatics infrastructure of the Institute.3 The February 2003 release of the NCI Thesaurus™ contained about 26,000 concepts and about 71,000 terms divided among 24 taxonomies. The taxonomies cover administrative, applied and basic science, and clinical terminology. Given the rapid pace of research findings and clinical refinement related to cancer prevention and treatment, the content of the Thesaurus evolves rapidly. We have adopted a monthly release cycle to keep the content of the Thesaurus as current as possible. This short update cycle and the fairly broad area of coverage have led us to adopt a concurrent model4 of vocabulary development.

The Thesaurus is used by NCI and by external organizations as a source of concepts; these concepts are then used as tags, codes or other markers that are assigned to artifacts when they are stored in data repositories. Similarly, the Thesaurus is used as a source of retrieval codes. The cancer images portal http://cancerimages.nci.nih.gov/caIMAGE/index.jsp is an example of a resource that depends on the Thesaurus in this way.

STATEMENT of PROBLEM

The codes or retrieval keys applied to information artifacts stored in such repositories are essentially static, reflecting the content and structure of the NCI Thesaurus™ at the time the information was stored. Retrieval operations, especially operations like explosion and aggregation, will produce incorrect results if the content or structure of the Thesaurus has changed since the information in the repository was stored. Retrieval depends on the current state of the Thesaurus; storage reflects a prior state. We have devised and implemented a concept-level history mechanism that allows retrieval software to assemble queries that correctly retrieve artifacts stored over time, effectively resolving these sorts of retrieval problems.

For example, consider the concept Oncogene ras. Subsequent to the identification of Oncogene ras, it was found that there are multiple ras genes. The concepts Oncogene ras, HRAS Gene, and KRAS Gene could have, therefore, been used at different times to code information in repositories. When searching for information in such repositories, how is one to decide if a search for HRAS Gene also ought to search for information coded with an Oncogene ras tag? How would one know at what point in time the Oncogene ras fell out of use? How would one even know to consider the issue of change in information tagging over time?

The remainder of this paper describes the approach we have taken to produce concept-level history and to make it available to users of the NCI Thesaurus.

APPROACH

In the NCI Thesaurus™ we distinguish among several types of editing actions that result in changes to the vocabulary that affect retrieval operations in external repositories. These actions are: Create, Modify, Split, Merge, and Retire. Table 1 defines these actions.

Table 1. Concept Edit Actions

Standard development tools. The standard version of the Apelon TDE version 2.2.1, which is our current base development platform, logs only three edit actions: create, modify and delete. Although it is possible to produce the effect of splitting and merging concepts with the standard TDE tool, insufficient information about these actions is recorded to avoid problems with interpretability of the results. Concepts created through split-like actions are confounded with newly created concepts. For example, consider deprecating the concept Oncogene ras and creating HRAS Gene and KRAS Gene as replacement concepts. The relationship of the new ras concepts to their predecessor concept would be undefined in the standard TDE. The user of the Thesaurus would have to guess whether HRAS Gene and KRAS Gene took the place of Oncogene ras, or whether they were newer alternative concepts. In the context of retrieval from a repository, should a search for HRAS Gene also search for information coded with Oncogene ras? Without an explicit split operation, the Thesaurus would be silent on the issue.

Similarly, concepts merged into other concepts would be indistinguishable from deleted concepts. The inability of the standard form of the TDE data to explicitly track merge and split edit actions will result in low recall rates in information retrieval operations that depend on the namespace.

NCI Thesaurus™ development tools. To implement concept-level history tracking, the TDE tool was extended to support split and merge operations. In addition, we disallowed deletions and implemented a retirement scheme instead; the TDE was also extended to support this operation. These extensions are presented to the user as modal dialog boxes that the user calls from the TDE tool bar. We used modal dialog boxes to ensure that the split, merge, and retire operations would not be left in an incomplete state by editors.

The Split action always involves the creation of a new concept. After a new concept is created in the Split modal dialog box, roles (semantic assertions between concepts) or properties may be dragged from the existing concept to the new concept. This partitions the existing information between the split concept and the new concept. However, if either concept requires additional editing, it must be done in a regular TDE editing panel after the split action is completed. The dialog box closes, and records are added to the editor’s TDE history table, only when the editor saves the split.

The Merge action always involves a concept retirement. In the Merge modal dialog box, concept data are displayed in trees as non-editable nodes. When the editor confirms the merge action, non-redundant roles and properties are automatically transferred from the retired concept to the resultant merged concept. As with a split, the dialog box closes, and records are added to the editor’s TDE history table, only when the editor saves the merge.

Concept retirement involves two steps, called Pre-retirement and Retirement. The pre-retirement panel displays a list of the sub-concepts of the retiring concept and the concepts that target the retiring concept via roles. For a concept to be eligible for retirement, all its sub-concepts must be re-treed and all role assertions that target it must be removed or retargeted. The editor uses the pre-retirement panel to perform these operations. When all preconditions are met, the editor can submit the concept for Retirement. A lead editor with sufficient privilege can then use a retirement panel to retire the concept at a later time. The retirement panel shows a non-editable tree containing concept definition information pertinent to the retiring concept. Up until this point, Pre-retirements can be reversed by an Undo action. When the lead editor commits the retirement to the database, the Retirement action is recorded in the editor’s TDE history table and becomes permanent.
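The retirement preconditions described above can be illustrated with a minimal Python sketch. The `Concept` class and its field names are hypothetical and are not part of the Apelon TDE; the sketch only encodes the rule that a concept with remaining sub-concepts or incoming role assertions is not yet eligible for Retirement.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Concept:
    """Hypothetical, simplified view of a concept in the pre-retirement panel."""
    code: str
    sub_concepts: Set[str] = field(default_factory=set)    # codes of direct children
    incoming_roles: Set[str] = field(default_factory=set)  # codes of concepts whose roles target this one


def retirement_blockers(concept: Concept) -> List[str]:
    """List what still has to be fixed before the concept can be submitted."""
    blockers = []
    for child in sorted(concept.sub_concepts):
        blockers.append(f"sub-concept {child} must be re-treed")
    for source in sorted(concept.incoming_roles):
        blockers.append(f"role from {source} must be removed or retargeted")
    return blockers


def can_submit_for_retirement(concept: Concept) -> bool:
    """A concept is eligible only when no blockers remain."""
    return not retirement_blockers(concept)
```

In this sketch an editor would call `retirement_blockers` repeatedly while re-treeing sub-concepts and retargeting roles, and submit the concept once the list is empty.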

History data in the development environment. The Ontylog© database schema was extended to include a concept history table. This table records TDE editing actions independently of the TDE and its associated workflow management tool. When edit operations are committed, the corresponding edit events are logged in this table. The structure of the concept history table used in the TDE environment is shown in Table 2.

Table 2. Structure of the TDE History Table

The TDE history table contains nine columns, as shown in Table 2. Each time an editor commits an edit action, a set of records is entered in this table. A sequential record number, the date and time, the identity of the edited Thesaurus database and the workstation identity are included in all records. The Concept_Code column holds the concept code of the concept participating in or affected by the editor’s action. The concept’s name is recorded in the Concept_Name column. The Action column indicates the action taken by the editor. Permissible values are: create, modify, split, merge and retire. The Reference_Code column provides information about successor or descendant concepts affected by the action. The Published column indicates whether an entry has been processed for publication in the DTS history table (below).
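As an illustration, a single row of this nine-column table could be modeled as follows. This is a sketch, not the Ontylog© schema: the field names `history_id`, `concept_code`, `concept_name`, `action`, `reference_code` and `published` mirror column names mentioned in the text, while `edit_timestamp`, `database_id` and `workstation_id` are assumed names for the columns that the text describes without naming. The example values are invented.

```python
from dataclasses import dataclass, fields
from datetime import datetime
from enum import Enum
from typing import Optional


class Action(Enum):
    """The five permissible values of the Action column."""
    CREATE = "create"
    MODIFY = "modify"
    SPLIT = "split"
    MERGE = "merge"
    RETIRE = "retire"


@dataclass(frozen=True)
class TdeHistoryRecord:
    history_id: int                # sequential record number
    edit_timestamp: datetime       # assumed column name: date and time of the edit
    database_id: str               # assumed column name: edited Thesaurus database
    workstation_id: str            # assumed column name: editor workstation
    concept_code: str              # Concept_Code
    concept_name: str              # Concept_Name
    action: Action                 # Action
    reference_code: Optional[str]  # Reference_Code, may be null
    published: bool = False        # Published flag


# Invented example row: creation of a concept, which carries no reference code.
record = TdeHistoryRecord(
    history_id=1,
    edit_timestamp=datetime(2003, 2, 3, 10, 15),
    database_id="editor_schema_1",
    workstation_id="ws-01",
    concept_code="C0001",
    concept_name="Oncogene ras",
    action=Action("create"),
    reference_code=None,
)
```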

The Reference_Code column may contain the concept code of a second concept participating in or affected by the editor’s action. Its contents vary according to the action performed. The value is always null if the action is Create or Modify. If the action is Split, three history entries are created: one for the newly created concept, with a null Reference_Code, and two for the split concept. In the first of these two entries the Reference_Code is the code of the newly created concept, which is needed to disambiguate the original concept; in the second it is the code of the split concept itself, which is a descendant of itself in the split action. Merge actions are handled much like Split actions. There are three history entries: two for the concept that is programmatically retired during the merge, and one for the "winning" concept. The Reference_Code is null in one of the entries for the retired concept, while in the second entry it holds the code of the "winning" concept; this entry thus points to the concept into which the concept in the Concept_Code column is being merged. The Reference_Code in the history entry of the "winning" concept is the same as its Concept_Code; essentially, the concept points to itself as a descendant in the Merge action. Finally, if the action is Retire, there are as many history entries as the concept has super-concepts. The Reference_Code column in these entries contains the concept codes of the parent concepts, one parent concept per history entry. End users with documents coded by a retired concept can select any appropriate concept to replace the deprecated one; however, they may be able to find a suitable replacement in the listing of the concept's parents at the time of retirement, which is maintained in the history table.
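The Reference_Code rules can be summarized in a short Python sketch. The function names and the concept codes used below are hypothetical. Because the text does not state which Action value each individual row carries, the sketch returns only (Concept_Code, Reference_Code) pairs, with `None` standing for a null Reference_Code.

```python
from typing import List, Optional, Tuple

# One history row, reduced to (Concept_Code, Reference_Code).
Row = Tuple[str, Optional[str]]


def create_or_modify_rows(concept: str) -> List[Row]:
    """Create and Modify never carry a reference code."""
    return [(concept, None)]


def split_rows(original: str, new: str) -> List[Row]:
    """Three rows: the new concept, then the split concept pointing at the
    new concept and at itself."""
    return [(new, None), (original, new), (original, original)]


def merge_rows(retired: str, winner: str) -> List[Row]:
    """Three rows: two for the retired concept (null, then the winner) and
    one for the winning concept pointing at itself."""
    return [(retired, None), (retired, winner), (winner, winner)]


def retire_rows(concept: str, parents: List[str]) -> List[Row]:
    """One row per super-concept held at the time of retirement."""
    return [(concept, parent) for parent in parents]
```

For example, `split_rows("C_RAS", "C_KRAS")` yields one row for `C_KRAS` with a null reference and two rows for `C_RAS`, one referencing `C_KRAS` and one referencing `C_RAS` itself.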

Effect of parallel multi-editor operation on history data. Due to the rapid rate of evolution in cancer science, we find it vital to do multi-editor modeling of the NCI Thesaurus. In our environment, each editor has his or her own schema. Each week a common baseline is loaded into each editor’s schema, so that at the start of the week all editors have the same baseline. The lead editor provides a work assignment to each editor. When the assignment is complete, the editor exports from his or her schema a change set that reflects only the concept edits for that assignment and sends it to the lead editor. The lead editor uses the Work Manager tool to detect and resolve conflicts among the change sets. The Work Manager then generates a new baseline and the cycle repeats. At the end of each month, the lead editor exports a baseline from the TDE environment. This baseline is imported into the DTS environment and becomes the monthly released version of the NCI Thesaurus.

Work management environment. When change sets contain conflicting models of a concept, the Work Manager software detects the conflict. The lead editor uses the Work Manager software to resolve the conflict. This always involves rejecting some or all of the modeling in one or more of the change sets. The lead editor also may have to make other changes to the baseline that he or she is preparing from the change sets and the previous baseline. Generally these changes are required to eliminate inconsistencies in the description logic that prevent classification.# When the lead editor has incorporated all pending change sets, resolved all conflicts and successfully classified the namespace, he or she exports a new baseline. Because not all the editing changes and actions submitted by the modelers to the lead editor make it into the final baseline published in the DTS, history information needs to be processed for the release version of the Thesaurus. This processing involves a comparison of the baseline against the previous release version, in conjunction with the TDE history table entries that have not yet been published. In effect, this process deletes the history records associated with the rejected editing in conflicting change sets.
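One way to picture this processing is as a filter over the unpublished history records. The Python sketch below rests on a strong simplifying assumption that is not stated in the text: a record survives only if its concept actually differs between the previous release and the new baseline. The dictionary keys and the representation of a baseline as a mapping from concept code to definition are hypothetical.

```python
from typing import Any, Dict, List, Set


def changed_codes(previous: Dict[str, Any], baseline: Dict[str, Any]) -> Set[str]:
    """Codes that are new in the baseline or whose definition differs from
    the previous release."""
    return {
        code
        for code, definition in baseline.items()
        if code not in previous or previous[code] != definition
    }


def publishable_records(
    records: List[Dict[str, Any]],
    previous: Dict[str, Any],
    baseline: Dict[str, Any],
) -> List[Dict[str, Any]]:
    """Keep unpublished records whose concept changed in the accepted baseline;
    records belonging to rejected edits drop out."""
    changed = changed_codes(previous, baseline)
    return [
        r for r in records
        if not r["published"] and r["concept_code"] in changed
    ]
```

A real implementation would need more than this, for instance to tell apart an accepted and a rejected edit of the same concept, which this filter cannot do.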

History data in Thesaurus releases. The process for exporting baseline data from the TDE to the DTS environment involves suppressing some of the information retained in the TDE history table. Information useful only to the lead editor, specifically the concept name, editor identification and editor workstation, is dropped. In addition, the publication flag is dropped, and the time stamp from the TDE is replaced with the date of the DTS baseline export.

The structure of the concept history table included in each release of the NCI Thesaurus™ is shown in Table 3.

Table 3. Structure of the DTS History Table

The History_ID column holds the sequence number of the edit action. (It corresponds to the History_ID in the TDE history table.) The Concept_Code column holds the concept code of the concept participating in or affected by the editor’s action. The Action column indicates the action taken by the editor. Permissible values are: create, modify, split, merge and retire. The Baseline_Date column records the date of the NCI Thesaurus™ release in which the action occurred. The Reference_Code column contains concept codes as explained in the description of the TDE history table, above.
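The reduction from a TDE history row to a DTS history row can be sketched in a few lines of Python. The five DTS column names are taken from the text; the TDE keys `Edit_Timestamp`, `Database_ID` and `Workstation_ID` are assumed names for columns that the text describes without naming, and the function name `to_dts_row` is hypothetical.

```python
from datetime import date
from typing import Any, Dict

DTS_COLUMNS = ("History_ID", "Concept_Code", "Action", "Baseline_Date", "Reference_Code")


def to_dts_row(tde_row: Dict[str, Any], baseline_date: date) -> Dict[str, Any]:
    """Project a TDE history row onto the DTS columns.

    Editor-only columns (concept name, database, workstation) and the
    Published flag are dropped, and the TDE time stamp is replaced by the
    date of the DTS baseline export.
    """
    return {
        "History_ID": tde_row["History_ID"],
        "Concept_Code": tde_row["Concept_Code"],
        "Action": tde_row["Action"],
        "Baseline_Date": baseline_date,
        "Reference_Code": tde_row["Reference_Code"],
    }
```

Building the output row explicitly, rather than deleting keys from the input, keeps any editor-only column from leaking into a release by accident.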

Access to the History information. We have extended the DTS API provided by Apelon to include a DTS history API. The Apelon and NCI-developed APIs are used internally at NCI but are not accessible to the general public, primarily because we wish to present a publicly accessible interface through caCORE. In order to provide developers outside of NCI access to the NCI Thesaurus™, NCI includes an ontology object, instantiated with classes and methods, in the caCORE© distribution3. Release 1.2 of caCORE© is scheduled for June 2003. It will include methods to access the history table data in the NCI Thesaurus™. Although the history information will be available through this API, for the most part history will be examined automatically by commonly used methods, and thus its use should be transparent to users.

For those who do not need to directly develop programs using the caCORE© API, but would like to have access to the NCI Thesaurus, the content of the Thesaurus is included in caCORE© releases. The Thesaurus is provided in several formats. The formats to be available in the caCORE 1.2 release are: pipe delimited ASCII, Ontylog© XML, OWL© and Protégé© XML.

DISCUSSION

Use of history information to improve retrieval. The earlier example of the ras genes illustrates how query construction using the history information supplied with the NCI Thesaurus™ can provide superior retrieval. Assume the concept Oncogene ras had been created in the January 2000 release of the Thesaurus, and that Oncogene ras had been split into KRAS Gene and HRAS Gene and then retired in the July 2001 release. Assume that the concept ras Family Gene was created in the release of July 2002, and that the KRAS Gene and HRAS Gene concepts were treed under it. If data related to KRAS Gene had been stored in a genomic repository over the period 2000 – 2002, and if the data had been tagged with retrieval keys from the Thesaurus, then some of the earlier data might have been tagged Oncogene ras, some KRAS Gene, and perhaps some ras Family Gene. An explicit query for KRAS Gene covering the period 2001 through 2002 would miss any data tagged with ras Family Gene. However, the query application could infer from the history data that during this period there were two valid tags that might have been used, and so it could expand the query to search for both KRAS Gene and ras Family Gene. Similarly, if the period covered were 2000 through 2002, the query could be expanded to Oncogene ras, KRAS Gene and ras Family Gene. This would ensure that the explicit query for KRAS Gene recovers data keyed with all concepts valid at various periods of time, even though the concepts denote different levels of granularity.
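The ras example can be worked through in a small Python sketch of query expansion. Several things here are assumptions made for illustration only: concept names stand in for concept codes, the parent `Gene` in the retire row is invented, the action recorded for the concepts created by the split is assumed to be `create`, the current hierarchy is supplied as a separate `PARENTS` mapping, and the period "2001 through 2002" is read as starting with the July 2001 release. A concept is treated as valid from its first appearance in the history until the release in which it is retired.

```python
from datetime import date
from typing import Dict, List, Optional, Set, Tuple

# (Concept_Code, Action, Baseline_Date, Reference_Code), following the DTS history table.
HistoryRow = Tuple[str, str, date, Optional[str]]

HISTORY: List[HistoryRow] = [
    ("Oncogene ras", "create", date(2000, 1, 1), None),
    ("KRAS Gene", "create", date(2001, 7, 1), None),      # assumed action value
    ("Oncogene ras", "split", date(2001, 7, 1), "KRAS Gene"),
    ("Oncogene ras", "split", date(2001, 7, 1), "Oncogene ras"),
    ("HRAS Gene", "create", date(2001, 7, 1), None),      # assumed action value
    ("Oncogene ras", "split", date(2001, 7, 1), "HRAS Gene"),
    ("Oncogene ras", "retire", date(2001, 7, 1), "Gene"),  # hypothetical parent
    ("ras Family Gene", "create", date(2002, 7, 1), None),
    ("KRAS Gene", "modify", date(2002, 7, 1), None),
    ("HRAS Gene", "modify", date(2002, 7, 1), None),
]

# Current hierarchy, assumed to come from the Thesaurus itself rather than the history.
PARENTS: Dict[str, List[str]] = {
    "KRAS Gene": ["ras Family Gene"],
    "HRAS Gene": ["ras Family Gene"],
}


def validity(history: List[HistoryRow]) -> Dict[str, Tuple[date, Optional[date]]]:
    """Map each concept to (first release seen, release retired or None)."""
    first_seen: Dict[str, date] = {}
    retired: Dict[str, date] = {}
    for code, action, when, _ref in history:
        if code not in first_seen or when < first_seen[code]:
            first_seen[code] = when
        if action == "retire":
            retired[code] = when
    return {code: (first_seen[code], retired.get(code)) for code in first_seen}


def predecessors(code: str, history: List[HistoryRow]) -> Set[str]:
    """Concepts that were split or merged into `code` (one level only)."""
    return {
        c for c, action, _when, ref in history
        if action in ("split", "merge") and ref == code and c != code
    }


def expand_query(code: str, start: date, end: date,
                 history: List[HistoryRow], parents: Dict[str, List[str]]) -> Set[str]:
    """Return the tags that were valid at some point in [start, end]."""
    candidates = {code} | predecessors(code, history) | set(parents.get(code, []))
    spans = validity(history)
    result = set()
    for c in candidates:
        if c not in spans:
            continue
        valid_from, retired_on = spans[c]
        if valid_from <= end and (retired_on is None or retired_on > start):
            result.add(c)
    return result
```

Under these assumptions, `expand_query("KRAS Gene", date(2001, 7, 1), date(2002, 12, 31), HISTORY, PARENTS)` gives KRAS Gene and ras Family Gene, and moving the start back to `date(2000, 1, 1)` adds Oncogene ras, matching the two cases described above. The sketch follows predecessors only one step back; a chain of successive splits or merges would require repeating the lookup.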

Conclusion. We have extended the TDE© editing software to produce concept-level history data. Some of the implementation details of the concept-tracking history mechanism described here are specific to the Apelon tools; however, none of the crucial functionality depends on proprietary software, and it should be relatively straightforward to implement in other editing environments. We are now extending the Apelon Distributed Terminology Server© and the NCI caCORE© to support application programming interface access to the history data. We are also extending the flat file format releases of the Thesaurus to include the history table data. These development efforts are expected to be complete in time for the caCORE© 1.2 release scheduled for June 2003.

Footnotes
#The Apelon Ontylog© dialect of description logic makes no distinction between classes and instances. Hence there are differences between Ontylog© and other description logic dialects that affect classification.
REFERENCES
1. Cimino, JJ. Desiderata for Controlled Medical Vocabularies in the Twenty-First Century. Methods of Information in Medicine. 1998;37(4-5):394–403.
2. Mulshine, JL; Jett, M; Cuttitta, F; Treston, AM; Quinn, K; Scott, F; Iwai, N; Avis, I; Linnoila, RI; Shaw, GL. Scientific Basis for Cancer Prevention. Intermediate cancer markers. Cancer. 1993;72(3 Suppl):978–983.
3. Covitz, PA; Hartel, FW; Schaefer, C; de Coronado, S; Fragoso, G; Sahni, H; Gustafson, S; Buetow, K. caCORE: A Common Infrastructure for Cancer Informatics. Bioinformatics, submitted.
4. Campbell, KE. Distributed Development of a Logic-Based Controlled Terminology. Dissertation Abstracts International, 01599569, 1997.