*167. Automated Textbook Indexing for Full Text Information Retrieval

DC Berrios, VA Palo Alto Health Care System

Objectives: To evaluate a statistically based method of generating sentence-level indexing of medical textbooks, based on identified UMLS concepts and on query and vector-space models.

Methods: In addition to identifying concepts in full text, we designed a new method to capture contextual information from HTML-formatted documents. We used full-text concept identification methods to extract UMLS concepts from HTML headings, and then stored this contextual information as XML-compliant HTML for each sentence in the source document.

We then adapted Salton's query-document vector-based information retrieval model for use in proposing document markup. This model creates vectors for all documents in a collection and for any query that users compose. Each dimension of a vector corresponds to a unique term in a document (or query). For each sentence, we computed a closeness measure between the sentence vector and each of the twelve query prototype vectors, and ranked the prototypes by that measure.

To evaluate the performance of the vector-space markup proposal model, we asked two domain experts, A and B, to mark up a chapter from a textbook of Infectious Disease. The chapter consisted of 70 paragraphs and 242 sentences. One expert used the automated markup tool (which displayed automatically identified concepts but did not propose which query prototype to index), and the other created markup manually on paper using the same prototypes. The two indexers created a total of 305 instances of markup. When we excluded from the analysis instances that were discrepant because one indexer did not create markup, kappa increased to 0.82, indicating a large amount of agreement when both indexers felt markup was essential. We selected as a gold standard the 120 instances of consensus markup from the two indexers.
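The vector-space ranking step described above can be sketched roughly as follows. This is a minimal illustration, not the system's actual implementation: the abstract does not specify the term weighting or the closeness measure, so the sketch assumes raw term-frequency vectors and a plain inner product. The function names, the concept labels, and the two example prototypes are all hypothetical.

```python
from collections import Counter


def to_vector(concepts):
    """Build a sparse term-frequency vector: one dimension per unique concept."""
    return Counter(concepts)


def closeness(u, v):
    """Inner product of two sparse vectors.

    Assumption: the abstract does not name its closeness measure;
    an unnormalized inner product is used here purely for illustration.
    """
    if len(u) > len(v):
        u, v = v, u  # iterate over the smaller vector
    return sum(weight * v[term] for term, weight in u.items())


def rank_prototypes(sentence_concepts, heading_concepts, prototypes):
    """Rank query prototypes by closeness to one sentence.

    The sentence vector combines concepts found locally in the sentence
    with contextual concepts taken from the enclosing headings.
    """
    sentence_vec = to_vector(list(sentence_concepts) + list(heading_concepts))
    scores = [
        (name, closeness(sentence_vec, to_vector(concepts)))
        for name, concepts in prototypes.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical prototypes and concepts, for illustration only.
prototypes = {
    "pharmacokinetics": ["drug", "half_life", "clearance"],
    "drug_susceptibility": ["organism", "drug", "susceptibility"],
}
ranking = rank_prototypes(
    sentence_concepts=["drug", "susceptibility"],
    heading_concepts=["organism"],
    prototypes=prototypes,
)
```

In this toy case the heading concept raises the score of the susceptibility prototype, which shows how contextual concepts can change the ranking even when the sentence itself mentions only some of the prototype's terms.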

Results: The indexing system proposed the correct set of concepts in the form of a query prototype 71% of the time. The mean closeness for each of the twelve query prototypes ranged from 0.016 for query prototype four (a pharmacokinetics query) to 1.79 for query prototype eight (a drug-susceptibility query). The average top-ranked closeness measures ranged from 1.91 to 3.39 for the twelve query prototypes. The mean top-ranked closeness measure for sentences that had no consensus markup was not significantly different from that for sentences with consensus markup (3.02 vs. 3.03). Using contextual information increased the number of correctly proposed markup instances sevenfold. The correct query prototype was ranked first or second in 79% of cases.
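Figures such as "ranked first or second" are top-k hit rates against the gold standard. The sketch below shows one way such a rate could be computed. The function name and the example data are hypothetical and do not reproduce the study's data.

```python
def top_k_hit_rate(ranked_proposals, gold_labels, k):
    """Fraction of instances whose gold prototype is among the top k proposals.

    ranked_proposals: one list per instance, best proposal first.
    gold_labels: the consensus prototype for each instance, in the same order.
    """
    if len(ranked_proposals) != len(gold_labels):
        raise ValueError("each instance needs exactly one gold label")
    if not gold_labels:
        raise ValueError("at least one instance is required")
    hits = sum(
        1
        for ranking, gold in zip(ranked_proposals, gold_labels)
        if gold in ranking[:k]
    )
    return hits / len(gold_labels)


# Hypothetical data: four markup instances, prototypes named by number.
proposals = [[8, 4, 1], [4, 8, 2], [2, 8, 4], [1, 2, 8]]
gold = [8, 8, 2, 8]

top1 = top_k_hit_rate(proposals, gold, k=1)
top2 = top_k_hit_rate(proposals, gold, k=2)
```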

Conclusions: Our results suggest that an automated markup proposal system can perform well on full-text medical information. Both contextual concepts and those identified locally in sentences were essential to predicting markup accurately. We anticipate this automation will greatly speed the indexing process to the point where rapid, accurate indexing of large full-text documents for electronic publishing becomes feasible.

Impact: Enabling clinicians to perform more precise full-text information retrieval is an essential component of effective health care delivery. Increasing the speed and efficiency with which health care practitioners access medical information will result in better, faster patient care.