From: Won Kim [wonkim@topaz.nlm.nih.gov]
Sent: Friday, December 17, 1999 1:21 PM
To: ncbi-seminar@topaz.nlm.nih.gov
Cc: wonkim@topaz.nlm.nih.gov
Subject: CBB seminar 11 am, Tuesday, Dec. 21

In Bldg. 38A, 8th floor conference room at 11 am, Tuesday, Dec. 21.

Title: Identification of Content Bearing Phrases
by Won Kim and W. John Wilbur

Abstract: Links from MEDLINE phrases to the subsections of Molecular Biology of the Cell (published by Garland Publishing Inc) are now available on the web. This project prompted us to study how to extract content bearing phrases from MEDLINE records. High content bearing phrases are marked in MEDLINE documents, and each marked phrase may serve as a hyperlink to other text.

First, we briefly review our previous work on extracting useful phrases. There the emphasis was on finding useful phrases that can be found in UMLS, but these are not necessarily content bearing phrases. Next, we examine statistical techniques for identifying content bearing phrases in a natural language database. We develop three different methods for scoring phrases, with the aim that a phrase carrying high content receives a high score. We sampled 1,002 high scoring phrases from previous scoring methods and asked a human expert to rate them. Using the expert's ratings, we measure the performance of each scoring method and obtain an optimal scoring method that combines all three.