Generation of a Large Gene/Protein Lexicon by Morphological Pattern Analysis
Lorrie Tanabe
11:00 am, Tuesday, September 30
5th Floor Conference Room

The identification of gene/protein names in biomedical text is an important named entity recognition (NER) problem. It is a necessary first step towards semantic analysis and text mining, and can also aid information retrieval. Dictionaries can improve the performance of gene/protein NER; however, currently available lexicons are incomplete. In previous work, we processed MEDLINE documents to obtain a collection of over 2 million names, of which we estimate two thirds are valid gene/protein names. This talk focuses on an approach to purifying this large set of names by generating classes of names with common morphological features. Within each class, inductive logic programming (ILP) is applied to learn which features are most predictive of gene/protein names. In total, 193 classes and ILP theories were generated and applied to the 2 million names, yielding a subset of 1.2 million names. A simple false positive filter eliminated 8% of these, leaving 1.14 million entries in the final lexicon, which is composed of 82% (+/- 3%) complete and accurate gene/protein names, 12% names related to genes/proteins (generic names, partial names, etc.), and 6% names unrelated to genes/proteins.
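
To make the idea of grouping names into classes with common morphological features concrete, here is a minimal Python sketch. The abstract does not say which features define a class, so the character-shape signature used here (uppercase run, lowercase run, digit run, punctuation kept as-is) is purely an assumption for illustration, and the function names `morph_signature` and `group_by_signature` are hypothetical. The ILP step that learns predictive features within each class is not shown.

```python
from collections import defaultdict


def morph_signature(name):
    """Collapse a name into a coarse shape string.

    Assumed feature set (not from the talk): runs of uppercase letters
    become 'A', runs of lowercase letters 'a', runs of digits '0';
    any other character (e.g. '-' or '/') is kept unchanged.
    """
    shape = []
    for ch in name:
        if ch.isupper():
            sym = "A"
        elif ch.islower():
            sym = "a"
        elif ch.isdigit():
            sym = "0"
        else:
            sym = ch
        # Collapse consecutive identical symbols into one.
        if not shape or shape[-1] != sym:
            shape.append(sym)
    return "".join(shape)


def group_by_signature(names):
    """Group candidate names into classes that share a signature."""
    classes = defaultdict(list)
    for name in names:
        classes[morph_signature(name)].append(name)
    return dict(classes)


# Toy candidate names, chosen only for illustration.
names = ["p53", "p21", "BRCA1", "BRCA2", "IL-2", "kinase"]
classes = group_by_signature(names)
for signature, members in sorted(classes.items()):
    print(signature, members)
```

Under this assumed signature, "p53" and "p21" fall into one class and "BRCA1" and "BRCA2" into another. A classifier could then be trained separately within each class, which is the role the abstract assigns to ILP.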