SemCat: Semantically Categorized Entities for Genomics

AMIA Annu Symp Proc. 2006; 2006: 754–758.

PMCID: PMC1839293

Copyright: This is an Open Access article. Verbatim copying and redistribution of this article are permitted in all media for any purpose.

Lorraine Tanabe, PhD, Lynne H. Thom, PhD,^† Wayne Matten, PhD,^† Donald C. Comeau, PhD, and W. John Wilbur, MD, PhD

National Center for Biotechnology Information (NCBI), National Library of Medicine, Bethesda, MD 20894

^† Consolidated Safety Services, Inc., Scientific Support Division, Fairfax, VA 22030

Abstract

We describe the construction of a semantic database called SemCat consisting of a large number of semantically categorized names relevant to genomics. SemCat can be used to facilitate natural language processing in MEDLINE. We present suitable application areas including biomedical name classification and named entity recognition.

INTRODUCTION

Natural language processing (NLP) in the biomedical domain requires knowledge-rich sources of domain information. The Unified Medical Language System (UMLS®) Semantic Network [1, 2] can provide a solid framework on which to build biomedical subdomain-specific resources for genomics NLP. We have taken this approach and constructed the SemCat database, based on a subset of the UMLS Semantic Network enriched with categories from the GENIA Ontology [3] and a few new semantic types. SemCat contains over 5 million entities compiled from knowledge sources including the UMLS, GENIA, UniProt [4], the Gene Ontology (GO) [5], Entrez Gene [6], ProtScan [7], ChemID [8], the NCBI taxonomy database [9], the Brown corpus [10], the Wall Street Journal corpus [11], the Candida Genome Database [12], WormBase [13], FlyBase [14], the Saccharomyces Genome Database [15], and others [16].

Many users have modified the UMLS Semantic Network for their own research. For example, Yu et al. [17] found that it was missing critical components in the genomics domain, and added six new semantic types including Protein Structure and Chemical Complex. Zhang et al. [18] found that new links between semantic types were necessary, and constructed the Enriched Semantic Network (ESN) using a multiple subsumption directed acyclic graph. In this paper, we use the Semantic Network as a framework for the categorization of named entities in MEDLINE.

METHODOLOGY

We found that a subset of the UMLS Semantic Network would be sufficient for gene and protein name classification, and added a few new semantic types for better coverage. We shifted some semantic types from suboptimal nodes to ones that made more sense from a genomics standpoint. The resulting SemCat Physical Object hierarchy is shown in Figure 1. Similar hierarchies exist for Event and Conceptual Entity. Example coverage of sizeable SemCat semantic types is given in Table 1. Currently, SemCat encompasses 77 semantic types, and 5.11M non-unique entries.

Figure 1

SemCat Physical Object Hierarchy. White = UMLS; Light grey = GENIA; Dark grey = NEW.

Table 1

Knowledge sources of the largest Semantic Types in SemCat. ATCC = American Type Culture Collection; GO = Gene Ontology; Patterns = Regular Expressions; WWW = website data.

Pattern Matching. Our original motivation for constructing SemCat was to compile training data for machine learning algorithms for biomedical named entity recognition (NER). A certain level of circularity was unavoidable: in order to build programs to tag named entities, we needed a database of tagged named entities. Our goal then was to rapidly expand SemCat with additional named entities from MEDLINE, without using sophisticated natural language processing. Using domain expertise, we manually generated 205 noun phrase “indicator” patterns (Table 2), and extracted 402K MEDLINE terms for 37 SemCat types. The patterns were designed to be as unambiguous as possible. For example, in the pattern “X cells,” X can refer to a gene (“p53 cells”), but not in “parental X cells.” After applying a filter for mismatched parentheses and generic terms, and requiring at least one noun to be present, we retained 10K entities not yet in SemCat.

Table 2

Indicator patterns for additional named entities in MEDLINE.
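The indicator-pattern idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the 205 patterns or the filters actually used: the regular expression, the `GENERIC_TERMS` list, and the function names are hypothetical, and the noun requirement is omitted because it needs a part-of-speech tagger.

```python
import re

# Illustrative version of the "X cells" indicator pattern. The negative
# lookbehind skips the ambiguous "parental X cells" context from the text.
X_CELLS = re.compile(r"(?<!parental )\b([A-Za-z0-9][A-Za-z0-9-]*) cells\b")

# Hypothetical generic-term list; the curated list is not given in the paper.
GENERIC_TERMS = {"the", "these", "those", "other", "all"}


def balanced_parens(term):
    """Return True if every parenthesis in the term is properly matched."""
    depth = 0
    for ch in term:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def extract_candidates(text):
    """Collect X from "X cells", dropping generic or malformed candidates."""
    candidates = []
    for match in X_CELLS.finditer(text):
        x = match.group(1)
        if x.lower() in GENERIC_TERMS or not balanced_parens(x):
            continue
        candidates.append(x)
    return candidates


sentence = "Expression was higher in p53 cells than in parental HeLa cells."
print(extract_candidates(sentence))
```

Here "p53" is kept, while "HeLa" is skipped because it follows "parental".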

Generic Entity Filter. Many SemCat entities are non-specific and are therefore less useful for natural language processing. For example, in protein interaction extraction, “protein inhibits gene” is uninformative, whereas “p53 inhibits MDM2” is useful. To flag these terms in SemCat, lists of generic entities were manually compiled for non-gene-related SemCat types. Gene-related generic entities were generated using a probabilistic context-free grammar (PCFG), followed by manual inspection (Figure 2). A PCFG is a context-free grammar whose production rules carry probabilities. The generic entity lists are used to filter SemCat as follows (L denotes a generic entity list; L = G for gene-related entities):

Figure 2

Generic gene-related entity identification. The PCFG was trained on SemCat to recognize gene and protein names.

If an entity is an exact match to a phrase in L, save it as generic (*.gen).
If an entity consists entirely of terms in L, and is at most two words long, save it as generic (*.gen).
If an entity consists entirely of terms in L, and is more than two words long, save it as possibly generic (*.mgen).
If an entity matches a regular expression for generic entities, save it as generic (*.gen).
Otherwise, save the entity as specific (*.spec).

Using this method, SemCat entities are subcategorized into generic (*.gen), possibly generic (*.mgen) and specific (*.spec) subsets (examples shown in Table 3).
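The filtering rules above can be sketched as a single function that applies them in order. This is a minimal illustration, assuming that "terms in L" means whitespace-separated words, that matching is case-insensitive, and that one regular expression stands in for the generic-entity expressions; the list contents and the pattern below are hypothetical.

```python
import re


def classify_entity(entity, generic_list, generic_regex=None):
    """Return "gen", "mgen" or "spec" by applying the five rules in order."""
    normalized = entity.lower().strip()
    words = normalized.split()
    # Rule 1: exact match to a phrase in L.
    if normalized in generic_list:
        return "gen"
    # Rules 2 and 3: every word is in L; length decides gen vs. mgen.
    if words and all(word in generic_list for word in words):
        return "gen" if len(words) <= 2 else "mgen"
    # Rule 4: match against a regular expression for generic entities.
    if generic_regex is not None and generic_regex.fullmatch(normalized):
        return "gen"
    # Rule 5: everything else is specific.
    return "spec"


# Hypothetical generic list and pattern, for illustration only.
L = {"protein", "gene", "receptor", "binding", "factor"}
GENERIC_RE = re.compile(r"(?:gene|protein)s")

for name in ["protein", "binding protein", "receptor binding protein", "genes", "p53"]:
    print(name, "->", classify_entity(name, L, GENERIC_RE))
```

The return values correspond to the *.gen, *.mgen and *.spec subsets.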

Table 3

Examples of SemCat entities automatically subcategorized into generic (*.gen) and possibly generic (*.mgen) subsets.

Interannotator Agreement on Missing Entities. SemCat is by no means a comprehensive set of biomedical entities in MEDLINE. To increase the coverage of MEDLINE terms in SemCat, we extracted 9,323 terms that occur frequently in MEDLINE, but do not co-occur strongly with SemCat terms, for manual curation. Annotation was based on the first five abstracts retrieved by a PubMed search.

Because of the large number of categories (154 from 77 types, each with a GENERIC option), we expected interannotator agreement to be low. We studied 100 terms using the “key-to-response” method, where one annotator’s tags serve as a key against which the others are evaluated (see Table 4). We found that removing the GENERIC option improved interannotator agreement. Most of the categorizations make sense and reflect the bias of each annotator’s biological background.

Table 4

Interannotator agreement (F-score) using the first column annotator as the key for each row. Annotator #1 - Medicine, #2 - Molecular Biology, #3 - Genetics, #4 - Biochemistry. Shaded scores do not use the GENERIC prefix.
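A key-to-response comparison of this kind can be sketched as follows. This is a minimal illustration, assuming each annotator's tags are stored as a mapping from term to a set of tags and that the generic option is written as a `GENERIC:` prefix; the terms and tags below are invented and are not taken from Table 4 or Table 5.

```python
def strip_generic(annotations, prefix="GENERIC:"):
    """Return a copy of the annotations with the (assumed) prefix removed."""
    return {
        term: {tag[len(prefix):] if tag.startswith(prefix) else tag for tag in tags}
        for term, tags in annotations.items()
    }


def key_to_response_f(key, response):
    """F-score of a response annotator, treating the key annotator as truth."""
    matched = sum(len(tags & response.get(term, set())) for term, tags in key.items())
    n_key = sum(len(tags) for tags in key.values())
    n_response = sum(len(tags) for tags in response.values())
    if matched == 0:
        return 0.0
    precision = matched / n_response
    recall = matched / n_key
    return 2 * precision * recall / (precision + recall)


# Invented annotations for two annotators.
key = {"p53": {"Gene"}, "buffer": {"GENERIC:Chemical"}, "staining": {"Laboratory Procedure"}}
response = {"p53": {"Gene"}, "buffer": {"Chemical"}, "staining": {"Research Activity"}}

with_generic = key_to_response_f(key, response)
without_generic = key_to_response_f(strip_generic(key), strip_generic(response))
print(round(with_generic, 3), round(without_generic, 3))
```

In this toy data, dropping the prefix turns the "buffer" disagreement into a match, so the score rises.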

For example, consider the tags provided for the term absorbance in Table 5. This apparent lack of agreement actually reflects the different semantic senses of absorbance in biomedical text. Several decades of research on interannotator consistency in information retrieval have produced values of indexing consistency in this range (35–45% for experienced indexers using controlled vocabularies) [19]. The overall consistency for MEDLINE headings, subheadings and identifiers was reported to be 34% [20]. Final categorization can be done either by a simple voting procedure or by keeping every categorization assigned by any annotator, which captures the terminological senses of the biomedical subdomains and the level of ambiguity.

Table 5

Interannotator agreement example. Annotators #1–4 are identical to those in Table 4.
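The two final-categorization options mentioned above, voting and keeping all senses, can be sketched as below. This is an illustration only: the tag names are hypothetical, and returning no winner on a tie is an assumption, since the text does not say how ties are resolved.

```python
from collections import Counter


def majority_vote(tags):
    """Return the most frequent tag, or None when the top count is tied."""
    counts = Counter(tags)
    top_count = max(counts.values())
    winners = [tag for tag, count in counts.items() if count == top_count]
    return winners[0] if len(winners) == 1 else None


def all_senses(tags):
    """Keep every distinct tag assigned by any annotator."""
    return sorted(set(tags))


# Hypothetical tags from four annotators for one term.
tags = ["Quantitative Concept", "Quantitative Concept", "Laboratory Procedure", "Phenomenon"]
print(majority_vote(tags))
print(all_senses(tags))
```

Voting yields a single category per term, while the union keeps the ambiguity visible to downstream tools.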

APPLICATIONS

Models for Named Entities. We used SemCat as training data to investigate named entity classification techniques. We generated a statistical language model and probabilistic context-free grammar (PCFG) for gene and protein name classification. The SemCat-trained language model achieved F-values (the harmonic mean of Precision and Recall) of 0.944, 0.945 and 0.943, and the PCFG achieved F-values of 0.952, 0.952 and 0.952 using three-fold cross validation.

Named Entity Recognition. SemCat can be used to improve the results of biomedical NER systems. Specifically, SemCat entities can be used as gazetteers (alphabetic descriptive lists), which have proven to be useful in biomedical NER [21–23]. At BioCreative 2004, the systems with 80% or higher F-scores had post-processing stages using gazetteers [24]. It is straightforward to combine several SemCat types into a single gazetteer, which can be customized for named entity definitions. In BioCreative Task 1A, the definition of a gene/protein entity was broad [25]; therefore, many gene- and protein-related SemCat entities can be combined into a useful gazetteer for BioCreative-type tasks. For other NER tasks, finer-grained gazetteers can be constructed.
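Combining several types into one gazetteer can be sketched as follows. This is a minimal illustration, assuming SemCat has been loaded into a mapping from semantic type to a set of names; the type names, the entries, and the greedy longest-match lookup are assumptions and do not describe the SemCat file format or any BioCreative system.

```python
def build_gazetteer(semcat, types):
    """Merge the entries of the selected semantic types into one lowercase set."""
    gazetteer = set()
    for semantic_type in types:
        gazetteer.update(name.lower() for name in semcat.get(semantic_type, ()))
    return gazetteer


def find_mentions(tokens, gazetteer, max_len=3):
    """Greedy longest-match lookup; returns (start, end) token spans."""
    spans = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if " ".join(tokens[i:i + n]).lower() in gazetteer:
                spans.append((i, i + n))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return spans


# Hypothetical excerpt of SemCat, keyed by illustrative type names.
semcat = {
    "Gene": {"p53", "MDM2"},
    "Protein": {"tumor necrosis factor"},
    "Cell": {"HeLa"},
}
gene_protein = build_gazetteer(semcat, ["Gene", "Protein"])
tokens = "p53 inhibits MDM2 but not tumor necrosis factor in HeLa".split()
print(find_mentions(tokens, gene_protein))
```

A finer-grained gazetteer is obtained by passing a narrower list of types, for example only `["Gene"]`.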

CONCLUSION

We have presented the SemCat database of biomedical entities, which is based on a genomics-rich subset of the UMLS Semantic Network. SemCat contains over 5M biomedical entities, and is being supplemented with additional expertly annotated MEDLINE terms. We have shown that SemCat can be used for training, testing and evaluating machine learning algorithms, and anticipate that it will be useful for biomedical NER, word sense disambiguation and semantic interpretation. SemCat can facilitate biomedical text mining by providing an entry point into the UMLS Semantic Network for many named entities in MEDLINE. This link makes much of the functionality of the UMLS Semantic Network, including semantic relationships and hierarchical structure, immediately accessible to SemCat entities in MEDLINE.

AVAILABILITY

SemCat flat files are available at: ftp.ncbi.nlm.nih.gov/pub/tanabe/SemCat/.

This is a smaller version of SemCat (4.56M entities) due to licensing issues.

ACKNOWLEDGEMENTS

This research was supported in part by the Intramural Research Program of the NIH, National Library of Medicine. We thank Katie Grossman and Luis Martarano for annotation, and Natalie Xie for the annotation web interface.

REFERENCES

1. Lindberg, DAB; Humphreys, BL; McCray, AT. The unified medical language system. Methods of Information in Medicine. 1993;32:281–291.

2. McCray, AT; Nelson, SJ. The representation of meaning in the UMLS. Methods of Information in Medicine. 1995;34(1–2):193–201.

3. Kim, J-D; Ohta, T; Tateisi, Y; Tsujii, J-i. GENIA corpus--semantically annotated corpus for bio-textmining. Bioinformatics. 2003;19(Suppl 1):i180–2.

4. Bairoch, A, et al. The universal protein resource (UniProt). Nucleic Acids Res. 2005;33:D154–159.

5. The Gene Ontology Consortium. Gene ontology: Tool for the unification of biology. Nature Genetics. 2000 May;25:25–29.

6. Maglott, D; Ostell, J; Pruitt, KD; Tatusova, T. Entrez Gene: Gene-centered information at NCBI. Nucleic Acids Res. 2005;33:D54–8.

7. Egorov, S; Yuryev, A; Daraselia, N. A simple and practical dictionary-based approach for identification of proteins in MEDLINE abstracts. J Am Med Inform Assoc. 2004;11(3):174–178.

8. Wexler, P. The U.S. National Library of Medicine's toxicology and environmental health information program. Toxicology. 2004;198(1–3):161–8.

9. Benson, DA; Karsch-Mizrachi, I; Lipman, DJ; Ostell, J; Wheeler, DL. GenBank. Nucleic Acids Res. 2003;31(1):23–7.

10. Francis, W; Kucera, H. Frequency analysis of English usage: Lexicon and grammar. Boston, MA: Houghton Mifflin; 1982.

11. Marcus, M; Santorini, B; Marcinkiewicz, M. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics. 1993;19:313–330.

12. Arnaud, M, et al. The Candida Genome Database (CGD), a community resource for Candida albicans gene and protein information. Nucleic Acids Res. 2005;33:D358–63.

13. Schwarz, EM, et al. WormBase: Better software, richer content. Nucleic Acids Res. 2006;34:D475–D478.

14. Drysdale, RA; Crosby, MA; The FlyBase Consortium. FlyBase: Genes and gene models. Nucleic Acids Res. 2005;33:D390–D395.

15. Balakrishnan, R, et al. Fungal BLAST and model organism BLASTP best hits: New comparison resources at the Saccharomyces Genome Database (SGD). Nucleic Acids Res. 2005;33:D374–7.

16. Krause, R; von Mering, C; Bork, P. A comprehensive set of protein complexes in yeast: Mining large scale protein-protein interaction screens. Bioinformatics. 2003;19(15):1901–8.

17. Yu, H; Friedman, C; Rzhetsky, A; Kra, P. Representing genomic knowledge in the UMLS semantic network. Proc AMIA Symp. 1999:181–5.

18. Zhang, L; Perl, Y; Halper, M; Geller, J; Cimino, JJ. An enriched unified medical language system semantic network with a multiple subsumption hierarchy. J Am Med Inform Assoc. 2004;11(3):195–206.

19. Saracevic, T. Individual differences in organizing, searching, and retrieving information. In: Proceedings of the 54th Annual ASIS Meeting. Washington, D.C.: Learned Information, Inc; 1991.

20. Funk, ME; Reid, CA; McGoogan, LS. Indexing consistency in MEDLINE. Bulletin of the Medical Library Association. 1983;71(2):176–183.

21. Kinoshita, S; Cohen, KB; Ogren, PV; Hunter, L. BioCreative task1A: Entity identification with a stochastic tagger. BMC Bioinformatics. 2005;6(Suppl 1):S4.

22. Finkel, J, et al. Exploring the boundaries: Gene and protein identification in biomedical text. BMC Bioinformatics. 2005;6(Suppl 1):S5.

23. McDonald, R; Pereira, F. Identifying gene and protein mentions in text using conditional random fields. BMC Bioinformatics. 2005;6(Suppl 1):S6.

24. Yeh, A; Morgan, A; Colosimo, M; Hirschman, L. BioCreative task 1A: Gene mention finding evaluation. BMC Bioinformatics. 2005;6(Suppl 1):S2.

25. Tanabe, L; Xie, N; Thom, LH; Matten, W; Wilbur, WJ. GENETAG: A tagged corpus for gene/protein named entity recognition. BMC Bioinformatics. 2005;6(Suppl 1):S3.
