Availability: http://www.ebi.ac.uk/tc-test/textmining/medevi/
Contact: kim/at/ebi.ac.uk
zik, P.
Bioinformatics. 2008 June 1; 24(11): 1410–1412. Published online 2008 April 9. doi: 10.1093/bioinformatics/btn117. | PMCID: PMC2387223 |
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
When exploring biomedical literature for information relevant to our research, we rely heavily on search engines (e.g. PubMed) that return documents matching keyword-based queries. When a query consists of multiple keywords or terms, the positional distance between occurrences of the terms in a document needs to be restricted. If the terms are found too far from each other in the text, it is very likely that the text does not describe any relationship between the concepts denoted by the terms, at least not explicitly by means of the terms given. We regard this positional restriction as crucial in seeking relational information, for example, when users attempt to find textual evidence of relations between given concepts in the literature. We provide a novel tool to address this need with a special focus on the biomedical domain.
The tool presented here, named MedEvi, is a search engine that retrieves occurrences matching a given query together with their local context. It is inspired by keyword-in-context (KWIC) concordancers, which have over the last few decades revolutionized the field of lexicography, where the different senses of dictionary entries have to be defined in their authentic usage context (Sinclair, 1991). We believe that a concordancer is a good candidate for the information-seeking task described above, since it inherently deals with the local context of matching occurrences, where the evidence being sought is much more likely to be found than in other parts of the retrieved documents.
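To make the keyword-in-context idea concrete, the following is a minimal sketch of a single-term concordancer. It is not MedEvi's implementation; the function name `kwic`, the `width` parameter and the sample text are assumptions introduced only for illustration.

```python
import re

def kwic(text, keyword, width=30):
    """Return one aligned line per occurrence of `keyword` in `text`."""
    lines = []
    # Whole-word, case-insensitive match of the keyword.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-justify the left context so that the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines

# Invented sample text, not taken from the literature.
sample = "The ada gene is discussed here. Later the text mentions Ada again."
for line in kwic(sample, "ada", width=20):
    print(line)
```

Each output line places the matched keyword at the same column, so the surrounding context of all occurrences can be scanned at a glance.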
The common limitation of existing concordancers, however, is that they consider only single-term queries. To deal with multiple-term queries effectively, we implement the positional restriction on top of a concordancer. This feature of MedEvi is similar to the concept of a proximity query (Baeza-Yates and Ribeiro-Neto, 1999), as implemented, for example, in the proximity search of Lucene queries and in the adjacency operator of OVID database queries. The difference is that a proximity operator is applied only when it is stated explicitly in the query string (e.g. ‘A ADJn B’), whereas MedEvi applies the positional restriction to every query, with a maximum distance between query terms, similar to the ‘n’ of ‘ADJn’, that users can adjust.
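The positional restriction can be sketched as a filter over token positions. The code below is an illustration under stated assumptions, not MedEvi's or Lucene's algorithm: the names `tokenize`, `proximity_hits` and `max_distance` are hypothetical, distance is measured in tokens, and the sample text is invented.

```python
import re
from itertools import product

def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def proximity_hits(text, terms, max_distance=5):
    """Return position tuples (one position per term) whose span is at most max_distance tokens."""
    tokens = tokenize(text)
    # For each query term, collect every token position where it occurs.
    positions = [[i for i, tok in enumerate(tokens) if tok == term.lower()]
                 for term in terms]
    hits = []
    # Try every combination of one position per term and keep the close ones.
    for combo in product(*positions):
        if max(combo) - min(combo) <= max_distance:
            hits.append(combo)
    return hits

# Invented sample text, not taken from the literature.
text = ("ada strongly activates the promoter while a distant and unrelated "
        "paragraph later mentions that something else may inhibit growth")
print(proximity_hits(text, ["ada", "activates"]))  # → [(0, 2)]
print(proximity_hits(text, ["ada", "inhibit"]))    # → []
```

The second query returns nothing because the two terms are 17 tokens apart; raising `max_distance` to 20 would accept that pair, which mirrors the user-adjustable distance described above.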
MedEvi allows multi-term queries composed with Boolean operators (e.g. AND, OR). It differs from other search engines that also allow multi-term queries [e.g. PubMed (http://www.ncbi.nlm.nih.gov/sites/entrez), HubMed (http://www.hubmed.org)]: while those search engines return a list of MEDLINE abstracts, MedEvi directly displays text fragments that may show semantic relations between the given terms. It also differs from text mining tools that display text fragments, mostly sentences [e.g. iHOP (http://www.ihop-net.org/UniPub/iHOP/), MEDIE (http://www-tsujii.is.s.u-tokyo.ac.jp/medie/)]: while those tools focus on certain biological entities such as proteins (iHOP) (Hoffmann and Valencia, 2005) or on certain grammatical structures such as subject-verb-object (MEDIE), MedEvi does not impose any syntactic or semantic restrictions and can therefore be used in any biomedical domain. We explain the features of MedEvi in the next section.
Users of MedEvi have found the tool useful for finding evidence in the literature, for example, to see whether candidate chemicals are involved in a metabolic pathway, to identify the proteins that regulate given proteins, and to find whether a multi-term ontology concept actually appears in the literature, even with a high degree of syntactic variation. Note that the applications above are generally concerned with semantic relations between biomedical concepts. Selected example queries can be found on the web page of MedEvi.
MedEvi receives a query either through the standard user interface on the entry page or via the advanced user interface. It retrieves MEDLINE abstracts relevant to the query by using an Apache Lucene index (http://lucene.apache.org) that covers the whole set of MEDLINE abstracts and is updated on a bi-monthly basis. It then outputs hypertext that consists of aligned occurrences matching the query, with hyperlinks attached to additional candidate keywords. Figure 1 shows an example output with the top 10 occurrences of the query “(ada OR acrR) AND (activat* OR inhibit*)”.
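How a query of the form shown above can be evaluated against a text fragment is sketched below. This is a simplified stand-in that does not use Lucene: the helpers `parse_query` and `matches` are hypothetical, only a flat conjunction of parenthesized OR-groups is supported, and wildcards are handled with Python's `fnmatch` module.

```python
import re
from fnmatch import fnmatchcase

def parse_query(query):
    """Split '(a OR b) AND (c* OR d*)' into [['a', 'b'], ['c*', 'd*']]."""
    groups = []
    for part in re.split(r"\s+AND\s+", query.strip()):
        part = part.strip().strip("()")
        # Lower-case the terms so that matching is case-insensitive.
        groups.append([term.strip().lower() for term in re.split(r"\s+OR\s+", part)])
    return groups

def matches(tokens, groups):
    """True if every AND-group has at least one pattern matching some token."""
    return all(
        any(fnmatchcase(tok, pat) for tok in tokens for pat in group)
        for group in groups
    )

groups = parse_query("(ada OR acrR) AND (activat* OR inhibit*)")
print(groups)  # → [['ada', 'acrr'], ['activat*', 'inhibit*']]

# Invented token lists, already lower-cased.
print(matches("acrr inhibits expression".split(), groups))  # → True
print(matches("ada binds dna".split(), groups))             # → False
```

In a fuller sketch, this Boolean test would be combined with a positional restriction so that the matching tokens must also lie close to each other.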
MedEvi also recognizes query terms that are UniProt accession numbers (e.g. P06134 for ‘Ada’), and it automatically expands them to sets of synonymous terms, so that instead of specifying a set of names denoting a protein, one can use a UniProt accession number to locate strings associated with this accession number. The estimated precision and recall of the module for recognizing gene/protein names are 91.5% and 94%, respectively, when we accept nested terms as correct matches (Rebholz-Schuhmann et al., 2007).
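The expansion of an accession number into a disjunction of synonyms can be sketched as follows. Everything here is an assumption for illustration: the `SYNONYMS` table is a hypothetical stand-in for a UniProt-derived dictionary (its entries are placeholders, not UniProt data), the `ACCESSION` pattern is a simplified form covering six-character accessions such as P06134, and `expand_term` is not MedEvi's actual function.

```python
import re

# Hypothetical lookup table; a real system would draw synonyms from UniProt.
SYNONYMS = {
    "P06134": ["Ada", "ada protein"],  # placeholder synonyms for illustration
}

# Simplified pattern for six-character accessions such as P06134 (an assumption).
ACCESSION = re.compile(r"^[OPQ][0-9][A-Z0-9]{3}[0-9]$")

def expand_term(term):
    """Replace a known accession number by an OR-group of its synonyms."""
    if ACCESSION.match(term) and term in SYNONYMS:
        # Quote multi-word synonyms so that they stay together in the query.
        quoted = [f'"{s}"' if " " in s else s for s in SYNONYMS[term]]
        return "(" + " OR ".join(quoted) + ")"
    return term

print(expand_term("P06134"))    # → (Ada OR "ada protein")
print(expand_term("activat*"))  # → activat*
```

Terms that do not look like accession numbers, or that are missing from the table, pass through unchanged, so the expansion can be applied to every query term.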
MedEvi provides three links for each additional candidate keyword to help users expand their queries: one to add the keyword to the old query, another to replace the old query with the new keyword, and a third to show information about the keyword from well-known databases (e.g. UniProt, Gene Ontology).
MedEvi is supplementary to existing search engines and text mining tools in the biomedical domain. By combining different search techniques such as concordance, positional restriction, semantic restriction and keyword lookup, it significantly improves the presentation of results and offers new information-seeking capabilities.
MEDLINE abstracts are provided by the NLM (Bethesda, MD, USA), and PubMed (www.pubmed.org) is the premier Web portal to access the data. Antonio Jimeno Yepes contributed to the improvement of MedEvi's semantic search capabilities. This work has been inspired by his contributions to the TREC Genomics Track competition 2007.
Funding: This research was sponsored by the EC STREP project ‘BOOTStrep’ (FP6-028099, www.bootstrep.org).
Conflict of Interest: none declared.