[I didn’t do the bullet right for the next line; it should be more to the left in the margin, as in the current TC page.]

· Text Categorization Java, UTF-8, 2007 Release: The TC (Text Categorization) project provides tools for high-level categorization based on the JDI (Journal Descriptor Indexing) methodology. JDI tools automatically categorize biomedical input text, returning a ranked list, with scores between 0 and 1, of either JDs (Journal Descriptors, corresponding to biomedical disciplines) or STs (UMLS® Semantic Types). Applications include categorization by JD as pre-processing of text for NLP (natural language processing), and WSD (word sense disambiguation) according to ST. JDI tools are based on research in the JDI project (link JDI project to: http://lhncbc.nlm.nih.gov/csb/CSBPages/JDIproject.shtml), where the tools were originally developed in Lisp. The tools have since been developed in Java as part of the TC project for public interactive use and distribution as an open-source package.

Click here for JDI Interactive Tools.
Click here to download the TC Package.
Click here for a description of the JDI methodology.

[The rest is the page linked from the last line above, and should have the TC header with TC graphic. It would be nice if the page contents could fit on one page (no scrolling needed).]

JDI Methodology

The NLM (National Library of Medicine®) maintains two broad, relatively small classifications:

• A set of 122 descriptors from MeSH® (Medical Subject Headings®), known as JDs (journal descriptors), used for manually indexing the MEDLINE® journals themselves by discipline. These are found in the List of Journals Indexed for MEDLINE, which also lists journal titles under these descriptors. For example, Journal of Pediatric Surgery is listed under both Pediatrics and Surgery.
• A set of 135 STs (semantic types) in the Semantic Network in NLM's UMLS (Unified Medical Language System®).
Concepts in the UMLS Metathesaurus® are assigned one or more STs, which semantically characterize those concepts. For example, the Metathesaurus concept Aspirin is assigned the STs Pharmacologic Substance and Organic Chemical.

JDI is based on statistical word-JD associations derived from a training set of MEDLINE citations. Each citation in the training set is given the JDs of its journal, looked up by the journal unique identifier in the citation. For example, words in articles in the Journal of Pediatric Surgery become statistically associated with the JDs Pediatrics and Surgery. An input text composed of words similar to those in these articles would then be categorized by the same JDs. Using the words in the input, JDI ranks the JDs by the average, over those words, of each JD's score in the word-JD associations. For example, the first three JDs, with scores, returned by JDI for the input "appendectomy in children" are: 1 0.7311 Surgery, 2 0.6856 Pediatrics, and 3 0.4661 Gastroenterology.

The JDI methodology is the basis for STI (Semantic Type Indexing). For each ST, an ST “document” is created, composed of the UMLS Metathesaurus strings belonging to that ST, and each of these documents undergoes JDI. Statistical word-ST associations are then calculated by comparing the JDI of individual training set words with the JDI of these ST documents. Using the words in the input, STI ranks the STs by the average, over those words, of each ST's score in the word-ST associations. For example, the first three STs, with scores, returned by STI for the input "appendectomy in children" are: 1 0.5985 Age Group, 2 0.5520 Finding, and 3 0.5498 Therapeutic or Preventive Procedure. That is, the average Age Group score for words in the input is higher than the average score for any other ST.

An alternative method of STI compares the JDI of the input to the JDI of each ST document, and ranks the STs by how similar the JDI of the input is to the JDI of their ST documents. By this method, the JDI of this input is most similar to the JDI of the Age Group document.
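The averaging step that JDI uses to rank JDs can be sketched in Java as follows. This is a minimal illustration and not code from the TC package: the class JdiRankingSketch, its methods, and the per-word scores are all hypothetical, and the scores were made up so that the averages are easy to check by hand. The sketch also assumes that input words with no word-JD association (here, "in") are simply skipped, which the description above does not specify.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only: names and data are hypothetical, not the TC package API.
class JdiRankingSketch {

    /**
     * Ranks descriptors by the average of their scores over the input words
     * that have associations. Returns descriptor -> average score, in
     * descending order of score.
     */
    static Map<String, Double> rank(String text, Map<String, Map<String, Double>> wordScores) {
        Map<String, Double> sums = new HashMap<>();
        int counted = 0;
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            Map<String, Double> scores = wordScores.get(word);
            if (scores == null) {
                continue; // assumption: words without associations are ignored
            }
            counted++;
            for (Map.Entry<String, Double> e : scores.entrySet()) {
                sums.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        // Sorting by sum gives the same order as sorting by average,
        // because every sum is divided by the same word count.
        List<Map.Entry<String, Double>> entries = new ArrayList<>(sums.entrySet());
        entries.sort(Map.Entry.<String, Double>comparingByValue().reversed());

        Map<String, Double> ranked = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : entries) {
            ranked.put(e.getKey(), e.getValue() / counted);
        }
        return ranked;
    }

    /** Made-up word-JD scores for demonstration; these are not real JDI values. */
    static Map<String, Map<String, Double>> demoAssociations() {
        return Map.of(
            "appendectomy", Map.of("Surgery", 0.875, "Pediatrics", 0.25, "Gastroenterology", 0.625),
            "children", Map.of("Surgery", 0.5, "Pediatrics", 0.875, "Gastroenterology", 0.125));
    }

    public static void main(String[] args) {
        rank("appendectomy in children", demoAssociations())
            .forEach((jd, score) -> System.out.printf("%.4f %s%n", score, jd));
    }
}
```

With these made-up scores the sketch ranks Surgery first, then Pediatrics, then Gastroenterology, the same order as in the JDI example, although the numbers themselves are not JDI output. The same averaging could be applied to word-ST associations for STI by passing a map of word-ST scores instead.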
JDI and STI have actual and potential applications, particularly when embedded in programs of the SKR (Semantic Knowledge Representation) project (link SKR (Semantic Knowledge Representation) to: http://skr.nlm.nih.gov). For example, JDI is being used by SemRep (link SemRep to: http://skr.nlm.nih.gov/papers/index.shtml#SemRep), an NLP program; JDI increases accuracy by identifying MEDLINE citations in the molecular genetics domain before NLP begins.

STI has been applied to WSD. If the senses of an ambiguous word are expressed by candidate STs for its meaning, STI can be performed on the context surrounding the word (phrase, sentence, or abstract). The expectation is that, in the STI of the context, the correct ST for the word will rank higher than the other candidate STs. STI is being evaluated for WSD in MetaMap (link MetaMap to: http://skr.nlm.nih.gov/papers/index.shtml#MetaMap).
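The WSD use of STI amounts to picking, from the candidate STs of the ambiguous word, the one that scores highest in the STI of the surrounding context. A minimal Java sketch of that selection step follows. The class StiWsdSketch, its method, and all scores are hypothetical and are not taken from MetaMap or the TC package; the ambiguous word "cold" and its two candidate STs are used only as an illustration. The sketch assumes the STI of the context is already available as a map from ST name to score.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Illustrative sketch only: names and scores are hypothetical.
class StiWsdSketch {

    /**
     * Returns the candidate ST with the highest score in the STI of the
     * context, or an empty Optional if no candidate appears in that STI.
     */
    static Optional<String> disambiguate(List<String> candidateSts,
                                         Map<String, Double> contextStScores) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String st : candidateSts) {
            Double score = contextStScores.get(st);
            if (score != null && score > bestScore) {
                best = st;
                bestScore = score;
            }
        }
        return Optional.ofNullable(best);
    }

    public static void main(String[] args) {
        // Made-up STI scores for a context such as a sentence containing "cold".
        Map<String, Double> contextSti = Map.of(
            "Finding", 0.875,
            "Disease or Syndrome", 0.75,
            "Natural Phenomenon or Process", 0.5);
        List<String> candidates = List.of("Disease or Syndrome", "Natural Phenomenon or Process");
        System.out.println(disambiguate(candidates, contextSti).orElse("(no candidate ranked)"));
    }
}
```

Only the candidate STs are compared with one another, so an ST that ranks higher in the context overall (Finding in this made-up data) does not affect the choice.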