MetaMap Variant Generation

MetaMap Variants are variants of words and terms computed by a variety of variant generation methods. These methods include recursively defined derivational variants, recursively defined synonyms, acronyms and abbreviations, acronym and abbreviation expansions, spelling variants, and useful combinations of the above. All inflectional variants of the resulting variants are also computed.

This page covers the following topics:

Specifying Options

Filter Options

Input Format Description

Output Format Description

Batch/Multi Processor/Fault Tolerant Operation

Non Restarting Single Processor Operation

Scripts/Mains/Config Files involved

Options available to the scripts and programs

Options exclusively for the PGvMonitor Program

Options exclusively for the PGvWorker Program

Options exclusively for the GenerateMMVariants Program

The prerequisites to run this program include:

CLASSPATH: The classpath needs to include the installation path to $MMTX/mmtx/classes, where $MMTX is the top level directory that contains the MMTx software.

The classpath needs to include the installation path to the $MMTX/mmtx/config directory.

The classpath needs to include the path to MySQL's JDBC driver jar file in the $MMTX/mmtx/mm.mysql.jdbc-1.2c directory (or the path to your own JDBC driver, if you use a database other than MySQL).

(An example cshrc with the correct classpaths set is provided in the $MMTX/mmtx/config directory.)

These programs require that the DFBuilder option be installed when MMTx is installed.
These programs rely on the Java Lexical Tools, which are installed as part of the DFBuilder installation. They also rely on the text file pointed to by the --inflectionTable variable. The inflected term field of this table is used to glean terms from the lexicon. The lexical tools and this inflection table should be checked to be in sync: both should come from the same source.
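The classpath entries listed above can be assembled programmatically. The following is a minimal sketch, not part of MMTx: the function name `build_classpath` and its parameters are hypothetical, and it assumes the three entries are simply joined with the platform's path separator.

```python
import os


def build_classpath(mmtx_home, jdbc_path=None, sep=os.pathsep):
    """Join the three classpath entries named in the prerequisites.

    mmtx_home is the top level directory that contains the MMTx software.
    jdbc_path is an assumption of this sketch: pass your own driver path
    if you do not use the MySQL driver directory named in the text.
    """
    base = os.path.join(mmtx_home, "mmtx")
    if jdbc_path is None:
        jdbc_path = os.path.join(base, "mm.mysql.jdbc-1.2c")
    entries = [
        os.path.join(base, "classes"),  # compiled MMTx classes
        os.path.join(base, "config"),   # configuration directory
        jdbc_path,                      # JDBC driver
    ]
    return sep.join(entries)
```

The resulting string could then be exported as CLASSPATH before the scripts are started.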

Options are applied in a fixed order. Options from mmtx.cfg override the default options from mmtxRegistry.cfg. Options from a second configuration file, specified on the command line via the --configName= option, override settings from the Registry configuration as well as the standard config file. Options given on the command line override all others. The editable GenerateMMVariants.cfg is provided so that options specific to the variant generation process can be kept there, rather than having to specify possibly numerous options on the command line each time one of these programs is run.

Option Hierarchy:

User specified command line options (highest precedence), which override

Command line specified configuration file(s), which override

$MMTx/config/mmtx.cfg, which overrides

$MMTx/config/mmtxRegistry.cfg (lowest precedence).
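The hierarchy above amounts to a layered merge in which later layers win. The sketch below illustrates that idea only; it is not MMTx's actual option parser, and the function name `resolve_options` and the dictionary representation of each layer are assumptions made for illustration.

```python
def resolve_options(registry, mmtx_cfg, extra_cfgs, command_line):
    """Merge option layers from lowest to highest precedence.

    registry     -- defaults, as from mmtxRegistry.cfg (lowest precedence)
    mmtx_cfg     -- settings, as from mmtx.cfg
    extra_cfgs   -- list of layers from files named with --configName=
    command_line -- options typed by the user (highest precedence)
    Each layer is a plain dict here, which is an assumption of this sketch.
    """
    merged = {}
    for layer in [registry, mmtx_cfg, *extra_cfgs, command_line]:
        merged.update(layer)  # a later layer overrides an earlier one
    return merged
```

For example, if the registry sets numberOfFiles to 55 and the command line sets it to 10, the merged result holds 10.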

Filter Options

The program generates all variants. Included in the output are flags that indicate whether or not a particular filter would throw out this term.

There is an option to filter the resulting list back to the input set of words, effectively making this a filter to target flow.
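Filtering back to the input set is, in effect, a membership test on each generated variant. A minimal sketch, assuming rows are held as (term, variant) pairs; the function name `filter_to_target` and the sample words are illustrative and do not come from MMTx output.

```python
def filter_to_target(rows, input_terms):
    """Keep only rows whose variant is itself on the input word list.

    rows        -- iterable of (term, variant) pairs
    input_terms -- iterable of the input words
    """
    targets = set(input_terms)
    return [(term, variant) for term, variant in rows if variant in targets]


# Illustrative data only.
rows = [("cold", "colds"), ("cold", "chill"), ("colds", "cold")]
kept = filter_to_target(rows, ["cold", "colds"])
```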

Some of these variants are more aggressive than others. You may want to filter to include only derivational variants that are noun/adj transformations. This form of derivation is thought to be more conservative, and less likely to include a drastic meaning shift.

Ambiguous acronyms and abbreviations are known to cause many infelicities. You may want to filter out all ambiguous acronym/abbreviation variants. That is, only include these variants if only one acronym/abbreviation exists for a given input.
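The uniqueness rule for acronyms and abbreviations can be sketched as follows. This is an illustration under stated assumptions, not the filter MMTx implements: rows are assumed to be (term, variant, history) tuples, the history letters "A" and "a" are assumed to mark acronym/abbreviation and expansion steps, and uniqueness is assumed to be counted per input term and per letter.

```python
from collections import defaultdict

ACRONYM_OPS = ("A", "a")  # assumed history letters for acronym steps


def drop_ambiguous_acronyms(rows):
    """Drop acronym/abbreviation variants that are not unique for a term."""
    seen = defaultdict(set)
    for term, variant, history in rows:
        for op in ACRONYM_OPS:
            if op in history:
                seen[(term, op)].add(variant)
    kept = []
    for term, variant, history in rows:
        ambiguous = any(
            op in history and len(seen[(term, op)]) > 1 for op in ACRONYM_OPS
        )
        if not ambiguous:
            kept.append((term, variant, history))
    return kept


# Illustrative data only: "ca" has two expansions, so both are dropped.
rows = [
    ("ca", "calcium", "a"),
    ("ca", "cancer", "a"),
    ("nih", "national institutes of health", "a"),
    ("ca", "cas", "i"),
]
kept = drop_ambiguous_acronyms(rows)
```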

Input Format Description

The program expects to find a file called "all_words.sorted" in the directory where this program is kicked off. The option --fileName= can be used to indicate another file to be used as the input file. This input file should contain one term per line.

Terms from the lexicon are always added to the input list. These terms come from the inflected term field from the source data file pointed to by the --inflectionTable variable. By default, this table is data/lexicon/inflStatic200XLexicon.

Output Format Description

The output from this program is a table that includes the following fields:
|Column Name     |Column Type   |Description
+----------------+--------------+-------------------------------------------------
| term           | varchar(250) |This is an input term.
+----------------+--------------+-------------------------------------------------
| termCategory   | varchar(5)   |This is the string representation
|                |              |of the part of speech.
+----------------+--------------+-------------------------------------------------
| variant        | varchar(250) |The output variant
+----------------+--------------+-------------------------------------------------
| variantCat     | varchar(5)   |The string representation of the
|                |              |variant category
+----------------+--------------+-------------------------------------------------
| distance       | int(11)      |This is the numeric sum of
|                |              |each transformation from the
|                |              |orig term to the variant.
+----------------+--------------+-------------------------------------------------
| history        | varchar(10)  |
+----------------+--------------+-------------------------------------------------
| foundInvars    | tinyint(4)   |1 if this term is found in the unfiltered table.
+----------------+--------------+-------------------------------------------------
| foundInvarsAN  | tinyint(4)   |1 if this term is found in varsAN
|                |              |and
|                |              |0 if this term is filtered out by
|                |              |the filter that filters out
|                |              |non noun/adj derivational variants.
+----------------+--------------+-------------------------------------------------
| foundInvarsANU | tinyint(4)   |1 if this term is found in varsANU
|                |              |and
|                |              |0 if this term is filtered out by the
|                |              |filter that filters out non noun/adj
|                |              |derivational variants and filters out
|                |              |non unique acronyms or their expansions.
+----------------+--------------+-------------------------------------------------
| foundInvarsU   | tinyint(4)   |1 if this term is found in varsU
|                |              |0 if this term is filtered out by the
|                |              |filter that filters out non noun/adj
|                |              |derivational variants and filters out
|                |              |non unique acronyms or their expansions.
+----------------+--------------+-------------------------------------------------
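For downstream processing, the ten columns above map naturally onto a small record type. The sketch below is hypothetical: the page does not state how the fields are laid out in the output text file, so the pipe delimiter, the field order, and the names `VariantRow` and `parse_row` are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class VariantRow:
    term: str
    term_category: str
    variant: str
    variant_cat: str
    distance: int
    history: str
    found_in_vars: bool
    found_in_vars_an: bool
    found_in_vars_anu: bool
    found_in_vars_u: bool


def parse_row(line, delimiter="|"):
    """Parse one delimited line into a VariantRow (assumed layout)."""
    fields = line.rstrip("\n").split(delimiter)
    if len(fields) != 10:
        raise ValueError(f"expected 10 fields, got {len(fields)}")
    term, term_cat, variant, variant_cat, distance, history, *flags = fields
    # The four tinyint flags are read as booleans: "1" means found.
    return VariantRow(
        term, term_cat, variant, variant_cat, int(distance), history,
        *(flag == "1" for flag in flags),
    )
```

With rows in this form, selecting only the variants that pass the noun/adj filter is a matter of keeping the rows whose `found_in_vars_an` flag is set.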

where the history and distance are noted as
                         |Lvg's(*)|MMTx's  |
                         |history |history |
 Operation               |Notation|Notation|Distance score
-------------------------+--------+--------+---------------
 No Operations           |n       |NULL    |0
 Spelling Variant        |s       |p       |0
 Inflectional Variant    |i       |i       |1
 Uninflectional Variant  |b       |i       |1
 Synonym                 |y       |s       |2
 Acronym/Abbreviation    |A       |A       |2
 Expansion               |a       |a       |2
 Derivational Variant    |d       |d       |3
-------------------------+--------+--------+---------------

(*) Noted here for reference. Lvg's -f:G flow is used to generate the variants. This program translates between the two notations.
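The notation translation and the distance score can be illustrated with two lookup tables taken from the table above. This is a sketch, not the program's code: it assumes a history is a plain string with one letter per operation, and the function names are hypothetical.

```python
# Lvg letter -> MMTx letter, from the table above ("n" maps to NULL).
LVG_TO_MMTX = {
    "n": "", "s": "p", "i": "i", "b": "i",
    "y": "s", "A": "A", "a": "a", "d": "d",
}

# Lvg letter -> distance contribution, from the table above.
DISTANCE = {
    "n": 0, "s": 0, "i": 1, "b": 1,
    "y": 2, "A": 2, "a": 2, "d": 3,
}


def translate_history(lvg_history):
    """Translate an Lvg history string; None stands in for NULL."""
    translated = "".join(LVG_TO_MMTX[op] for op in lvg_history)
    return translated or None


def distance(lvg_history):
    """Sum the distance score of each operation in the history."""
    return sum(DISTANCE[op] for op in lvg_history)
```

Under these assumptions, an Lvg history of "dyi" (derivation, synonym, inflection) becomes "dsi" in MMTx notation with a distance of 3 + 2 + 1 = 6.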

Batch/Multi Processor/Fault Tolerant Operation

The variant generation process is expected to handle word lists on the order of 150,000 to 300,000 input inflected words, and generating variants is slow. There are therefore two ways to run the process: a program that runs through the entire list without stopping, and batch scripts that break the file into smaller files and then work on the smaller files. The advantage of the batch method is that the process can be restarted from the point where it left off, if the need arises. This is useful for runs that take days, where external influences (network, disk space, power) may kill the process. On Windows machines, the scripts are run sequentially. On non-Windows machines, the PGvWorker programs and the monitoring program are run in the background (using & on the command line), so the two worker programs run concurrently. If the same is needed on the Windows platform, the scripts should be hand edited to add the "&" command line argument.
The batch scripts are ./mmtx/bin/GenerateMMVariants[.bat] and ./mmtx/bin/GenerateMMVariantsContinue[.bat].
Use ./mmtx/bin/GenerateMMVariantsContinue[.bat] to restart or continue running the process until it completes.

The GenerateMMVariants batch scripts are composed of the following commands:

PGvMonitor: Breaks the input file down into N smaller files. N is set via the mmtxRegistry.cfg variable --numberOfFiles; the default is 55 files. These input files are put in the ./subInputDir directory and are labeled SubFile0 through SubFile[N-1]. Each input file contains rows of the form term|-2147483632

PGvWorker: Processes the files from the ./subInputDir directory and places its interim results in the ./subOutputDir directory. It removes each ./subInputDir/ file once it has successfully completed the entire file. The --even option tells the program to pick up and process even numbered files; otherwise, it picks up the odd numbered files. PGvWorker must be called twice, once with the --even option and once without it, to process all files.

PGvMonitor --continue: Monitors the processes until there are no longer any files in ./subInputDir. Once this condition exists, the monitor program concatenates the files from ./subOutputDir and places the results in the file ./fullVars_[parentDirName].txt

Non Restarting Single Processor Operation

The variant generation process can also be kicked off via the script ./mmtx/bin/GenerateMMVariantsI[.bat]. This script calls the program GenerateMMVariantsI, which processes the input file and puts its output in the file ./fullVars_[parentDirName].txt.

Since no multi-processing is involved with this script, a couple of options are available to it that are not available in the batch version. Specifically, both the input and output can be reassigned to standard input or output via the --stdInput and --stdOutput options. These options allow one to interactively generate variants.

Scripts/Mains/Config Files involved

Scripts:
mmtx/bin/GenerateMMVariants[.bat]
mmtx/bin/GenerateMMVariantsContinue[.bat]
mmtx/bin/GenerateMMVariantsI[.bat]

Java Main Programs referenced in the scripts:
mmtx/programs/GenerateMMVariantsI
mmtx/variantGeneration/PGvMonitor
mmtx/variantGeneration/PGvWorker

Config files involved:
mmtx/config/mmtxRegistry.cfg
mmtx/config/GenerateMMVariants.cfg

Options available to the scripts and programs

Long Name                   | Short Name | Description
----------------------------+------------+------------------------------------------
--help                      |            | Prints out the help.
--fileName=                 |            | The full path to an ASCII input file containing one term per line.
--customVariantsTableName=  |            | Specify the name of the table to be produced. This name becomes the name of the table in the database and the name of the file that contains the data. The default is the name of the parent directory.
--toDB                      |            | Put the result into the variants database table. The table is named fullVars and is placed in the DB_[YEAR][CustomizationTag]_Lexicon database.
--an_derivational_variants  | -D         | Use only adjective/noun derivational variants.
--unique_acros_abbrs_only   | -u         | Use only unique acronyms and abbreviations.
--filterToANU               | -Du        | Filter the output using both of the above filters.
--no_acros_abbrs            | -a         | Filter the output to exclude acronyms.
--no_derivational_variants  | -d         | Filter the output to exclude derivations.
--LexiconSource=            |            | The full path to a customized lexicon inflection.txt table.
--filterToTarget            |            | Filter out any generated variants that are not on the input word list.
--all_words.sorted=         |            | Alter the name of the default input file. (This has the same effect as the --fileName= option.)

Options exclusively for the PGvMonitor Program

Long Name         | Description
------------------+------------------------------------------
--numberOfFiles   | Number of files to break the input file and words from the lexicon into.
--continue        | Monitors the status of the worker process(es) and concatenates the output files when there are no more input files available to be processed by the worker process(es).

Options exclusively for the PGvWorker Program

Long Name         | Description
------------------+------------------------------------------
--even            | Process files from the ./subInputDir that have an even number appended to them. The default is to only process the odd numbered files from the ./subInputDir directory.

Options exclusively for the GenerateMMVariants Program

Long Name         | Description
------------------+------------------------------------------
--toStdOut        | Send the generated variants to standard output.
--fromStdIn       | Take generating terms from standard input.

Last Modified: March 30, 2007

Lister Hill National Center for Biomedical Communications

U.S. National Library of Medicine

National Institutes of Health

Department of Health and Human Services