MetaMap Variants are variants of words and terms that have been computed
from a variety of variant generation methods. This mixture includes
recursively defined derivational variants, recursively defined synonyms,
acronyms and abbreviations, acronym and abbreviation expansions, spelling
variants, and useful combinations of all of the above methods. All
inflectional variants of the resulting variants are also computed.
The following topics are covered:
- Specifying Options
- Filter Options
- Input Format Description
- Output Format Description
- Batch/Multi Processor/Fault Tolerant Operation
- Non Restarting Single Processor Operation
- Scripts/Mains/Config Files involved
- Options available to the scripts and programs
- Options exclusively for the PGvMonitor Program
- Options exclusively for the PGvWorker Program
- Options exclusively for the GenerateMMVariants Program
The prerequisites to run this program include:
CLASSPATH
The classpath needs to include the installation path to
$MMTx/mmtx/classes, where $MMTx is the top level directory that
contains the MMTx software.
The classpath needs to include the installation path to the
$MMTx/mmtx/config directory.
The classpath needs to include the path to MySQL's JDBC driver
jar file, in the $MMTx/mmtx/mm.mysql.jdbc-1.2c directory (or the
path to your version of the JDBC driver, if different from MySQL's).
(An example cshrc with the correct class paths set is provided
in the $MMTx/mmtx/config directory.)
These programs require that the DFBuilder option be installed
when MMTx is installed.
These programs rely on the Java Lexical tools, which are
installed as part of the DFBuilder installation. These programs
also rely on the text file pointed to by the --inflectionTable variable.
The inflected term field from this table is used to glean terms from
the lexicon. The lexical tools and this inflection table should be
checked to be in sync: both should come from the same source.
Options are applied in a fixed order. Options from mmtx.cfg override
the default options from mmtxRegistry.cfg. Options from a second
configuration file, specified on the command line via the
--configName= option, override settings from the registry
configuration as well as the standard config file. Options given on
the command line override all others. The editable
GenerateMMVariants.cfg is provided so that options specific to
the variant generation process can be kept there, rather than having to
specify possibly numerous options on the command line each time
this program is run.
Option Hierarchy:
- User specified command line options (highest precedence), which override
- Command line specified configuration file(s), which override
- $MMTx/config/mmtx.cfg, which overrides
- $MMTx/config/mmtxRegistry.cfg (lowest precedence).
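The hierarchy above behaves like a layered lookup in which each higher layer overrides the ones beneath it. The following is a minimal sketch in Python, assuming each source has already been parsed into a dictionary. The resolve_options helper and the sample values are illustrative only and are not part of MMTx.

```python
def resolve_options(registry, mmtx_cfg, extra_cfgs, command_line):
    """Merge option layers from lowest to highest precedence."""
    resolved = {}
    # Each later layer overrides the earlier ones, mirroring the hierarchy.
    for layer in [registry, mmtx_cfg, *extra_cfgs, command_line]:
        resolved.update(layer)
    return resolved


# Hypothetical option values, for illustration only.
registry = {"numberOfFiles": "55", "fileName": "all_words.sorted"}
mmtx_cfg = {"fileName": "my_words.txt"}
extra_cfgs = [{"numberOfFiles": "10"}]   # e.g. a file named via --configName=
command_line = {"fileName": "cli_words.txt"}

options = resolve_options(registry, mmtx_cfg, extra_cfgs, command_line)
print(options)
```

Here the command line value for fileName wins over both configuration files, and the numberOfFiles value from the extra configuration file wins over the registry default.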
Filter Options
The program generates all variants. Included in the output are flags
that indicate whether or not a particular filter would throw out
this term.
There is an option to filter the resulting list back to the input
set of words, effectively making this a filter-to-target flow.
Some variant types are more aggressive than others. You may want
to keep only those derivational variants that are noun/adjective
transformations. This form of derivation is thought to be more
conservative and less likely to introduce a drastic shift in meaning.
Ambiguous acronyms and abbreviations are known
to cause many infelicities. You may want to filter out all ambiguous
acronym/abbreviation variants, that is, include these variants
only if exactly one acronym/abbreviation exists for a given input.
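The unique acronym/abbreviation rule can be illustrated with a short sketch. This is not the MMTx implementation: the tuple layout, the kind labels and the sample rows are assumptions made for the example. It keeps an acronym or expansion variant only when the input term has exactly one variant of that kind.

```python
from collections import defaultdict

# Hypothetical labels for acronym/abbreviation variants and their expansions.
AA_KINDS = {"acronym", "expansion"}


def drop_ambiguous_acronyms(rows):
    """rows: list of (term, variant, kind) tuples; returns the filtered list."""
    variants_per_term = defaultdict(set)
    for term, variant, kind in rows:
        if kind in AA_KINDS:
            variants_per_term[(term, kind)].add(variant)
    return [
        (term, variant, kind)
        for term, variant, kind in rows
        if kind not in AA_KINDS or len(variants_per_term[(term, kind)]) == 1
    ]


# Illustrative rows only, not taken from real program output.
rows = [
    ("magnetic resonance imaging", "MRI", "acronym"),
    ("pt", "patient", "expansion"),
    ("pt", "physical therapy", "expansion"),
    ("pt", "pts", "inflection"),
]
filtered = drop_ambiguous_acronyms(rows)
print(filtered)
```

The two expansions of "pt" are dropped because they are ambiguous, while the single acronym and the non-acronym variant are kept.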
Input Format Description
The program expects to find a file called "all_words.sorted" in the
directory where this program is kicked off. The option
--fileName= can be used to indicate another file to be used
as the input file. This input file should contain one term per line.
Terms from the lexicon are always added to the input list. These
terms come from the inflected term field from the source data file
pointed to by the --inflectionTable variable. By default, this
table is data/lexicon/inflStatic200XLexicon.
Output Format Description
The output from this program is a table that includes the following
fields:
|Column Name     |Column Type   |Description
+----------------+--------------+-------------------------------------------------
| term           | varchar(250) | This is an input term.
+----------------+--------------+-------------------------------------------------
| termCategory   | varchar(5)   | This is the string representation of
|                |              | the part of speech.
+----------------+--------------+-------------------------------------------------
| variant        | varchar(250) | The output variant
+----------------+--------------+-------------------------------------------------
| variantCat     | varchar(5)   | The string representation of the
|                |              | variant category
+----------------+--------------+-------------------------------------------------
| distance       | int(11)      | This is the numeric sum of
|                |              | each transformation from the
|                |              | original term to the variant.
+----------------+--------------+-------------------------------------------------
| history        | varchar(10)  |
+----------------+--------------+-------------------------------------------------
| foundInvars    | tinyint(4)   | 1 if this term is found in the unfiltered table.
+----------------+--------------+-------------------------------------------------
| foundInvarsAN  | tinyint(4)   | 1 if this term is found in varsAN
|                |              | and
|                |              | 0 if this term is filtered out by
|                |              | the filter that filters out
|                |              | non noun/adj derivational variants.
+----------------+--------------+-------------------------------------------------
| foundInvarsANU | tinyint(4)   | 1 if this term is found in varsANU
|                |              | and
|                |              | 0 if this term is filtered out by the
|                |              | filter that filters out non noun/adj
|                |              | derivational variants and filters out
|                |              | non unique acronyms or their expansions.
+----------------+--------------+-------------------------------------------------
| foundInvarsU   | tinyint(4)   | 1 if this term is found in varsU
|                |              | and
|                |              | 0 if this term is filtered out by the
|                |              | filter that filters out non unique
|                |              | acronyms or their expansions.
+----------------+--------------+-------------------------------------------------
where the history and distance are noted as
                         |Lvg's(*)|MMTx's  |
                         |history |history |
Operation                |notation|notation|Distance score
-------------------------+--------+--------+---------------
No Operations            |n       |NULL    |0
Spelling Variant         |s       |p       |0
Inflectional Variant     |i       |i       |1
Uninflectional Variant   |b       |i       |1
Synonym                  |y       |s       |2
Acronym/Abbreviation     |A       |A       |2
Expansion                |a       |a       |2
Derivational Variant     |d       |d       |3
-------------------------+--------+--------+---------------
(*) Noted here for reference. Lvg's -f:G flow is used to generate
the variants. This program translates between the two notations.
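The relationship between history and distance can be sketched as follows. The two lookup tables are copied from the table above. The sketch assumes that a history is a string with one character per operation, which the source does not state explicitly, and the function names are illustrative rather than part of MMTx.

```python
# Lvg history character -> MMTx history character (from the table above).
LVG_TO_MMTX = {"s": "p", "i": "i", "b": "i", "y": "s", "A": "A", "a": "a", "d": "d"}

# Distance score per MMTx history character (from the table above).
MMTX_DISTANCE = {"p": 0, "i": 1, "s": 2, "A": 2, "a": 2, "d": 3}


def lvg_to_mmtx(lvg_history):
    """Translate an Lvg history string; 'n' (no operation) yields no character."""
    return "".join(LVG_TO_MMTX[op] for op in lvg_history if op != "n")


def distance(mmtx_history):
    """Sum the distance score of every operation in an MMTx history string."""
    return sum(MMTX_DISTANCE[op] for op in mmtx_history)


mmtx_history = lvg_to_mmtx("dyb")   # derivation, synonym, uninflection
print(mmtx_history, distance(mmtx_history))
```

Under these assumptions, a variant reached by a derivation, a synonym and an uninflection gets the MMTx history "dsi" and a distance of 3 + 2 + 1 = 6.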
Batch/Multi Processor/Fault Tolerant Operation
It is expected that the variant generation process will process word lists
on the order of 150,000 to 300,000 input inflected words. The generation
of variants is a slow process. Thus, there are two ways to run this process:
a program that runs through the entire list without stopping, and
batch scripts that break up the file into smaller files and then work on
the small files. The advantage of the batch method is that the process
can be restarted from the point where it left off, if the need arises.
This is useful for runs that take days, where external influences
(network, disk space, power) may kill the process. On Windows machines,
the scripts are run sequentially. On non-Windows machines, the PGvWorker
programs and the monitoring program are run in the background (using the &
on the command line). Thus, on non-Windows machines, the two Worker
programs will run concurrently. If there is a need to do the same on
the Windows platform, the scripts should be hand edited to add the "&"
command line argument.
The batch scripts are
./mmtx/bin/GenerateMMVariants[.bat]
and
./mmtx/bin/GenerateMMVariantsContinue[.bat]
Use ./mmtx/bin/GenerateMMVariantsContinue[.bat] to restart or continue
running the process until it completes.
The GenerateMMVariants batch scripts are composed of the following commands:
PGvMonitor: Breaks the input file down into N smaller files.
This number is set via the mmtxRegistry.cfg variable
--numberOfFiles. The default number of files is 55.
These input files are put in the ./subInputDir directory.
The input files are labeled SubFile0 through SubFile[N-1].
Each input file contains rows of the form
term|-2147483632
PGvWorker: This program processes the files from the
./subInputDir directory. It places its interim results
in the ./subOutputDir directory. This program
removes each ./subInputDir/ file once it has
successfully completed the entire file.
The --even option indicates that the program should pick up
and process the even numbered files. Otherwise, this
program picks up the odd numbered files.
The PGvWorker program must be called
twice, once with the --even option and once without
it, to process all files.
PGvMonitor --continue:
This command monitors the processes until there are
no longer any files in ./subInputDir. Once this condition
holds, the monitor program concatenates the files
from ./subOutputDir and places the results in the
file ./fullVars_[parentDirName].txt
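The split and the even/odd division of work described above can be sketched in a few lines. This is an illustration only, not the PGvMonitor or PGvWorker code. It assumes a round-robin distribution of terms, which the source does not specify, and it writes bare terms, omitting the second field of the term|-2147483632 row format because its meaning is not described here. The function names are hypothetical.

```python
import os
import tempfile


def split_input(terms, number_of_files, sub_input_dir):
    """Write terms into SubFile0 .. SubFile[N-1] (round-robin is an assumption)."""
    os.makedirs(sub_input_dir, exist_ok=True)
    buckets = [[] for _ in range(number_of_files)]
    for index, term in enumerate(terms):
        buckets[index % number_of_files].append(term)
    for number, bucket in enumerate(buckets):
        path = os.path.join(sub_input_dir, f"SubFile{number}")
        with open(path, "w") as handle:
            handle.writelines(term + "\n" for term in bucket)


def files_for_worker(sub_input_dir, even):
    """List the sub-files one worker would pick up, by parity of the file number."""
    wanted = 0 if even else 1
    prefix = "SubFile"
    names = [name for name in os.listdir(sub_input_dir) if name.startswith(prefix)]
    picked = [name for name in names if int(name[len(prefix):]) % 2 == wanted]
    return sorted(picked, key=lambda name: int(name[len(prefix):]))


with tempfile.TemporaryDirectory() as work_dir:
    sub_input_dir = os.path.join(work_dir, "subInputDir")
    split_input(["alpha", "beta", "gamma", "delta", "epsilon"], 4, sub_input_dir)
    even_files = files_for_worker(sub_input_dir, even=True)
    odd_files = files_for_worker(sub_input_dir, even=False)
    print(even_files, odd_files)
```

Because one worker takes the even numbered files and the other the odd numbered ones, the two workers never touch the same file, which is why both invocations are needed to cover the whole input.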
Non Restarting Single Processor Operation
The variant generation process can also be started via the script
./mmtx/bin/GenerateMMVariantsI[.bat]. This script calls the program
GenerateMMVariantsI, which processes the input file and puts its
output in the file ./fullVars_[parentDirName].txt.
Since no multi-processing is involved with this script, a couple of
options are available to it that are not available with the
batch version. Specifically, both the input and output can be
redirected to standard input or output via the --stdInput and --stdOutput
options. These options allow one to generate variants interactively.
Scripts/Mains/Config Files involved
Scripts:
mmtx/bin/GenerateMMVariants[.bat]
mmtx/bin/GenerateMMVariantsContinue[.bat]
mmtx/bin/GenerateMMVariantsI[.bat]
Java Main Programs referenced in the scripts:
mmtx/programs/GenerateMMVariantsI
mmtx/variantGeneration/PGvMonitor
mmtx/variantGeneration/PGvWorker
Config files involved:
mmtx/config/mmtxRegistry.cfg
mmtx/config/GenerateMMVariants.cfg
Options available to the scripts and programs
Long Name                   |Short Name|Description
----------------------------+----------+------------------------------------------
--help                      |          |Prints out the help.
--fileName=                 |          |The full path to an ASCII input file
                            |          |containing one term per line.
--customVariantsTableName=  |          |Specify the name of the table to be
                            |          |produced. This name will subsequently
                            |          |become the name of the table in the
                            |          |database and the name of the file that
                            |          |contains the data. The default is the
                            |          |name of the parent directory.
--toDB                      |          |Put the result into the variants database
                            |          |table. The name of the table will be
                            |          |fullVars, placed in the
                            |          |DB_[YEAR][CustomizationTag]_Lexicon
                            |          |database.
--an_derivational_variants  |-D        |Use only adjective/noun derivational
                            |          |variants.
--unique_acros_abbrs_only   |-u        |Use only unique acronyms and
                            |          |abbreviations.
--filterToANU               |-Du       |Filter the output using both of the
                            |          |above two filters.
--no_acros_abbrs            |-a        |Filter the output to exclude acronyms.
--no_derivational_variants  |-d        |Filter the output to exclude derivations.
--LexiconSource=            |          |The full path to a customized lexicon
                            |          |inflection.txt table.
--filterToTarget            |          |Filter out any generated variants that
                            |          |are not on the input word list.
--all_words.sorted=         |          |Alter the name of the default input file.
                            |          |[This has the same effect as the
                            |          |--fileName= option.]
----------------------------+----------+------------------------------------------
Options exclusively for the PGvMonitor Program
Long Name        |Description
-----------------+-----------------------------------------------------
--numberOfFiles  |Number of files to break the input file and words
                 |from the lexicon into.
--continue       |Monitor the status of the worker process(es) and
                 |concatenate the output files when there are no more
                 |input files available to be processed by the worker
                 |process(es).
-----------------+-----------------------------------------------------
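Options exclusively for the PGvWorker Program
Long Name        |Description
-----------------+-----------------------------------------------------
--even           |Process files from the ./subInputDir that have an
                 |even number appended to them. The default is to
                 |process only the odd numbered files from the
                 |./subInputDir directory.
-----------------+-----------------------------------------------------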
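Options exclusively for the GenerateMMVariants Program
Long Name        |Description
-----------------+-----------------------------------------------------
--toStdOut       |Send the generated variants to standard output.
--fromStdIn      |Take generating terms from standard input.
-----------------+-----------------------------------------------------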