use NED::RLESearch;
$ary_ref = $rleh->search($query);
$ary_ref = $rleh->search($query, $low_cutoff);

$hash_ref = $rleh->query;
$ary_ref = $rleh->queryRecs;
$ary_ref = $rleh->searchRecs;
$str = $rleh->searchSQL;
$low_cutoff = $rleh->lowCutoff;
$max_wgt = $rleh->maxWgt;
$max_wgt = $rleh->maxWgt($query);
($max_wgt, $max_wvec_ref) = $rleh->maxWgt;
($max_wgt, $max_wvec_ref) = $rleh->maxWgt($query);

$rleh = RLESearch->_new($dbi_handle);
$self->_compareExact($q_index, $s_index, $m_prob, $u_prob);
$self->_compareOrdinary($q_index, $s_index, $m_prob, $u_prob);
$self->_compareSurname($q_index, $s_index, $m_prob, $u_prob);
$self->_compareGivenName($q_index, $s_index, $m_prob, $u_prob);
$self->_compareNumeric($q_index, $s_index, $m_prob, $u_prob);
$self->_compareFreq($q_index, $s_index, \%AbsFreqTbl);
$self->_addComparator(sub { ... });
$self->_compareAdj($q_index, $s_index, $m_prob, $u_prob, \@adj_curv, $fixed, $case, $exact);
$self->_compareFreqAdj($q_index, $s_index, \%AbsFreqTbl, \@adj_curv, $fixed, $case, $exact);
$self->_addComparator(\&sub);
$wgt = w_strcmp($s1, $s2, $fixed, $case, $exact);
$wgt = adjust_strcmp_val($cmp_val, $agrwgt, $diswgt, @adj_curv);
Record linking is easy in situations where a decision can be made based on the agreement or disagreement of a single attribute, for example, an employee identification number or the Social Security Number (SSN). However, it becomes more difficult when the records to be linked do not contain such an attribute, and the decision must be based either on a single attribute that may partially agree (such as a name) or several attributes of which only some may agree (such as organization, office address, and office telephone number).
The more difficult cases may be handled by applying probabilistic record linking. Briefly, the record linking engine RLESearch calculates a number, called a binit weight, which is the log base 2 of the odds that two records constitute a linked pair, i.e., that they belong to the same individual. Thus, a positive binit weight of, say, +10 indicates that the odds are about 1,000 to 1 in favor of a linkage, a negative binit weight of -10 indicates odds of about 1,000 to 1 against a linkage (an unlinked pair), and a binit weight of 0 indicates even odds in favor of (or against) a linkage.
The degree of certainty that a linkage is correct depends upon the comparisons of available attributes (or fields) of the records, and the outcomes of these comparisons. Generally, agreement between the values of an attribute in a pair of records argues in favor of accepting them as a linked pair, while disagreement of attribute values is characteristic of an unlinked pair.
However, agreement on different attributes and values carries varying significance. For example:
* It is more likely that two records that agree only on surname are linked than two records that agree only on first name.

* It is more likely that two records that agree on surname = "GORLEN" are linked than two records that agree on surname = "SMITH".

* It is more likely that two records with surnames = "SMITH" and "SMITHE" are linked than two records with surnames = "SMITH" and "BROWN".
When multiple comparisons involving various attributes and values are performed on a pair of records, the overall odds of correct linkage are calculated by simply multiplying together the odds of the individual comparisons. However, it is customary to express the odds as a binit weight, which is the log base 2 of the odds, and to then calculate the total binit weight by summing the binit weights of the individual comparisons.
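The arithmetic above can be sketched in a few lines of Perl. The helper names and the three odds values are illustrative assumptions, not part of RLESearch; the point is only that multiplying odds and summing binit weights give the same answer.

```perl
use strict;
use warnings;

# Convert odds to a binit weight (log base 2) and back again.
sub odds_to_binit { my ($odds) = @_; return log($odds) / log(2); }
sub binit_to_odds { my ($wgt)  = @_; return 2 ** $wgt; }

# Hypothetical odds from three comparisons on one pair of records.
my @odds = (64, 8, 0.5);

# Overall odds: the product of the individual odds (64 * 8 * 0.5 = 256).
my $total_odds = 1;
$total_odds *= $_ for @odds;

# Overall binit weight: the sum of the individual weights (6 + 3 - 1 = 8).
my $total_wgt = 0;
$total_wgt += odds_to_binit($_) for @odds;

printf "total odds %.1f to 1, total binit weight %.2f\n", $total_odds, $total_wgt;
printf "a weight of +10 corresponds to odds of %.0f to 1\n", binit_to_odds(10);
```

Working with sums rather than products also keeps very large or very small odds in a convenient numeric range.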
Note that the representative set of linked pairs need not be large (a few hundred is sufficient to start with), and it need not be perfect.
A representative set of unlinked pairs is not required if simple comparisons are used, because the outcome frequencies can be calculated. Care must be taken when performing complicated (and more powerful) comparisons, which involve:
* specific attribute values, e.g. "GORLEN" and "SMITH"

* partial agreement of values, e.g. "SMITH" and "SMITHE"

* comparisons of logically related identifiers, e.g. surnames agree given that the SOUNDEX codes of the surnames agree
RLESearch is intended for applications where M x N < 1000, such as interactive searches of a database for matches to a single, manually-entered query record, or as a component of a meta-directory ``join rule'' for handling difficult cases.
RLESearch is designed for use as a base class. An application-specific subclass provides functions for comparing record attributes (called comparators) and the sets of query and search records. However, RLESearch does provide templates for some common comparators:
* exact comparison, appropriate for short (< 4 character) identifiers, initials, and acronyms

* approximate comparison, appropriate for names of people, streets, businesses, etc.

* specialized approximate comparison for surnames, given names, and numbers such as house numbers

* frequency-based approximate comparison, which adjusts for the decreased significance of matches on commonly occurring attribute values such as "SMITH" compared to rare values such as "GORLEN".
$ary_ref = $rleh->search($query);
$ary_ref = $rleh->search($query, $low_cutoff);
Searches the connected database for records that probably match $query, and returns the results as an array reference. $rleh is a handle returned by the new method of an RLESearch subclass.
The argument $query specifies a query as pairs of attributes and values and may be either a string or a hash reference. If a string, it must use Perl's syntax for initializing a hash, excluding the enclosing braces. Search converts a query string to upper case and evals it into a hash. If $query is a reference, search uses it as-is.
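The handling of the two query forms can be sketched as follows. This is a minimal illustration of the behaviour described above, not the module's actual code: the helper name query_to_hash and the attribute names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of RLESearch): turn a query argument into
# a hash reference in the way the text describes.
sub query_to_hash {
    my ($query) = @_;
    return $query if ref $query eq 'HASH';   # a reference is used as-is
    my %hash = eval(uc $query);              # upper-case the string, then eval it as a list
    die "bad query string: $@" if $@;
    return \%hash;
}

# String form: Perl hash-initializer syntax without the enclosing braces.
my $from_string = query_to_hash(q{surname => 'Smith', given_name => 'John'});

# Reference form: passed through unchanged.
my $from_ref = query_to_hash({ SURNAME => 'SMITH' });

print join(', ', map { "$_=$from_string->{$_}" } sort keys %$from_string), "\n";
```

Note that upper-casing the whole string affects attribute names and values alike, and that a string query is executed as Perl code by eval, so string queries should come only from trusted input.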
Pairs with binit weights less than or equal to $low_cutoff are discarded. If not specified, search uses a $low_cutoff of 0.0.
Search returns results as an array reference. Each element of the result array is a reference to an array with the following elements:
[$wgt, $skey, \@wvec, \@qrec, \@srec]
where:
$hash_ref = $rleh->query;
Returns a reference to the query hash from the last call to RLESearch::search.
$ary_ref = $rleh->queryRecs;
Returns a reference to an array of references to the query records from the last call to RLESearch::search.
$ary_ref = $rleh->searchRecs;
Returns a reference to an array of references to the search records from the last call to RLESearch::search.
$low_cutoff = $rleh->lowCutoff;
Returns the low cutoff binit weight from the last call to RLESearch::search.
$max_wgt = $rleh->maxWgt;
$max_wgt = $rleh->maxWgt($query);
($max_wgt, $max_wvec_ref) = $rleh->maxWgt;
($max_wgt, $max_wvec_ref) = $rleh->maxWgt($query);
Returns the maximum possible binit weight for the query specified by $query (see search), or from the last call to RLESearch::search if $query is not supplied. This is the maximum weight obtained by comparing all the query records to themselves.
Calling maxWgt in array context also returns the individual comparator weights, of which $max_wgt is the sum.
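The scalar and array calling conventions can be illustrated with Perl's wantarray. The function below is a hypothetical stand-in that only mimics the return convention described above; it does not reproduce how maxWgt computes its weights, and the sample values are illustrative.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a context-sensitive method such as maxWgt.
sub max_wgt_sketch {
    my @wvec = @_;                 # individual comparator weights
    my $max  = 0;
    $max += $_ for @wvec;          # the total is the sum of the vector
    return wantarray ? ($max, \@wvec) : $max;
}

my @weights = (4.5, 3.25, 2.0);    # illustrative values only

my $scalar_result       = max_wgt_sketch(@weights);   # scalar context: total only
my ($max_wgt, $vec_ref) = max_wgt_sketch(@weights);   # array context: total and vector

print "scalar: $scalar_result\n";
print "array:  $max_wgt = ", join(' + ', @$vec_ref), "\n";
```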
$ary_ref = $self->_genQueryRecs(\%query);
Returns a reference to an array of references to the query records corresponding to the argument \%query (see search).
$ary_ref = $self->_genSearchRecs(\%query);
Returns a reference to an array of references to the records to be searched for linkage to the argument \%query (see search).
$self = $class->SUPER::_new;
Returns a handle to a new RLESearch instance.
$self->_compareExact($q_index, $s_index, $m_prob, $u_prob);
$self->_compareOrdinary($q_index, $s_index, $m_prob, $u_prob);
$self->_compareSurname($q_index, $s_index, $m_prob, $u_prob);
$self->_compareGivenName($q_index, $s_index, $m_prob, $u_prob);
$self->_compareNumeric($q_index, $s_index, $m_prob, $u_prob);
$self->_compareFreq($q_index, $s_index, \%ftbl);
A subclass calls these methods to define the comparators for RLESearch to use. To compare a pair of records, RLESearch calls comparators in the order in which they were defined, saves the weight returned by each in consecutive elements of the binit weight vector for the pair, and sums the weights to obtain the binit weight for the pair.
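The calling order and summation described above can be sketched as below. Everything here is a hypothetical illustration: make_exact_comparator and compare_pair are not RLESearch functions, the probabilities are made up, and the weight formulas are the standard Fellegi-Sunter ones (agreement weight log2(m/u), disagreement weight log2((1-m)/(1-u))), which this sketch assumes rather than takes from the module.

```perl
use strict;
use warnings;

sub log2 { return log($_[0]) / log(2) }

# Hypothetical factory for an exact comparator on one attribute.
# Assumption: $m_prob is the agreement probability among linked pairs and
# $u_prob the agreement probability among unlinked pairs.
sub make_exact_comparator {
    my ($q_index, $s_index, $m_prob, $u_prob) = @_;
    my $agree    = log2($m_prob / $u_prob);
    my $disagree = log2((1 - $m_prob) / (1 - $u_prob));
    return sub {
        my ($qrec, $srec) = @_;
        return $qrec->[$q_index] eq $srec->[$s_index] ? $agree : $disagree;
    };
}

# Comparators are kept in the order in which they were defined.
my @comparators = (
    make_exact_comparator(0, 0, 0.9, 0.01),   # e.g. surname
    make_exact_comparator(1, 1, 0.8, 0.05),   # e.g. first initial
);

# Call each comparator in order, store its weight in the next element of
# the weight vector, and sum the weights to get the weight for the pair.
sub compare_pair {
    my ($qrec, $srec) = @_;
    my @wvec;
    my $total = 0;
    for my $cmp (@comparators) {
        my $w = $cmp->($qrec, $srec, \@wvec);
        push @wvec, $w;
        $total += $w;
    }
    return ($total, \@wvec);
}

my ($wgt, $wvec) = compare_pair(['SMITH', 'J'], ['SMITH', 'P']);
printf "weights: %s  total: %.2f\n",
    join(' ', map { sprintf '%.2f', $_ } @$wvec), $wgt;
```

Because later comparators receive the weight vector built so far, they can take the outcome of earlier comparisons into account.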
$self->_compareAdj($q_index, $s_index, $m_prob, $u_prob, \@adj_curv, $fixed, $case, $exact);
$self->_compareFreqAdj($q_index, $s_index, \%ftbl, \@adj_curv, $fixed, $case, $exact);
A subclass can use these functions to specify its own adjustment curves and weighted string comparison options.
($agrwgt - ($agrwgt-$diswgt)*(1-$cmp_val)*$adj_curv[3])
Comparison values less than or equal to $adj_curv[0] and greater than $adj_curv[1] receive the weight determined by the formula:
($agrwgt - ($agrwgt-$diswgt)*(1-$cmp_val)*$adj_curv[2])
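A rough sketch of a piecewise adjustment built from the two formulas follows. Only the middle band (values less than or equal to $adj_curv[0] and greater than $adj_curv[1]) is stated in this section. The sketch assumes that the formula using $adj_curv[3] applies to values greater than $adj_curv[0] and that values at or below $adj_curv[1] receive $diswgt; both are assumptions, and adjust_sketch is a hypothetical name, not the module's adjust_strcmp_val. The curve values are made up.

```perl
use strict;
use warnings;

# Sketch only. The first branch's condition and the final fallback are
# assumptions; only the middle band is described in the text.
sub adjust_sketch {
    my ($cmp_val, $agrwgt, $diswgt, @adj_curv) = @_;
    my $span = $agrwgt - $diswgt;
    if ($cmp_val > $adj_curv[0]) {               # assumed condition
        return $agrwgt - $span * (1 - $cmp_val) * $adj_curv[3];
    }
    if ($cmp_val > $adj_curv[1]) {               # band stated in the text
        return $agrwgt - $span * (1 - $cmp_val) * $adj_curv[2];
    }
    return $diswgt;                              # assumed fallback
}

my @curve = (0.95, 0.8, 2.0, 1.0);               # hypothetical adjustment curve

for my $cmp_val (1.0, 0.9, 0.5) {
    printf "cmp_val %.2f -> weight %.2f\n",
        $cmp_val, adjust_sketch($cmp_val, 10, -5, @curve);
}
```

In both formulas a comparison value of 1 yields exactly $agrwgt, and the weight falls linearly as the comparison value decreases, at a rate set by the corresponding curve element.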
$self->_addComparator(\&sub);
Enables a subclass to provide its own customized comparators. RLESearch calls comparators with the argument list (\@qrec, \@srec, \@wvec). The definitions of these arguments are the same as described for RLESearch::search.
This permits the implementation of very general comparators, which can test multiple attributes of the query and search records and, by examining @wvec, determine the outcome of previous comparisons. For example, a subclass could use _compareGivenName and _compareOrdinary to compare given and middle name attributes, and use _addComparator to supply a comparator that checks \@wvec to see whether both of these comparisons failed, as indicated by negative weights in the corresponding elements. If so, the custom comparator could cross-compare the given and middle name attributes to see whether they might have been swapped, and return a smaller agreement weight if so.
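The swapped-name check described above might look like the following. The record layout, the positions of the two weights in @wvec, the reduced weight of 2.0, and the choice to return 0 when no swap is detected are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical layout: element 1 of each record is the given name and
# element 2 the middle name; the weights of the two earlier comparisons
# are assumed to sit at the same positions in the weight vector.
my ($GIVEN, $MIDDLE)     = (1, 2);
my ($W_GIVEN, $W_MIDDLE) = (1, 2);
my $SWAP_WGT             = 2.0;   # illustrative reduced agreement weight

my $swap_comparator = sub {
    my ($qrec, $srec, $wvec) = @_;

    # Only act when both earlier name comparisons failed.
    return 0 unless $wvec->[$W_GIVEN] < 0 && $wvec->[$W_MIDDLE] < 0;

    # Cross-compare given and middle names to detect a swap.
    my $swapped = $qrec->[$GIVEN]  eq $srec->[$MIDDLE]
               && $qrec->[$MIDDLE] eq $srec->[$GIVEN];
    return $swapped ? $SWAP_WGT : 0;
};

# In a subclass this would be registered with
# $self->_addComparator($swap_comparator);

my @qrec = ('SMITH', 'JOHN', 'PAUL');
my @srec = ('SMITH', 'PAUL', 'JOHN');
my @wvec = (9.1, -3.2, -2.8);     # surname agreed; both name comparisons failed

printf "swap weight: %.1f\n", $swap_comparator->(\@qrec, \@srec, \@wvec);
```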
$wgt = w_strcmp($s1, $s2, $fixed, $case, $exact);
Returns the weight calculated by NED::Strcmp95::strcmp95 on strings $s1 and $s2 after padding the shorter string with spaces to make the string lengths equal. See custom comparators above for a description of $fixed, $case, and $exact.
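The padding step alone can be sketched as follows. The helper pad_to_equal is hypothetical and covers only the padding described above; it does not call NED::Strcmp95::strcmp95.

```perl
use strict;
use warnings;

# Hypothetical helper: right-pad the shorter string with spaces so that
# both strings have the same length.
sub pad_to_equal {
    my ($s1, $s2) = @_;
    my $len = length($s1) > length($s2) ? length($s1) : length($s2);
    return map { sprintf '%-*s', $len, $_ } ($s1, $s2);
}

my ($p1, $p2) = pad_to_equal('SMITH', 'SMITHE');
printf "[%s] [%s]\n", $p1, $p2;
```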
$wgt = adjust_strcmp_val($cmp_val, $agrwgt, $diswgt, @adj_curv);
Applies the adjustment curve @adj_curv to $cmp_val as described under custom comparators above.
Keith Gorlen
Center for Information Technology
National Institutes of Health
Federal Building, Room 816A
7550 Wisconsin Ave MSC 9100
BETHESDA MD 20892-9100
Phone: 301-496-1111, FAX: 301-594-1151
Email: kg2d@nih.gov
NED::RegSearch    Registration Search using RLESearch
NED::AbsFreqTbl   Absolute frequency tables for record linking
NED::Strcmp95     Weighted string comparison
http://www.census.gov/srd/www/reclink/reclink.html