NAME

	More about Record Linking
	Outline of Usage
	RLESearch Public Instance Methods
	RLESearch Subclass-Provided Methods
	RLESearch Subclass-Restricted Methods

AUTHOR

NAME

RLESearch - Probabilistic record linking engine

SYNOPSIS

  use NED::RLESearch;

  $ary_ref = $rleh->search($query);
  $ary_ref = $rleh->search($query, $low_cutoff);

  $hash_ref = $rleh->query;
  $ary_ref = $rleh->queryRecs;
  $ary_ref = $rleh->searchRecs;
  $str = $rleh->searchSQL;
  $ary_ref = $rleh->lowCutoff;
  $max_wgt = $rleh->maxWgt;
  $max_wgt = $rleh->maxWgt($query);
  ($max_wgt, $max_wvec_ref) = $rleh->maxWgt;
  ($max_wgt, $max_wvec_ref) = $rleh->maxWgt($query);

  $rleh = RLESearch->_new($dbi_handle);
  $self->_compareExact($q_index, $s_index,
                       $m_prob, $u_prob);
  $self->_compareOrdinary($q_index, $s_index,
                          $m_prob, $u_prob);
  $self->_compareSurname($q_index, $s_index,
                         $m_prob, $u_prob);
  $self->_compareGivenName($q_index, $s_index,
                           $m_prob, $u_prob);
  $self->_compareNumeric($q_index, $s_index,
                         $m_prob, $u_prob);
  $self->_compareFreq($q_index, $s_index, \%AbsFreqTbl);
  $self->_addComparator(sub { ... });
  $self->_compareAdj($q_index, $s_index, $m_prob, $u_prob,
                     \@adj_curv, $fixed, $case, $exact);
  $self->_compareFreqAdj($q_index, $s_index, \%AbsFreqTbl,
                         \@adj_curv, $fixed, $case, $exact);
  $self->_addComparator(\&sub);
  $wgt = w_strcmp($s1, $s2, $fixed, $case, $exact);
  $wgt = adjust_strcmp_val($cmp_val, $agrwgt, $diswgt, @adj_curv);

DESCRIPTION

Record linking refers to the process of determining if two records belong to the same individual.

Record linking is easy in situations where a decision can be made based on the agreement or disagreement of a single attribute, for example, an employee identification number or the Social Security Number (SSN). However, it becomes more difficult when the records to be linked do not contain such an attribute, and the decision must be based either on a single attribute that may partially agree (such as a name) or several attributes of which only some may agree (such as organization, office address, and office telephone number).

The more difficult cases may be handled by applying probabilistic record linking. Briefly, the record linking engine RLESearch calculates a number, called a binit weight, which is the log base 2 of the odds that two records constitute a linked pair, i.e., that they belong to the same individual. Thus, a positive binit weight of, say, +10 indicates that the odds are about 1,000 to 1 in favor of a linkage, a negative binit weight of -10 indicates odds of about 1,000 to 1 against a linkage (an unlinked pair), and a binit weight of 0 indicates even odds in favor of (or against) a linkage.

Outline of Usage

RLESearch calculates the binit weight for all possible pairs of M query records and N search records, discards all pairs with binit weights less than or equal to a low cutoff value, keeps only the most probable pairing for each specific search record, and returns an array of the probable linked pairs (i.e. matches), ordered by decreasing binit weight.

RLESearch is intended for applications where M x N < 1000, such as interactive searches of a database for matches to a single, manually-entered query record, or as a component of a meta-directory ``join rule'' for handling difficult cases.

RLESearch is designed for use as a base class. An application-specific subclass provides functions for comparing record attributes (called comparators) and the sets of query and search records. However, RLESearch does provide templates for some common comparators:

exact comparison, appropriate for short (< 4 character) identifiers, initials, and acronyms

approximate comparison, appropriate for names of people, streets, businesses, etc.

specialized approximate comparison for surnames, given names, and numbers such as house numbers

frequency-based approximate comparison, which adjusts for the decreased significance of matches on commonly occuring attribute values such as ``SMITH'' compared to rare values such as ``GORLEN''.

Typically, a calling application will:

establish a connection to a database using DBI, obtaining a database handle;
call the subclass new method with the database handle as an argument, obtaining a handle for a search object that has been initialized with comparators and any necessary frequency tables;
call the RLESearch::search method with a query string, obtaining a reference to an array containing search results;
process the search results;
repeat steps 3 and 4 for different query strings, if desired.

RLESearch also provides utility methods that an application can call after performing a search to obtain additional details about the last search.

RLESearch Public Instance Methods

search

  $ary_ref = $rleh->search($query);
  $ary_ref = $rleh->search($query, $low_cutoff);

Searches the connected database for probable records that match $query, and returns the results as an array reference. $rleh is a handle returned by the new method of an RLESearch subclass.

The argument $query specifies a query as pairs of attributes and values and may be either a string or a hash reference. If a string, it must use Perl's syntax for initializing a hash, excluding the enclosing braces. Search converts a query string to upper case and evals it into a hash. If $query is a reference, search uses it as-is.

Pairs with binit weights less than or equal to $low_cutoff are discarded. If not specified, search uses a $low_cutoff of 0.0.

Search returns results as an array reference. Each element of the result array is a reference to an array with the following elements:

  [$wgt, $skey, \@wvec, \@qrec, \@srec]

where:

$wgt: the binit weight of the linked pair, which is the sum of the \@wvec weights.
$skey: the key (usually a sequence number) of the search record of the linked pair, which must be element 0 of every search record.
\@wvec: the binit weight vector of the linked pair, which is comprised of the weights returned by each of the comparators
\@qrec: a reference to the array containing the query record of the linked pair
\@srec: a reference to the array containing the search record of the linked pair

query

  $hash_ref = $rleh->query;

Returns a reference to the last query hash from the last call to RLESearch::search.

queryRecs

  $ary_ref = $rleh->queryRecs;

Returns a reference to an array of references to the query records from the last call to RLESearch::search.

searchRecs

  $ary_ref = $rleh->searchRecs;

Returns a reference to an array of references to the search records from the last call to RLESearch::search.

lowCutoff

  $ary_ref = $rleh->lowCutoff;

Returns the low cutoff binit weight from the last call to RLESearch::search.

maxWgt

  $max_wgt = $rleh->maxWgt;
  $max_wgt = $rleh->maxWgt($query);
  ($max_wgt, $max_wvec_ref) = $rleh->maxWgt;
  ($max_wgt, $max_wvec_ref) = $rleh->maxWgt($query);

Returns the maximum possible binit weight for the query specified by $query (see search), or from the last call to RLESearch::search if $query is not supplied. This is the maximum weight obtained by comparing all the query records to themselves.

Calling maxWgt in array context also returns the individual comparator weights, of which $max_wgt is the sum.

RLESearch Subclass-Provided Methods

An RLESearch subclass must implement the methods described in this section.

_genQueryRecs

  $ary_ref = $self->_genQueryRecs(\%query);

Returns an array reference to an array of references to the query records corresponding to the argument \%query (see search).

_genSearchRecs

  $ary_ref = $self->_genSearchRecs(\%query);

Returns an array reference to an array of references to the records to be searched for linkage to the argument \%query (see search).

RLESearch Subclass-Restricted Methods

The methods described in this section are intended to be called by RLESearch subclasses.

_new

  $self = $class->SUPER::_new;

Returns a handle to a new RLESearch instance.

comparator templates

  $self->_compareExact($q_index, $s_index,
                       $m_prob, $u_prob);
  $self->_compareOrdinary($q_index, $s_index,
                          $m_prob, $u_prob);
  $self->_compareSurname($q_index, $s_index,
                         $m_prob, $u_prob);
  $self->_compareGivenName($q_index, $s_index,
                           $m_prob, $u_prob);
  $self->_compareNumeric($q_index, $s_index,
                         $m_prob, $u_prob);
  $self->_compareFreq($q_index, $s_index, \%ftbl);

A subclass calls these methods to define the comparators for RLESearch to use. To compare a pair of records, RLESearch calls comparators in the order in which they were defined, saves the weight returned by each in consecutive elements of the binit weight vector for the pair, and sums the weights to obtain the binit weight for the pair.

_compareExact: Performs an exact comparison of two attribute values, returning the full agreement weight if they match exactly, the full disagreement weight if they do not match, and zero if either or both are null. Exact comparison is appropriate for comparing short (< 4 character) identifiers, initials, and acronyms.
_compareOrdinary: Performs a weighted comparison of two attribute values using Strcmp95, and adjusts the weight to a value between the full agreement weight (for an exact match) and the full disagreement weight, depending on the similarity of the values. The adjustment is done using a general-purpose curve. Returns zero if either of both attribute values are null. Ordinary comparison is appropriate for comparing names of people, streets, businesses, etc.
_compareSurname: Similar to _compareOrdinary, but uses an adjustment curve specially designed for comparing surnames.
_compareGivenName: Similar to _compareOrdinary, but uses an adjustment curve specially designed for comparing given names.
_compareNumeric: Similar to _compareOrdinary, but uses an adjustment curve specially designed for comparing numbers such as house numbers.
_compareFreq: Similar to _compareOrdinary, but uses value-specific weights provided by an instance of class AbsFreqTbl.

Comparator template arguments are as follows:

$q_index: Query record array index of the query attribute to be compared.
$s_index: Search record array index of the search attribute to be compared.
$m_prob: M probability, i.e, the probability that a linked pair agrees on this attribute.
$u_prob: U probability, i.e, the probability that an unlinked pair agrees on this attribute.
\%ftbl: A reference to the AbsFreqTbl object to be used by this comparator.

custom comparators

  $self->_compareAdj($q_index, $s_index, $m_prob, $u_prob,
                     \@adj_curv, $fixed, $case, $exact);
  $self->_compareFreqAdj($q_index, $s_index, \%ftbl,
                     \@adj_curv, $fixed, $case, $exact);

A subclass can use these functions to specify its own adjustment curves and weighted string comparison options.

\@adj_curv

A four-element array that specifies a piecewise-linear adjustment curve to apply to the string comparison weight. String comparison values above the value of $adj_curv[0] receive the full agreement values. Comparison values ($cmp_val) less than or equal to the value of $adj_curv[1] receive the weight determined by the formula:

  ($agrwgt - ($agrwgt-$diswgt)*(1-$cmp_val)*$adj_curv[3])

Comparison values less than or equal to $adj_curv[0] and greater than $adj_curv[1] receive the weight determined by the formula:

  ($agrwgt - ($agrwgt-$diswgt)*(1-$cmp_val)*$adj_curv[2])

fixed

When 0, increases the probability of a match when the number of matched characters is large. This option allows for a little more tolerance when the strings are large. It is not an appropriate test when comparing fixed length fields such as phone and social security numbers.

case

When 0, all lower case characters are converted to upper case prior to the comparison.

exact

When 0, counts similar characters in addition to exact comparisons.

_addComparator

  $self->_addComparator(\&sub);

Enables a subclass to provide its own customized comparators. RLESearch calls comparators with the argument list (\@qrec, \@srec, \@wvec). The definitions of these arguments are the same as described for RLESearch::search.

This permits the implementation of very general comparators, which can test multiple attributes of the query and search records, and by examining @wvec, determine the outcome of previous comparisons. For example, a subclass could use _compareGivenName and _compareOrdinary to compare given and middle name attrbutes, and use _addComparator to supply a comparator which checked \@wvec to see if both these comparisons failed, as indicated by negative weights for the corresponding elements. If so, the custom comparator could cross-compare the given and middle name attributes to see if they might have been swapped, and return a smaller aggreement weight if so.

w_strcmp

  $wgt = w_strcmp($s1, $s2, $fixed, $case, $exact);

Returns the weight calculated by NED::Strcmp95::strcmp95 on strings $s1 and $s2 after padding the shorter string with spaces to make the string lengths equal. See custom comparators above for a description of $fixed, $case, and $exact.

adjust_strcmp_val

  $wgt = adjust_strcmp_val($cmp_val, $agrwgt, $diswgt, @adj_curv);

Applies the adjustment curve @adj_curv to $cmp_val as described under custom comparators above.

AUTHOR

  Keith Gorlen
  Center for Information Technology
  National Institutes of Health
  Federal Building, Room 816A
  7550 Wisconsin Ave MSC 9100
  BETHESDA MD 20892-9100
  Phone: 301-496-1111, FAX: 301-594-1151
  Email: kg2d@nih.gov

NAME

SYNOPSIS

DESCRIPTION

More about Record Linking

Outline of Usage

RLESearch Public Instance Methods

RLESearch Subclass-Provided Methods

RLESearch Subclass-Restricted Methods

AUTHOR

SEE ALSO

Related Classes

Books and Journals

Related WWW Links

ACKNOWLEDGEMENTS

	It is more likely that two records that agree only on surname are linked than two records that agree only on first name.
	It is more likely that two records that agree on surname = ``GORLEN'' are linked than two records that agree on surname = ``SMITH''.
	It is more likely that two records with surnames = ``SMITH'' and ``SMITHE'' are linked than two records with surnames = ``SMITH'' and ``BROWN''.

	specific attribute values, e.g. ``GORLEN'' and ``SMITH''
	partial agreement of values, e.g. ``SMITH'' and ``SMITHE''
	comparisons of logically related identifiers, e.g. surnames agree given that the SOUNDEX codes of the surnames agree

	exact comparison, appropriate for short (< 4 character) identifiers, initials, and acronyms
	approximate comparison, appropriate for names of people, streets, businesses, etc.
	specialized approximate comparison for surnames, given names, and numbers such as house numbers
	frequency-based approximate comparison, which adjusts for the decreased significance of matches on commonly occuring attribute values such as ``SMITH'' compared to rare values such as ``GORLEN''.