NAME
SYNOPSIS
DESCRIPTION
More about Record Linking
Outline of Usage
RLESearch Public Instance Methods
RLESearch Subclass-Provided Methods
RLESearch Subclass-Restricted Methods
AUTHOR
SEE ALSO
Related Classes
Books and Journals
Related WWW Links
ACKNOWLEDGEMENTS

 

NAME

RLESearch - Probabilistic record linking engine

 

 

SYNOPSIS

  use NED::RLESearch;

 

  $ary_ref = $rleh->search($query);
  $ary_ref = $rleh->search($query, $low_cutoff);

 

  $hash_ref = $rleh->query;
  $ary_ref = $rleh->queryRecs;
  $ary_ref = $rleh->searchRecs;
  $str = $rleh->searchSQL;
  $ary_ref = $rleh->lowCutoff;
  $max_wgt = $rleh->maxWgt;
  $max_wgt = $rleh->maxWgt($query);
  ($max_wgt, $max_wvec_ref) = $rleh->maxWgt;
  ($max_wgt, $max_wvec_ref) = $rleh->maxWgt($query);

 

  $rleh = RLESearch->_new($dbi_handle);
  $self->_compareExact($q_index, $s_index,
                       $m_prob, $u_prob);
  $self->_compareOrdinary($q_index, $s_index,
                          $m_prob, $u_prob);
  $self->_compareSurname($q_index, $s_index,
                         $m_prob, $u_prob);
  $self->_compareGivenName($q_index, $s_index,
                           $m_prob, $u_prob);
  $self->_compareNumeric($q_index, $s_index,
                         $m_prob, $u_prob);
  $self->_compareFreq($q_index, $s_index, \%AbsFreqTbl);
  $self->_addComparator(sub { ... });
  $self->_compareAdj($q_index, $s_index, $m_prob, $u_prob,
                     \@adj_curv, $fixed, $case, $exact);
  $self->_compareFreqAdj($q_index, $s_index, \%AbsFreqTbl,
                         \@adj_curv, $fixed, $case, $exact);
  $self->_addComparator(\&sub);
  $wgt = w_strcmp($s1, $s2, $fixed, $case, $exact);
  $wgt = adjust_strcmp_val($cmp_val, $agrwgt, $diswgt, @adj_curv);

 

 

DESCRIPTION

Record linking refers to the process of determining if two records belong to the same individual.

Record linking is easy in situations where a decision can be made based on the agreement or disagreement of a single attribute, for example, an employee identification number or the Social Security Number (SSN). However, it becomes more difficult when the records to be linked do not contain such an attribute, and the decision must be based either on a single attribute that may partially agree (such as a name) or several attributes of which only some may agree (such as organization, office address, and office telephone number).

The more difficult cases may be handled by applying probabilistic record linking. Briefly, the record linking engine RLESearch calculates a number, called a binit weight, which is the log base 2 of the odds that two records constitute a linked pair, i.e., that they belong to the same individual. Thus, a positive binit weight of, say, +10 indicates that the odds are about 1,000 to 1 in favor of a linkage, a negative binit weight of -10 indicates odds of about 1,000 to 1 against a linkage (an unlinked pair), and a binit weight of 0 indicates even odds in favor of (or against) a linkage.

 

 

More about Record Linking

Linkage is the bringing together of information from two database records that relate to the same individual. Calculating the likelihood that the linkage is correct makes the linkage process probabilistic.

The degree of certainty that a linkage is correct depends upon the comparisons of available attributes (or fields) of the records, and the outcomes of these comparisons. Generally, agreement between the values of an attribute in a pair of records argues in favor of accepting them as a linked pair, while disagreement of attribute values is characteristic of an unlinked pair.

However, agreement of various attributes and values have varying significance. For example:

 
It is more likely that two records that agree only on surname are linked than two records that agree only on first name.

 

It is more likely that two records that agree on surname = ``GORLEN'' are linked than two records that agree on surname = ``SMITH''.

 

It is more likely that two records with surnames = ``SMITH'' and ``SMITHE'' are linked than two records with surnames = ``SMITH'' and ``BROWN''.

 

The odds that a linkage is correct can be calculated by measuring the frequency of the outcomes of a comparison applied to a representative set of linked pairs (called the m probability), and dividing that by the frequency of the outcomes of the same comparison applied to a representative set of unlinked pairs (called the u probability).

When multiple comparisons involving various attributes and values are performed on a pair of records, the overall odds of correct linkage are calculated by simply multiplying together the odds of the individual comparisons. However, it is customary to express the odds as a binit weight, which is the log base 2 of the odds, and to then calculate the total binit weight by summing the binit weights of the individual comparisons.

Note that the representative set of linked pairs need not be large (a few hundred is sufficient to start with), and it need not be perfect.

A representative set of unlinked pairs is not required if simple comparisons are used, because the outcome frequencies can be calculated. Care must be taken when performing complicated (and more powerful) comparisons, which involve:

 
specific attribute values, e.g. ``GORLEN'' and ``SMITH''

 

partial agreement of values, e.g. ``SMITH'' and ``SMITHE''

 

comparisons of logically related identifiers, e.g. surnames agree given that the SOUNDEX codes of the surnames agree

 

 

Outline of Usage

RLESearch calculates the binit weight for all possible pairs of M query records and N search records, discards all pairs with binit weights less than or equal to a low cutoff value, keeps only the most probable pairing for each specific search record, and returns an array of the probable linked pairs (i.e. matches), ordered by decreasing binit weight.

RLESearch is intended for applications where M x N < 1000, such as interactive searches of a database for matches to a single, manually-entered query record, or as a component of a meta-directory ``join rule'' for handling difficult cases.

RLESearch is designed for use as a base class. An application-specific subclass provides functions for comparing record attributes (called comparators) and the sets of query and search records. However, RLESearch does provide templates for some common comparators:

 
exact comparison, appropriate for short (< 4 character) identifiers, initials, and acronyms

 

approximate comparison, appropriate for names of people, streets, businesses, etc.

 

specialized approximate comparison for surnames, given names, and numbers such as house numbers

 

frequency-based approximate comparison, which adjusts for the decreased significance of matches on commonly occuring attribute values such as ``SMITH'' compared to rare values such as ``GORLEN''.

 

Typically, a calling application will:

 

  1. establish a connection to a database using DBI, obtaining a database handle;

     

  2. call the subclass new method with the database handle as an argument, obtaining a handle for a search object that has been initialized with comparators and any necessary frequency tables;

     

  3. call the RLESearch::search method with a query string, obtaining a reference to an array containing search results;

     

  4. process the search results;

     

  5. repeat steps 3 and 4 for different query strings, if desired.

     

RLESearch also provides utility methods that an application can call after performing a search to obtain additional details about the last search.

 

 

RLESearch Public Instance Methods

search
  $ary_ref = $rleh->search($query);
  $ary_ref = $rleh->search($query, $low_cutoff);

Searches the connected database for probable records that match $query, and returns the results as an array reference. $rleh is a handle returned by the new method of an RLESearch subclass.

The argument $query specifies a query as pairs of attributes and values and may be either a string or a hash reference. If a string, it must use Perl's syntax for initializing a hash, excluding the enclosing braces. Search converts a query string to upper case and evals it into a hash. If $query is a reference, search uses it as-is.

Pairs with binit weights less than or equal to $low_cutoff are discarded. If not specified, search uses a $low_cutoff of 0.0.

Search returns results as an array reference. Each element of the result array is a reference to an array with the following elements:

 

  [$wgt, $skey, \@wvec, \@qrec, \@srec]

where:

 

$wgt
the binit weight of the linked pair, which is the sum of the \@wvec weights.

 

$skey
the key (usually a sequence number) of the search record of the linked pair, which must be element 0 of every search record.

 

\@wvec
the binit weight vector of the linked pair, which is comprised of the weights returned by each of the comparators

 

\@qrec
a reference to the array containing the query record of the linked pair

 

\@srec
a reference to the array containing the search record of the linked pair

 

query
  $hash_ref = $rleh->query;

Returns a reference to the last query hash from the last call to RLESearch::search.

 

queryRecs
  $ary_ref = $rleh->queryRecs;

Returns a reference to an array of references to the query records from the last call to RLESearch::search.

 

searchRecs
  $ary_ref = $rleh->searchRecs;

Returns a reference to an array of references to the search records from the last call to RLESearch::search.

 

lowCutoff
  $ary_ref = $rleh->lowCutoff;

Returns the low cutoff binit weight from the last call to RLESearch::search.

 

maxWgt
  $max_wgt = $rleh->maxWgt;
  $max_wgt = $rleh->maxWgt($query);
  ($max_wgt, $max_wvec_ref) = $rleh->maxWgt;
  ($max_wgt, $max_wvec_ref) = $rleh->maxWgt($query);

Returns the maximum possible binit weight for the query specified by $query (see search), or from the last call to RLESearch::search if $query is not supplied. This is the maximum weight obtained by comparing all the query records to themselves.

Calling maxWgt in array context also returns the individual comparator weights, of which $max_wgt is the sum.

 

 

RLESearch Subclass-Provided Methods

An RLESearch subclass must implement the methods described in this section.

 

_genQueryRecs
  $ary_ref = $self->_genQueryRecs(\%query);

Returns an array reference to an array of references to the query records corresponding to the argument \%query (see search).

 

_genSearchRecs
  $ary_ref = $self->_genSearchRecs(\%query);

Returns an array reference to an array of references to the records to be searched for linkage to the argument \%query (see search).

 

 

RLESearch Subclass-Restricted Methods

The methods described in this section are intended to be called by RLESearch subclasses.

 

_new
  $self = $class->SUPER::_new;

Returns a handle to a new RLESearch instance.

 

comparator templates
  $self->_compareExact($q_index, $s_index,
                       $m_prob, $u_prob);
  $self->_compareOrdinary($q_index, $s_index,
                          $m_prob, $u_prob);
  $self->_compareSurname($q_index, $s_index,
                         $m_prob, $u_prob);
  $self->_compareGivenName($q_index, $s_index,
                           $m_prob, $u_prob);
  $self->_compareNumeric($q_index, $s_index,
                         $m_prob, $u_prob);
  $self->_compareFreq($q_index, $s_index, \%ftbl);

A subclass calls these methods to define the comparators for RLESearch to use. To compare a pair of records, RLESearch calls comparators in the order in which they were defined, saves the weight returned by each in consecutive elements of the binit weight vector for the pair, and sums the weights to obtain the binit weight for the pair.

 

_compareExact
Performs an exact comparison of two attribute values, returning the full agreement weight if they match exactly, the full disagreement weight if they do not match, and zero if either or both are null. Exact comparison is appropriate for comparing short (< 4 character) identifiers, initials, and acronyms.

 

_compareOrdinary
Performs a weighted comparison of two attribute values using Strcmp95, and adjusts the weight to a value between the full agreement weight (for an exact match) and the full disagreement weight, depending on the similarity of the values. The adjustment is done using a general-purpose curve. Returns zero if either of both attribute values are null. Ordinary comparison is appropriate for comparing names of people, streets, businesses, etc.

 

_compareSurname
Similar to _compareOrdinary, but uses an adjustment curve specially designed for comparing surnames.

 

_compareGivenName
Similar to _compareOrdinary, but uses an adjustment curve specially designed for comparing given names.

 

_compareNumeric
Similar to _compareOrdinary, but uses an adjustment curve specially designed for comparing numbers such as house numbers.

 

_compareFreq
Similar to _compareOrdinary, but uses value-specific weights provided by an instance of class AbsFreqTbl.

 

Comparator template arguments are as follows:

 

$q_index
Query record array index of the query attribute to be compared.

 

$s_index
Search record array index of the search attribute to be compared.

 

$m_prob
M probability, i.e, the probability that a linked pair agrees on this attribute.

 

$u_prob
U probability, i.e, the probability that an unlinked pair agrees on this attribute.

 

\%ftbl
A reference to the AbsFreqTbl object to be used by this comparator.

 

custom comparators
  $self->_compareAdj($q_index, $s_index, $m_prob, $u_prob,
                     \@adj_curv, $fixed, $case, $exact);
  $self->_compareFreqAdj($q_index, $s_index, \%ftbl,
                     \@adj_curv, $fixed, $case, $exact);

A subclass can use these functions to specify its own adjustment curves and weighted string comparison options.

 

\@adj_curv
A four-element array that specifies a piecewise-linear adjustment curve to apply to the string comparison weight. String comparison values above the value of $adj_curv[0] receive the full agreement values. Comparison values ($cmp_val) less than or equal to the value of $adj_curv[1] receive the weight determined by the formula:

 

  ($agrwgt - ($agrwgt-$diswgt)*(1-$cmp_val)*$adj_curv[3])

Comparison values less than or equal to $adj_curv[0] and greater than $adj_curv[1] receive the weight determined by the formula:

 

  ($agrwgt - ($agrwgt-$diswgt)*(1-$cmp_val)*$adj_curv[2])

 

fixed
When 0, increases the probability of a match when the number of matched characters is large. This option allows for a little more tolerance when the strings are large. It is not an appropriate test when comparing fixed length fields such as phone and social security numbers.

 

case
When 0, all lower case characters are converted to upper case prior to the comparison.

 

exact
When 0, counts similar characters in addition to exact comparisons.

 

_addComparator
  $self->_addComparator(\&sub);

Enables a subclass to provide its own customized comparators. RLESearch calls comparators with the argument list (\@qrec, \@srec, \@wvec). The definitions of these arguments are the same as described for RLESearch::search.

This permits the implementation of very general comparators, which can test multiple attributes of the query and search records, and by examining @wvec, determine the outcome of previous comparisons. For example, a subclass could use _compareGivenName and _compareOrdinary to compare given and middle name attrbutes, and use _addComparator to supply a comparator which checked \@wvec to see if both these comparisons failed, as indicated by negative weights for the corresponding elements. If so, the custom comparator could cross-compare the given and middle name attributes to see if they might have been swapped, and return a smaller aggreement weight if so.

 

w_strcmp
  $wgt = w_strcmp($s1, $s2, $fixed, $case, $exact);

Returns the weight calculated by NED::Strcmp95::strcmp95 on strings $s1 and $s2 after padding the shorter string with spaces to make the string lengths equal. See custom comparators above for a description of $fixed, $case, and $exact.

 

adjust_strcmp_val
  $wgt = adjust_strcmp_val($cmp_val, $agrwgt, $diswgt, @adj_curv);

Applies the adjustment curve @adj_curv to $cmp_val as described under custom comparators above.

 

 

AUTHOR

  Keith Gorlen
  Center for Information Technology
  National Institutes of Health
  Federal Building, Room 816A
  7550 Wisconsin Ave MSC 9100
  BETHESDA MD 20892-9100
  Phone: 301-496-1111, FAX: 301-594-1151
  Email: kg2d@nih.gov

 

 

SEE ALSO

 

Related Classes

  NED::RegSearch  Registration Search using RLESearch
  NED::AbsFreqTbl Absolute frequency tables for record linking
  NED::Strcmp95   Weighted string comparison

 

 

Books and Journals

  1. Gill, L. E., and Baldwin, J. A. (1987), ``Methods and technology of record linkage: some practical considerations'' in J. Baldwin, E. D. Acheson, and W. Graham (ed.) Textbook of Medical Record Linkage, Oxford: Oxford University Press, 39-54.

     

  2. Newcombe, H. B. (1988), Handbook of Record Linkage: Methods for Health and Statistical Studies, Administration, and Business, Oxford: Oxford University Press. Classic book reference. Covers some of the theory and much of the heuristics needed for good record linkage practice. Now out of print.

     

  3. Newcombe, H. B. (1987), ``Record linking: the design of efficient systems for linking records into individual and family histories'' in J. Baldwin, E. D. Acheson, and W. Graham (ed.) Textbook of Medical Record Linkage, Oxford: Oxford University Press, 15-38.

     

  4. Winkler, W. E. (1990), ``String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage'', Proceedings of the Section on Survey Research Methods, American Statistical Association, 354-359. Describes a string comparator metric that partially accounts for typographical variation in strings such as first name or surname, decision rules that utilize the string comparator, and improvements in empirical matching results.

     

  5. Winkler, W. E. (1994), ``Advanced Methods of Record Linkage,'' American Statistical Association, Proceedings of the Section of Survey Research Methods, 467-472. Describes new theory and algorithms in computer science, operations research, and statistics that were developed at the Census Bureau and used in current Census system. Extends original Jaro string comparator and gives likelihood-based methods for connecting the comparators to the main decision rule of Fellegi and Sunter. Introduces a new assignment algorithm for forcing 1-1 matching that is as fast the benchmark Burchard-Derigs algorithm and uses 1/500 as much storage; is also much faster and uses less storage than the MCF algorithm of Klingman. Gives general theory extending EM ideas of Meng and Rubin (Biometrika 1994) and shows how it is applied in estimating record linkage parameters. Gives method for estimating record linkage error rates that holds in more situations than the Belin-Rubin method, that does not require a training set as does the Belin-Rubin method, and requires an ad hoc intervention that tends to limit its application to record linkage experts.

     

  6. Winkler, W. E. (1995), ``Matching and Record Linkage,'' in B. G. Cox et al. (ed.) Business Survey Methods, New York: J. Wiley, 355-384. Survey article that gives much background about record linkage. Describes available software, list acquisition and preparation, and a large number methods for evaluating the quality of lists and the quality of matching results.

     

 

Related WWW Links

  http://www.census.gov/srd/www/reclink/reclink.html

 

 

ACKNOWLEDGEMENTS

Many thanks to William Winkler, Statistical Research Division, U. S. Bureau of the Census, for his software, help, and advice.