Iterative Automated Record Linkage Using Mixture Models

The goal of record linkage is to link records that correspond to the same person or entity, quickly and accurately. Because certain patterns of agreement and disagreement on variables are more likely among records pertaining to a single person than among records for different people, the observed patterns for pairs of records can be viewed as arising from a mixture of matches and nonmatches. Mixture model estimates can be used to partition record pairs into two or more groups that can be labeled as probable matches (links) and probable nonmatches (nonlinks). A method is proposed that uses marginal information in the database to select mixture models, identifies sets of records for clerks to review based on the models and marginal information, incorporates clerically reviewed data into the estimates of model parameters as they become available, and classifies pairs as links, nonlinks, or in need of further clerical review. The procedure is illustrated with five datasets from the U.S. Bureau of the Census and appears to be robust to variations across record-linkage sites. Clerical review corrects the classification of some pairs directly and changes the classification of others through reestimation of the mixture models.
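The mixture idea can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of the paper: it assumes exactly two classes (matches and nonmatches), binary agree/disagree comparisons, and conditional independence of the fields within each class, and it fits the model by EM. The function names, starting values, the `reviewed` argument (pair index mapped to the clerk's decision), and the 0.1 and 0.9 posterior cutoffs are all hypothetical choices made for the example.

```python
import random

EPS = 1e-6  # keeps probabilities away from 0 and 1 (illustrative choice)


def _clamp(x):
    return min(max(x, EPS), 1.0 - EPS)


def posterior_match(pattern, p, m, u):
    """Posterior probability that a pair with this agreement pattern is a match."""
    like_m, like_u = p, 1.0 - p
    for agree, mj, uj in zip(pattern, m, u):
        like_m *= mj if agree else 1.0 - mj
        like_u *= uj if agree else 1.0 - uj
    return like_m / (like_m + like_u)


def fit_mixture(patterns, reviewed=None, n_iter=100):
    """EM for a two-class mixture; clerically reviewed pairs keep a fixed label."""
    reviewed = reviewed or {}
    n, k = len(patterns), len(patterns[0])
    # Assumed starting values: matches mostly agree, nonmatches mostly disagree.
    p, m, u = 0.1, [0.9] * k, [0.1] * k
    for _ in range(n_iter):
        # E-step: posterior match probability, overridden by clerical decisions.
        post = [
            float(reviewed[i]) if i in reviewed else posterior_match(g, p, m, u)
            for i, g in enumerate(patterns)
        ]
        # M-step: weighted agreement rates within each class.
        total = sum(post)
        p = _clamp(total / n)
        m = [
            _clamp(sum(w * g[j] for w, g in zip(post, patterns)) / max(total, EPS))
            for j in range(k)
        ]
        u = [
            _clamp(sum((1.0 - w) * g[j] for w, g in zip(post, patterns))
                   / max(n - total, EPS))
            for j in range(k)
        ]
    return p, m, u


def classify(pattern, p, m, u, lower=0.1, upper=0.9):
    """Three-way decision: link, nonlink, or send to clerical review."""
    w = posterior_match(pattern, p, m, u)
    if w >= upper:
        return "link"
    if w <= lower:
        return "nonlink"
    return "review"


def simulate_pairs(n, p, m, u, seed=0):
    """Synthetic agreement patterns for trying the sketch (not Census data)."""
    rng = random.Random(seed)
    patterns, truth = [], []
    for _ in range(n):
        is_match = rng.random() < p
        probs = m if is_match else u
        patterns.append(tuple(int(rng.random() < q) for q in probs))
        truth.append(is_match)
    return patterns, truth
```

A typical loop under these assumptions would fit the model, send the pairs classified as `"review"` to clerks, add their decisions to `reviewed`, and call `fit_mixture` again. The reestimated parameters can then change the classification of pairs that were never reviewed, which is the indirect effect the abstract describes.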
