Overview of Record Linkage and Current Research Directions

This paper provides background on record linkage methods that can be used to combine data from a variety of sources, such as person lists and business lists. It also outlines some areas of current research.
