Paper information - Overview of Record Linkage and Current Research Directions

Overview of Record Linkage and Current Research Directions

This paper provides background on record linkage methods that can be used to combine data from a variety of sources, such as person lists and business lists. It also describes some areas of current research.

W. Winkler

[1] W. Deming,et al. On a Method of Estimating Birth and Death Rates and the Extent of Registration (Excerpt) , 1949 .

[2] Samuel B. Williams,et al. ASSOCIATION FOR COMPUTING MACHINERY , 2000 .

[3] W. Deming,et al. On the Problem of Matching Lists by Samples , 1959 .

[4] H. B. Newcombe,et al. Automatic linkage of vital records. , 1959, Science.

[5] Howard B. Newcombe,et al. Record linkage: making maximum use of the discriminating power of identifying information , 1962, CACM.

[6] H. Newcombe,et al. Methods for Computer Linkage of Hospital Admission-Separation Records into Cumulative Health Histories , 1975, Methods of Information in Medicine.

[7] D. Rubin,et al. Maximum likelihood from incomplete data via the EM algorithm plus discussions on the paper , 1977 .

[8] William S. Cooper,et al. Foundations of Probabilistic and Utility-Theoretic Indexing , 1978, JACM.

[9] Antonio Zamora,et al. Automatic spelling correction in scientific and scholarly text , 1984, CACM.

[10] Howard B. Newcombe,et al. Handbook of record linkage: methods for health and statistical studies, administration, and business , 1988 .

[11] Matthew A. Jaro,et al. Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida , 1989 .

[12] William E. Winkler,et al. Frequency-Based Matching in the Fellegi-Sunter Model of Record Linkage , 1989 .

[13] William E. Winkler. On Dykstra's Iterative Fitting Procedure , 1990 .

[14] William E. Winkler,et al. String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage. , 1990 .

[15] Post-Matching Administrative Record Linkage Between Sole Proprietorship Tax Returns and the Standard Statistical Establishment List , 1992 .

[16] Fritz Scheuren,et al. Regression Analysis of Data Files that Are Computer Matched , 1993 .

[17] W. Winkler. Improved Decision Rules in the Fellegi-Sunter Model of Record Linkage , 1993 .

[18] William E. Winkler,et al. Advanced Methods For Record Linkage , 1994 .

[19] Salvatore J. Stolfo,et al. The merge/purge problem for large databases , 1995, SIGMOD '95.

[20] D. Rubin,et al. A method for calibrating false-match rates in record linkage , 1995 .

[21] Yoav Freund,et al. Experiments with a New Boosting Algorithm , 1996, ICML.

[22] Thomas G. Dietterich. What is machine learning? , 2020, Archives of Disease in Childhood.

[23] William E. Winkler,et al. Approximate String Comparison and its Effect on an Advanced Record Linkage System , 1997 .

[24] John D. Lafferty,et al. Inducing Features of Random Fields , 1995, IEEE Trans. Pattern Anal. Mach. Intell..

[25] Vladimir Cherkassky,et al. The Nature Of Statistical Learning Theory , 1997, IEEE Trans. Neural Networks.

[26] Edward H. Porter,et al. Approximate String Comparison and its Effect , 1997 .

[27] L. Sweeney. Computational Disclosure Control for Medical Microdata , 1997 .

[28] Peter N. Yianilos,et al. Learning String-Edit Distance , 1996, IEEE Trans. Pattern Anal. Mach. Intell..

[29] William E. Winkler,et al. Re-identification Methods for Evaluating the Confidentiality of Analytically Valid Microdata , 1998 .

[30] Avi Pfeffer,et al. Probabilistic Frame-Based Systems , 1998, AAAI/IAAI.

[31] R. Tibshirani,et al. Additive Logistic Regression: a Statistical View of Boosting , 1998 .

[32] Roberto Grossi,et al. The string B-tree: a new data structure for string search in external memory and its applications , 1999, JACM.

[33] William E. Winkler,et al. The State of Record Linkage and Current Research Problems , 1999 .

[34] Yaacov Ritov,et al. Tracking Many Objects with Many Sensors , 1999, IJCAI.

[35] W. Winkler. Using the EM Algorithm for Weight Computation in the Fellegi-Sunter Model of Record Linkage , 2000 .

[36] Vladimir N. Vapnik,et al. The Nature of Statistical Learning Theory, Second Edition , 2000, Statistics for Engineering and Information Science.

[37] J. Friedman. Special Invited Paper-Additive logistic regression: A statistical view of boosting , 2000 .

[38] Andrew McCallum,et al. Efficient clustering of high-dimensional data sets with application to reference matching , 2000, KDD '00.

[39] William E. Yancey. Frequency-Dependent Probability Measures for Record Linkage , 2000 .

[40] W. Winkler. Machine Learning, Information Retrieval, and Record Linkage , 2000 .

[41] Erhard Rahm,et al. Data Cleaning: Problems and Current Approaches , 2000, IEEE Data Eng. Bull..

[42] William W. Cohen,et al. Learning to Match and Cluster Entity Names , 2001 .

[43] William E. Winkler. Quality of Very Large Databases , 2001 .

[44] Ben Taskar,et al. Probabilistic Classification and Clustering in Relational Data , 2001, IJCAI.

[45] Sunita Sarawagi,et al. Automatic segmentation of text into structured records , 2001, SIGMOD '01.

[46] Luis Gravano,et al. Approximate String Joins in a Database (Almost) for Free , 2001, VLDB.

[47] Stuart J. Russell,et al. Approximate inference for first-order probabilistic languages , 2001, IJCAI.

[48] Michael I. Jordan,et al. On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naive Bayes , 2001, NIPS.

[49] Craig A. Knoblock,et al. Learning object identification rules for information integration , 2001, Inf. Syst..

[50] D. Rubin,et al. Iterative Automated Record Linkage Using Mixture Models , 2001 .

[51] Gonzalo Navarro,et al. A guided tour to approximate string matching , 2001, CSUR.

[52] Andrew McCallum,et al. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[53] Anuradha Bhamidipaty,et al. Interactive deduplication using active learning , 2002, KDD.

[54] Stuart J. Russell,et al. Identity Uncertainty and Citation Matching , 2002, NIPS.

[55] P. Lahiri,et al. Model-Based Analysis of Records Linked Using Mixture Models , 2002 .

[56] Peter Christen,et al. Preparation of name and address data for record linkage using hidden Markov models , 2002, BMC Medical Informatics Decis. Mak..

[57] Fabrizio Sebastiani,et al. Machine learning in automated text categorization , 2001, CSUR.

[58] Peter Christen,et al. Probabilistic Name and Address Cleaning and Standardisation , 2002, AusDM.

[59] Luca De Santis,et al. Automatic Record Matching in Cooperative Information Systems , 2002 .

[60] Ben Taskar,et al. Discriminative Probabilistic Models for Relational Data , 2002, UAI.

[61] Vijay S. Iyengar,et al. Transforming data to satisfy privacy constraints , 2002, KDD.

[62] Ahmed K. Elmagarmid,et al. TAILOR: a record linkage toolbox , 2002, Proceedings 18th International Conference on Data Engineering.

[63] Simon D. Woodcock,et al. Disclosure Limitation in Longitudinal Linked Data , 2002 .

[64] Ramasamy Uthurusamy,et al. Evolving Data Mining into Solutions for Insights , 2002 .

[65] Ben Taskar,et al. Learning Probabilistic Models of Link Structure , 2003, J. Mach. Learn. Res..

[66] Erhard Rahm,et al. COMA - A System for Flexible Combination of Schema Matching Approaches , 2002, VLDB.

[67] Surajit Chaudhuri,et al. Eliminating Fuzzy Duplicates in Data Warehouses , 2002, VLDB.

[68] William E. Winkler,et al. Disclosure Risk Assessment in Perturbative Microdata Protection , 2002, Inference Control in Statistical Databases.

[69] Ben Taskar,et al. Learning on the Test Data: Leveraging Unseen Features , 2003, ICML.

[70] Lars Vilhuber,et al. The Sensitivity of Economic Statistics to Coding Errors in Personal Identifiers , 2003 .

[71] L. Sweeney,et al. Trail Re-Identification: Learning Who You Are From Where You Have Been , 2003 .

[72] Ben Taskar,et al. Link Prediction in Relational Data , 2003, NIPS.

[73] William E. Winkler. Data Cleaning Methods , 2003 .

[74] Andrew McCallum,et al. Object Consolidation by Graph Partitioning with a Conditionally-Trained Distance Metric , 2003 .

[75] Hanan Samet,et al. Index-driven similarity search in metric spaces (Survey Article) , 2003, TODS.

[76] Peter Christen,et al. A Comparison of Fast Blocking Methods for Record Linkage , 2003, KDD 2003.

[77] Hiroshi Ishikawa,et al. Exact Optimization for Markov Random Fields with Convex Priors , 2003, IEEE Trans. Pattern Anal. Mach. Intell..

[78] Pradeep Ravikumar,et al. A Comparison of String Distance Metrics for Name-Matching Tasks , 2003, IIWeb.

[79] Pradeep Ravikumar,et al. Adaptive Name Matching in Information Integration , 2003, IEEE Intell. Syst..

[80] Dale Schuurmans,et al. Learning Mixture Models with the Latent Maximum Entropy Principle , 2003, ICML.

[81] Raymond J. Mooney,et al. Adaptive duplicate detection using learnable string similarity measures , 2003, KDD '03.

[82] Chen Li,et al. Efficient record linkage in large data sets , 2003, Eighth International Conference on Database Systems for Advanced Applications, 2003. (DASFAA 2003). Proceedings..

[83] Mikhail Bilenko and Raymond J. Mooney. On Evaluation and Training-Set Construction for Duplicate Detection , 2003 .

[84] William E. Winkler,et al. Methods for evaluating and creating data quality , 2004, Inf. Syst..

[85] Jie Wei,et al. Markov Edit Distance , 2004, IEEE Trans. Pattern Anal. Mach. Intell..

[86] William W. Cohen,et al. Exploiting dictionaries in named entity extraction: combining semi-Markov extraction processes and data integration methods , 2004, KDD.

[87] Sudipto Guha,et al. Merging the Results of Approximate Match Operations , 2004, VLDB.

[88] P. Ivax,et al. A Theory for Record Linkage , 2004 .

[89] Pradeep Ravikumar,et al. Variational Chernoff Bounds for Graphical Models , 2004, UAI.

[90] Divesh Srivastava,et al. Flexible String Matching Against Large Databases in Practice , 2004, VLDB.

[91] D. Ruppert. The Elements of Statistical Learning: Data Mining, Inference, and Prediction , 2004 .

[92] Bianca Zadrozny,et al. Learning and evaluating classifiers under sample selection bias , 2004, ICML.

[93] William E. Winkler,et al. Masking and Re-identification Methods for Public-Use Microdata: Overview and Research Problems , 2004, Privacy in Statistical Databases.

[94] William E. Yancey. An Adaptive String Comparator for Record Linkage , 2004 .

[95] Ulf Brefeld,et al. Co-EM support vector learning , 2004, ICML.

[96] Pradeep Ravikumar,et al. A Hierarchical Graphical Model for Record Linkage , 2004, UAI.

[97] Vicenç Torra,et al. OWA operators in data modeling and reidentification , 2004, IEEE Transactions on Fuzzy Systems.

[98] John M. Abowd,et al. Multiply-Imputing Confidential Characteristics and File Links in Longitudinal Linked Data , 2004, Privacy in Statistical Databases.

[99] William E. Winkler,et al. Re-identification Methods for Masked Microdata , 2004, Privacy in Statistical Databases.

[100] Renato Bruni. Discrete models for data imputation , 2004, Discret. Appl. Math..

[101] Tong Zhang,et al. Text Categorization Based on Regularized Linear Classification Methods , 2001, Information Retrieval.

[102] Eugene Agichtein,et al. Mining reference tables for automatic text segmentation , 2004, KDD.

[103] Michael D. Larsen,et al. Hierarchical Bayesian Record Linkage Theory , 2005 .

[104] P. Lahiri,et al. Regression Analysis With Linked Data , 2005 .

[105] W. Winkler. Approximate String Comparator Search Strategies for Very Large Administrative Lists (Statistics #2005-02) , 2005 .

[106] Andrew McCallum,et al. A Conditional Random Field for Discriminatively-trained Finite-state String Edit Distance , 2005, UAI.

[107] Rajeev Motwani,et al. Robust identification of fuzzy duplicates , 2005, 21st International Conference on Data Engineering (ICDE'05).

[108] Andrew McCallum,et al. Joint deduplication of multiple record types in relational data , 2005, CIKM '05.

[109] Philip S. Yu,et al. An improved categorization of classifier's sensitivity on sample selection bias , 2005, Fifth IEEE International Conference on Data Mining (ICDM'05).

[110] William E. Yancey. Evaluating String Comparator Performance for Record Linkage , 2005 .

[111] Jayant Madhavan,et al. Reference reconciliation in complex information spaces , 2005, SIGMOD '05.

[112] Renato Bruni,et al. Error correction for massive datasets , 2005, Optim. Methods Softw..

[113] Zoubin Ghahramani,et al. Proceedings of the 24th international conference on Machine learning , 2007, ICML 2007.

[114] Jennifer Widom,et al. Swoosh: a generic approach to entity resolution , 2008, The VLDB Journal.