Methods for evaluating and creating data quality

This paper surveys two classes of methods for determining and improving the quality of individual files or groups of files. The first class consists of edit/imputation methods, which enforce business rules and impute values for missing data. The second consists of data-cleaning methods for finding duplicates within or across files.
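
As a rough illustration of the two classes, the sketch below applies one edit (a business rule), a simple imputation, and a naive duplicate search to a toy file. Everything in it is an assumption made for illustration and is not taken from the paper: the records, the field names, the rule that a person under 15 may not be recorded as married, median imputation, and the use of Python's `difflib.SequenceMatcher` as a stand-in string comparator with a 0.85 threshold.

```python
from difflib import SequenceMatcher
from statistics import median

# Hypothetical records; field names and values are illustrative only.
records = [
    {"id": 1, "name": "John Smith", "age": 34, "marital": "married"},
    {"id": 2, "name": "Jon Smith", "age": 34, "marital": "married"},
    {"id": 3, "name": "Mary Jones", "age": 12, "marital": "married"},
    {"id": 4, "name": "Alan Brown", "age": None, "marital": "single"},
]


def fails_edit(rec):
    """Assumed edit (business rule): nobody under 15 may be recorded as married."""
    return (
        rec["age"] is not None
        and rec["age"] < 15
        and rec["marital"] == "married"
    )


def impute_missing_age(recs):
    """Fill missing ages with the median observed age (a deliberately simple imputation)."""
    observed = [r["age"] for r in recs if r["age"] is not None]
    fill = median(observed)
    return [dict(r, age=fill) if r["age"] is None else dict(r) for r in recs]


def candidate_duplicates(recs, threshold=0.85):
    """Compare every pair of names and keep pairs whose similarity meets the threshold."""
    pairs = []
    for i, a in enumerate(recs):
        for b in recs[i + 1:]:
            score = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
            if score >= threshold:
                pairs.append((a["id"], b["id"]))
    return pairs


edit_failures = [r["id"] for r in records if fails_edit(r)]
imputed = impute_missing_age(records)
duplicates = candidate_duplicates(records)
```

The all-pairs comparison is quadratic in the number of records, so it is only workable for very small files; it is meant to show the shape of the problem rather than a practical matching procedure.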
