Data Cleaning: Problems and Current Approaches

We classify data quality problems that are addressed by data cleaning and provide an overview of the main solution approaches. Data cleaning is especially required when integrating heterogeneous data sources and should be addressed together with schema-related data transformations. In data warehouses, data cleaning is a major part of the extract-transform-load (ETL) process. We also discuss current tool support for data cleaning.
