Data Cleansing: A Prelude to Knowledge Discovery

This chapter analyzes the problem of data cleansing and the identification of potential errors in data sets. It surveys the differing views of data cleansing and gives a brief overview of existing data cleansing tools. It then presents a general framework for the data cleansing process, together with a set of general methods that can be used to address the problem: statistical outlier detection, pattern matching, clustering, and data mining techniques. Experimental results from applying these methods to a real-world data set are also reported. Finally, the chapter discusses the research directions needed to further address the data cleansing problem.
