A Unified Framework and Sequential Data Cleaning Approach for a Data Warehouse

Summary: Data cleaning is the process of identifying and removing errors in a data warehouse. It is an essential step in the data mining process: most organizations depend on high-quality data, so data quality must be improved in the warehouse before mining begins. Existing data cleaning frameworks offer fundamental services such as attribute selection, token formation, selection of a clustering algorithm, selection of a similarity function, selection of an elimination function, and a merge function. This paper presents a new framework for data cleaning and a solution that carries out the cleaning process by applying the framework's steps in sequential order.
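The services listed in the summary can be read as stages of a pipeline applied one after another. The following Python sketch illustrates that reading only; it is not the paper's implementation. The sample records, the field names, the blocking key (first word of the token), the similarity measure (`difflib.SequenceMatcher`), the 0.85 threshold, and the "keep the longer value" merge rule are all assumptions chosen for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical sample records; field names are illustrative only.
RECORDS = [
    {"id": 1, "name": "John A. Smith", "city": "New York", "phone": "555-0101"},
    {"id": 2, "name": "Smith, John", "city": "new york", "phone": ""},
    {"id": 3, "name": "Mary Jones", "city": "Boston", "phone": "555-0199"},
]

def select_attributes(record, attributes):
    """Step 1: keep only the attributes used for duplicate detection."""
    return {a: record.get(a, "") for a in attributes}

def form_token(values):
    """Step 2: lowercase, drop punctuation and one-letter words, sort the words."""
    words = []
    for v in values.values():
        cleaned = "".join(ch.lower() if ch.isalnum() else " " for ch in v)
        words.extend(w for w in cleaned.split() if len(w) > 1)
    return " ".join(sorted(words))

def block(records_with_tokens):
    """Step 3: crude clustering; group records by the first word of the token."""
    blocks = defaultdict(list)
    for rec, token in records_with_tokens:
        blocks[token.split(" ")[0]].append((rec, token))
    return blocks

def similarity(a, b):
    """Step 4: string similarity in [0, 1] (assumed measure, not the paper's)."""
    return SequenceMatcher(None, a, b).ratio()

def merge(a, b):
    """Step 6: keep a's id; for other fields prefer the longer value (assumed rule)."""
    merged = dict(a)
    for field, value in b.items():
        if field != "id" and len(str(value)) > len(str(merged.get(field, ""))):
            merged[field] = value
    return merged

def clean(records, attributes=("name", "city"), threshold=0.85):
    """Run the stages in sequence and return the de-duplicated records."""
    tokens = [(r, form_token(select_attributes(r, attributes))) for r in records]
    result = []
    for members in block(tokens).values():
        kept = []  # (record, token) pairs that survive elimination in this block
        for rec, token in members:
            for i, (k_rec, k_token) in enumerate(kept):
                # Step 5: eliminate rec if it is similar enough to a kept record.
                if similarity(token, k_token) >= threshold:
                    kept[i] = (merge(k_rec, rec), k_token)
                    break
            else:
                kept.append((rec, token))
        result.extend(r for r, _ in kept)
    return sorted(result, key=lambda r: r["id"])

cleaned = clean(RECORDS)
```

With the sample data, records 1 and 2 produce the same token and are merged into one record, while record 3 falls into a different block and is kept as is. Comparing records only within a block keeps the number of pairwise comparisons small, at the cost of missing duplicates whose tokens start with different words.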
