Paper information - HADCLEAN: A hybrid approach to data cleaning in data warehouses

Data cleaning is an important part of the data warehouse management process. It is not an easy task, because many different types of unclean data (bad data, incomplete data, typos, etc.) can be present. Moreover, whether a piece of data is clean or dirty depends heavily on the nature and source of the raw data. Many attempts have been made to clean data using blocking algorithms, phonetic algorithms, and similar techniques. This paper proposes HADCLEAN, a hybrid approach to data cleaning that combines modified versions of the PNRS and transitive closure algorithms.
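The abstract does not spell out how the two ingredients fit together, so the following is only a minimal sketch of the general idea, not the authors' modified algorithms. It assumes that near-miss matching can be approximated by a plain edit-distance threshold and that transitive closure can be computed with a union-find structure. The names `edit_distance`, `group_records`, and `max_distance` are hypothetical and chosen for illustration.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def group_records(values, max_distance=1):
    """Group values whose pairwise matches are linked by transitive closure.

    Assumption: two values "match" when their case-insensitive edit
    distance is at most max_distance. This stands in for a near-miss
    strategy and is not the matching rule used in the paper.
    """
    parent = list(range(len(values)))

    def find(i):
        # Follow parent links to the set representative, halving the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair; a blocking step would normally prune these pairs.
    for i, j in combinations(range(len(values)), 2):
        if edit_distance(values[i].lower(), values[j].lower()) <= max_distance:
            parent[find(j)] = find(i)  # merge the two sets

    groups = {}
    for i, value in enumerate(values):
        groups.setdefault(find(i), []).append(value)
    return list(groups.values())
```

Calling `group_records(["Jon", "John", "Johnn", "Mary", "Marry", "Alice"])` places "Jon" and "Johnn" in the same group even though they are two edits apart, because each is within one edit of "John". That chaining is what the transitive closure step contributes. The all-pairs loop is quadratic, which is the cost that blocking techniques aim to reduce.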
