Data Deduplication Techniques and Analysis

Data warehouses are repositories of data collected from several sources and form the backbone of most decision support applications. Because the sources are independent, they may adopt independent and potentially inconsistent conventions. In data warehousing applications, during ETL (Extraction, Transformation and Loading), or even in OLTP (On-Line Transaction Processing) applications, we often encounter duplicate records in a table. Moreover, data entry mistakes at any of these sources introduce further errors. Since high-quality data is essential for gaining the confidence of users of decision support applications, ensuring high data quality is critical to the success of data warehouse implementations. Consequently, significant amounts of time and money are spent on detecting and correcting errors and inconsistencies; this process is often referred to as data cleaning. To make table data consistent and accurate, we need to remove these duplicate records. In this paper we discuss different deduplication strategies along with their pros and cons, as well as some of the methods used to prevent duplication in a database. In addition, we present a performance evaluation on Microsoft SQL Server 2008 using the Food Mart and AdventureDB warehouses.
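As a brief illustration of the simplest case, exact duplicates, consider the following T-SQL sketch. It is not the paper's own method, merely one common SQL Server 2008 pattern: the ROW_NUMBER() window function numbers the rows within each group of identical (name, address, phone) values, and every row after the first in a group is deleted. The table dbo.customer and its columns are hypothetical names chosen for illustration, not taken from the evaluated warehouses.

    -- Remove exact duplicates, keeping the lowest customer_id in each group.
    -- Table and column names are hypothetical.
    WITH ranked AS (
        SELECT ROW_NUMBER() OVER (
                   PARTITION BY name, address, phone  -- columns that define a "duplicate"
                   ORDER BY customer_id               -- which copy survives
               ) AS rn
        FROM dbo.customer
    )
    DELETE FROM ranked
    WHERE rn > 1;  -- delete all but the first row of each duplicate group

Approximate duplicates arising from inconsistent conventions or data entry mistakes cannot be caught by such exact matching; they require the fuzzier detection strategies compared later in the paper.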