Paper Information - DWCLEANSER: A Framework for Approximate Duplicate Detection

DWCLEANSER: A Framework for Approximate Duplicate Detection

Data quality has become a major concern in data warehousing. The primary aim of a data warehouse is to store high-quality data so that it can effectively support decision support systems. Data quality is improved by applying data cleaning techniques, which detect and remove errors and discrepancies from data. This paper presents a novel framework for detecting exact as well as approximate duplicates in a data warehouse. The proposed approach reduces the complexity of previously designed frameworks by providing efficient data cleaning techniques. In addition, appropriate methods have been devised to handle outliers and missing values in the datasets. Moreover, comprehensive repositories are provided that will be useful for incremental data cleaning.

Payal Pahwa | Garima Thakur | Nidhi Tyagi | Manu Singh
