DWCLEANSER: A Framework for Approximate Duplicate Detection

Data quality has become a major concern in data warehousing. The primary aim of a data warehouse is to store high-quality data so that it can support decision support systems effectively. Data quality is improved by applying data cleaning techniques, which detect and remove errors and discrepancies from data. This paper presents a novel framework for detecting exact as well as approximate duplicates in a data warehouse. The proposed approach reduces the complexity of previously designed frameworks by providing efficient data cleaning techniques. In addition, appropriate methods have been devised to handle outliers and missing values in the datasets. Moreover, comprehensive repositories are provided to support incremental data cleaning.
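
To make the distinction between exact and approximate duplicates concrete, the following minimal Python sketch compares records pairwise. It is an illustration only and is not the DWCLEANSER algorithm: the function names (`normalize`, `similarity`, `find_duplicates`), the use of `difflib.SequenceMatcher` as the similarity measure, the 0.85 threshold, and the sample records are all assumptions made for this example. It also assumes every record has the same non-zero number of fields.

```python
from difflib import SequenceMatcher


def normalize(record):
    # Lowercase each field and collapse runs of whitespace (illustrative rule).
    return tuple(" ".join(str(field).lower().split()) for field in record)


def similarity(a, b):
    # Mean per-field similarity ratio in [0, 1]; assumes equal, non-zero field counts.
    scores = [SequenceMatcher(None, x, y).ratio() for x, y in zip(a, b)]
    return sum(scores) / len(scores)


def find_duplicates(records, threshold=0.85):
    # Naive pairwise comparison, O(n^2); the threshold is a hypothetical choice.
    exact, approximate = [], []
    normalized = [normalize(r) for r in records]
    for i in range(len(normalized)):
        for j in range(i + 1, len(normalized)):
            if normalized[i] == normalized[j]:
                exact.append((i, j))
            elif similarity(normalized[i], normalized[j]) >= threshold:
                approximate.append((i, j))
    return exact, approximate


# Hypothetical sample records: (name, address).
records = [
    ("John Smith", "12 Park Road"),
    ("john  smith", "12 park road"),
    ("Jon Smith", "12 Park Rd"),
    ("Alice Brown", "7 Hill Street"),
]
exact, approximate = find_duplicates(records)
print("exact:", exact)
print("approximate:", approximate)
```

Here records 0 and 1 become identical after normalization and are reported as exact duplicates, while record 2 differs by a misspelling and an abbreviation and is reported as an approximate duplicate of both. The pairwise loop is quadratic in the number of records, which is the kind of cost a warehouse-scale framework would need to reduce.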