An Extensible Framework for Data Cleaning

Data quality concerns arise when one wants to correct anomalies within a single data source (e.g., duplicate elimination in a file), or when data coming from multiple sources must be integrated into a single new data source (e.g., data warehouse construction). Three data quality problems are typically encountered: (1) the absence of universal keys across different databases, known as the object identity problem; (2) the existence of keyboard errors in the data; and (3) the presence of inconsistencies in data coming from multiple sources. Dealing with these problems is collectively called the data cleaning process. We propose a framework that models a data cleaning application as a directed graph of data transformations. Transformations are divided into four distinct classes: mapping, matching, clustering, and merging, each of which is implemented by a macro-operator. Moreover, we propose an SQL extension for specifying each of the macro-operators. One important feature of the framework is the ability to include human interaction explicitly in the process. Finally, we study performance optimizations tailored for data cleaning applications: mixed evaluation, neighborhood hash join, decision push-down, and short-circuited computation.
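The four transformation classes can be illustrated with a minimal duplicate-elimination sketch. The following Python code is an illustrative assumption, not the paper's SQL macro-operators: the function names, the string-similarity measure (`difflib.SequenceMatcher`), and the 0.8 threshold are all hypothetical choices made for the example.

```python
from difflib import SequenceMatcher

# Illustrative sketch only (not the framework's implementation): a cleaning
# pipeline applying mapping, matching, clustering and merging in sequence.

def mapping(records):
    """Mapping: normalize each record (here, trim whitespace and lowercase)."""
    return [r.strip().lower() for r in records]

def matching(records, threshold=0.8):
    """Matching: emit index pairs whose string similarity exceeds a threshold."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if SequenceMatcher(None, records[i], records[j]).ratio() >= threshold:
                pairs.append((i, j))
    return pairs

def clustering(n, pairs):
    """Clustering: group indices connected by matching pairs (union-find)."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    for i, j in pairs:
        parent[find(i)] = find(j)
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

def merging(records, clusters):
    """Merging: keep one representative per cluster (here, the longest string)."""
    return [max((records[i] for i in c), key=len) for c in clusters]

raw = ["John Smith ", "john smith", "Jane Doe"]
normalized = mapping(raw)                      # mapping step
pairs = matching(normalized)                   # matching step
clusters = clustering(len(normalized), pairs)  # clustering step
clean = merging(normalized, clusters)          # merging step
```

In the framework described above, each of these steps would instead be expressed declaratively through the proposed SQL extension, and a human could inspect and override, for instance, the pairs produced by the matching step.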