An Extensible Framework for Data Cleaning

Data integration solutions that handle large amounts of data have been in strong demand in recent years. Besides the traditional data integration problems (e.g., schema integration, local-to-global schema mappings), three additional data problems must be addressed: (1) the absence of universal keys across different databases, known as the object identity problem; (2) the existence of keyboard errors in the data; and (3) the presence of inconsistencies in data coming from multiple sources. Dealing with these problems is collectively called the data cleaning process. In this work, we propose a framework that offers the fundamental services required by this process: data transformation, duplicate elimination, and multi-table matching. These services are implemented using a set of purpose-built macro-operators. Moreover, we propose an SQL extension for specifying each of the macro-operators. One important feature of the framework is its ability to explicitly include human interaction in the process. The main novelty of the work is that the framework permits the following performance optimizations, which are tailored to data cleaning applications: mixed evaluation, neighborhood hash join, decision push-down, and short-circuited computation. We measure the benefits of each.
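To make the object identity and keyboard-error problems concrete, the following is a minimal sketch of approximate duplicate detection over records that share no universal key. It is not the framework's macro-operators or its SQL extension, which the abstract does not detail. The function names (`edit_distance`, `find_duplicates`), the `max_dist` threshold, and the sample records are all hypothetical, and Levenshtein distance is assumed here only as one plausible similarity measure.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def find_duplicates(names, max_dist=2):
    """Return index pairs whose normalized names are within max_dist edits.

    Hypothetical helper: compares every pair, so it is quadratic in the
    number of records and only suitable for illustration.
    """
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(names), 2):
        x, y = a.lower().strip(), b.lower().strip()
        # Cheap filter: the length difference is a lower bound on the
        # edit distance, so such pairs can be skipped without computing it.
        if abs(len(x) - len(y)) > max_dist:
            continue
        if edit_distance(x, y) <= max_dist:
            pairs.append((i, j))
    return pairs


# Hypothetical sample data with a typing error in the second record.
customers = ["John Smith", "Jonh Smith", "Mary Jones", "J. Smith"]
duplicate_pairs = find_duplicates(customers)
```

With this data, only the first two records are paired. "J. Smith" falls outside the threshold, which illustrates why an automatic matcher may still need a human decision for borderline cases.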
