Potter's Wheel: An Interactive Framework for Data Transformation and Cleaning

An important step in data warehousing and Enterprise Data Integration is cleaning data of discrepancies in structure and content. Current commercial solutions for data cleaning involve many iterations of time-consuming “auditing” to find errors, and long-running transformations to fix them. Users must endure long waits and often write complex transformation programs. In this paper, we present an interactive data cleaning system that tightly integrates transformation and discrepancy detection. Users gradually build transformations by adding or undoing transforms, in an intuitive, graphical manner through a spreadsheet-like interface; the effect of a transform is shown at once on records visible on screen. In the background, the system automatically infers the structure of the data in terms of user-defined domains and applies suitable algorithms to check the data for discrepancies, flagging them as they are found. This allows users to gradually construct a transformation as discrepancies are found, and clean the data without writing complex programs or enduring long delays. We choose and adapt a small set of transforms from the literature and describe methods for their graphical specification and interactive application. We combine the Minimum Description Length principle with the traditional database notion of user-defined types to automatically extract suitable structures for data values, in an extensible fashion. Such structure extraction is also applied in the graphical specification of transforms, to infer transforms from examples. We also describe methods for optimizing the final sequence of transforms to minimize memory allocations
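To make the idea of combining the Minimum Description Length principle with user-defined domains more concrete, the following is a minimal sketch and not the system's actual algorithm. It assumes each column is described by a single domain, and that a domain is a hypothetical `Domain` record holding a regular expression and an alphabet size. The cost model (one flag bit per value, plus per-character bits) and the names `description_length`, `infer_domain`, and `flag_discrepancies` are illustrative assumptions; the cost of describing the structure itself is ignored for brevity.

```python
import math
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Domain:
    """Hypothetical user-defined domain (illustrative, not from the paper)."""
    name: str
    pattern: str        # regex that a value must fully match
    alphabet_size: int  # number of distinct characters the domain permits


RAW_ALPHABET = 95  # assumption: exceptions are encoded as printable ASCII

DOMAINS = [
    Domain("Integer", r"[0-9]+", 10),
    Domain("Word", r"[A-Za-z]+", 52),
    Domain("AnyText", r".*", RAW_ALPHABET),
]


def description_length(values, domain):
    """Approximate number of bits needed to encode `values` using `domain`.

    Matching values cost log2(alphabet_size) bits per character; values
    that do not match are stored raw. One extra bit per value records
    which of the two encodings was used.
    """
    total = 0.0
    for value in values:
        total += 1.0  # flag bit: matches the domain, or is an exception
        if re.fullmatch(domain.pattern, value):
            total += len(value) * math.log2(domain.alphabet_size)
        else:
            total += len(value) * math.log2(RAW_ALPHABET)
    return total


def infer_domain(values):
    """Pick the candidate domain with the shortest description length."""
    return min(DOMAINS, key=lambda d: description_length(values, d))


def flag_discrepancies(values, domain):
    """Return the values that do not fit the inferred domain."""
    return [v for v in values if not re.fullmatch(domain.pattern, v)]


if __name__ == "__main__":
    column = ["1024", "2048", "4096", "40x6", "8192"]
    best = infer_domain(column)
    print(best.name)                         # Integer
    print(flag_discrepancies(column, best))  # ['40x6']
```

In this sketch a restrictive domain wins only when most values fit it, since every exception is charged at the raw rate. That trade-off is what lets a single malformed value be flagged as a discrepancy rather than forcing the whole column into a catch-all type.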
