Towards generating ETL processes for incremental loading

Extract, Transform, and Load (ETL) processes physically integrate data from multiple, heterogeneous sources in a central repository referred to as a data warehouse. Physically integrated data becomes stale when the source data changes; hence, periodic refreshes are required. For efficiency reasons, data warehouses are typically refreshed incrementally, i.e., changes are captured at the sources and propagated to the data warehouse on a regular basis. Dedicated ETL processes, referred to as incremental load processes, are employed to extract changes from the sources, propagate the changes, and refresh the data warehouse incrementally. During change propagation, the changes required in the data warehouse are inferred from the changes captured at the sources. Creating incremental load processes is a complex task reserved for trained ETL programmers. In this paper, we review existing Change Data Capture (CDC) techniques and discuss the limitations of the different approaches. We further review existing techniques for refreshing data warehouses. We then present an approach for generating incremental load processes from abstract schema mappings.
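The three stages named above (capture changes at the source, propagate them, refresh the warehouse) can be illustrated with a minimal sketch. It assumes snapshot comparison as the CDC technique and a purely row-wise transformation. Both are simplifying assumptions made for illustration, not the approach presented in the paper, and the names `capture_changes`, `propagate`, `refresh`, and `to_warehouse_row` are hypothetical.

```python
from typing import Callable, Dict, NamedTuple, Set

Row = Dict[str, object]
Table = Dict[int, Row]  # primary key -> row


class Delta(NamedTuple):
    inserts: Table
    updates: Table
    deletes: Set[int]


def capture_changes(old: Table, new: Table) -> Delta:
    """Compare two source snapshots and return the changes between them."""
    inserts = {k: r for k, r in new.items() if k not in old}
    updates = {k: r for k, r in new.items() if k in old and old[k] != r}
    deletes = {k for k in old if k not in new}
    return Delta(inserts, updates, deletes)


def propagate(delta: Delta, transform: Callable[[Row], Row]) -> Delta:
    """Transform only the captured changes, not the full source table."""
    return Delta(
        {k: transform(r) for k, r in delta.inserts.items()},
        {k: transform(r) for k, r in delta.updates.items()},
        set(delta.deletes),
    )


def refresh(warehouse: Table, delta: Delta) -> None:
    """Apply the propagated changes to the warehouse table in place."""
    for k in delta.deletes:
        warehouse.pop(k, None)
    warehouse.update(delta.inserts)
    warehouse.update(delta.updates)


def to_warehouse_row(row: Row) -> Row:
    """Hypothetical row-wise transformation used only for this example."""
    return {"name": str(row["name"]).upper(), "total": row["qty"] * row["price"]}


# Two snapshots of a hypothetical source table, keyed by primary key.
old_snapshot = {
    1: {"name": "bolt", "qty": 2, "price": 3},
    2: {"name": "nut", "qty": 5, "price": 1},
}
new_snapshot = {
    1: {"name": "bolt", "qty": 4, "price": 3},   # updated
    3: {"name": "gear", "qty": 1, "price": 10},  # inserted; key 2 was deleted
}

# Initial (full) load, followed by one incremental load.
warehouse = {k: to_warehouse_row(r) for k, r in old_snapshot.items()}
delta = capture_changes(old_snapshot, new_snapshot)
refresh(warehouse, propagate(delta, to_warehouse_row))
```

After the incremental load, the warehouse holds the same rows a full reload from the new snapshot would produce. The sketch works only because each warehouse row depends on exactly one source row; transformations involving joins or aggregation need more elaborate change propagation.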
