A Survey of Extract-Transform-Load Technology

The software processes that facilitate the original loading and the periodic refreshment of the data warehouse contents are commonly known as Extraction-Transformation-Loading (ETL) processes. This survey presents the research work in the field of ETL technology in a structured way. To this end, we organize the coverage of the field as follows: (a) we cover the conceptual and logical modeling of ETL processes, along with some design methods; (b) we visit each stage of the E-T-L triplet and examine problems that fall within each of these stages; (c) we discuss problems that pertain to the entirety of an ETL process; and (d) we review some research prototypes of academic origin.
