QoX-driven ETL design: reducing the cost of ETL consulting engagements

As business intelligence becomes increasingly essential for organizations and evolves from strategic to operational use, the complexity of Extract-Transform-Load (ETL) processes grows. As a consequence, ETL engagements have become very time-consuming, labor-intensive, and costly. At the same time, requirements beyond functionality and performance need to be considered in the design of ETL processes. In particular, design quality needs to be determined by an intricate combination of metrics such as reliability, maintainability, scalability, and others. Unfortunately, there are no methodologies, modeling languages, or tools that support ETL design in a systematic, formal way for achieving these quality requirements. Current practice handles them with ad hoc approaches based only on designers' experience. This results either in poor designs that do not meet the quality objectives or in costly engagements that require several iterations to meet them. A fundamental shift that uses automation in the ETL design task is the only way to reduce the cost of these engagements while obtaining optimal designs. Toward this goal, we present a novel approach to ETL design that incorporates a suite of quality metrics, termed QoX, at all stages of the design process. We discuss the challenges and tradeoffs among QoX metrics and illustrate their impact on alternative designs.
