Scheduling strategies for efficient ETL execution

Extract-transform-load (ETL) workflows model the population of enterprise data warehouses with information gathered from a large variety of heterogeneous data sources. ETL workflows are complex design structures that run under strict performance requirements and their optimization is crucial for satisfying business objectives. In this paper, we deal with the problem of scheduling the execution of ETL activities (a.k.a. transformations, tasks, operations), with the goal of minimizing ETL execution time and allocated memory. We investigate the effects of four scheduling policies on different flow structures and configurations and experimentally show that the use of different scheduling policies may improve ETL performance in terms of memory consumption and execution time. First, we examine a simple, fair scheduling policy. Then, we study the pros and cons of two other policies: the first opts for emptying the largest input queue of the flow and the second for activating the operation (a.k.a. activity) with the maximum tuple consumption rate. Finally, we examine a fourth policy that combines the advantages of the latter two in synergy with flow parallelization.

[1]  Torben Bach Pedersen,et al.  RiTE: Providing On-Demand Data for Right-Time Data Warehousing , 2008, 2008 IEEE 24th International Conference on Data Engineering.

[2]  Kevin Wilkinson,et al.  QoX-driven ETL design: reducing the cost of ETL consulting engagements , 2009, SIGMOD Conference.

[3]  Panos Vassiliadis,et al.  A generic and customizable framework for the design of ETL scenarios , 2005, Inf. Syst..

[4]  Umeshwar Dayal,et al.  Benchmarking ETL Workflows , 2009, TPCTC.

[5]  Rajeev Motwani,et al.  Chain: operator scheduling for memory minimization in data stream systems , 2003, SIGMOD '03.

[6]  Jeffrey F. Naughton,et al.  Transaction Reordering and Grouping for Continuous Data Loading , 2006, BIRTE.

[7]  Kevin Wilkinson,et al.  Data integration flows for business intelligence , 2009, EDBT '09.

[8]  Panos Vassiliadis,et al.  Deciding the physical implementation of ETL workflows , 2007, DOLAP '07.

[9]  Torben Bach Pedersen,et al.  ETLMR: A Highly Scalable Dimensional ETL Framework Based on MapReduce , 2013, Trans. Large Scale Data Knowl. Centered Syst..

[10]  Michael Stonebraker,et al.  Operator Scheduling in a Data Stream Manager , 2003, VLDB.

[11]  Theodore Johnson,et al.  Scheduling Updates in a Real-Time Stream Warehouse , 2009, 2009 IEEE 25th International Conference on Data Engineering.

[12]  Goetz Graefe,et al.  Query evaluation techniques for large databases , 1993, CSUR.

[13]  Joseph M. Hellerstein,et al.  Eddies: continuously adaptive query processing , 2000, SIGMOD '00.

[14]  Wolfgang Lehner,et al.  Partition-based workload scheduling in living data warehouse environments , 2007, DOLAP '07.

[15]  Kevin Wilkinson,et al.  Optimizing ETL workflows for fault-tolerance , 2010, 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010).

[16]  Julius T. Tou,et al.  Information Systems , 1973, GI Jahrestagung.

[17]  Panos Vassiliadis,et al.  Near Real Time ETL , 2009, New Trends in Data Warehousing and Data Analysis.

[18]  Paulo Carreira,et al.  Data Mapper: An Operator for Expressing One-to-Many Data Transformations , 2005, DaWaK.

[19]  Michael J. Franklin,et al.  Dynamic Pipeline Scheduling for Improving Interactive Query Performance , 2001, VLDB.

[20]  Timos K. Sellis,et al.  State-space optimization of ETL workflows , 2005, IEEE Transactions on Knowledge and Data Engineering.

[21]  Kirk Pruhs,et al.  Algorithms and metrics for processing multiple heterogeneous continuous queries , 2008, TODS.