Paper Information - Revisiting ETL Benchmarking: The Case for Hybrid Flows

Modern business intelligence systems integrate a variety of data sources using multiple data execution engines. A common example is using Hadoop to analyze unstructured text and then merging the results with relational database queries over a data warehouse. These analytic data flows are generalizations of ETL flows. We refer to multi-engine data flows as hybrid flows. In this paper, we present our benchmark infrastructure for hybrid flows and illustrate its use with an example hybrid flow. We then present a collection of parameters to describe hybrid flows. Such parameters are needed to define and run a hybrid flow benchmark. An inherent difficulty in benchmarking ETL flows is the diversity of operators offered by ETL engines. However, all engines have extract and load operations in common, and these operations rely on data and function shipping. We propose that, by focusing on these two operations for hybrid flows, it may be feasible to revisit the ETL benchmark effort and thus enable comparison of flows for modern business intelligence applications. We believe our framework may be a useful step toward an industry-standard benchmark for ETL flows.

Kevin Wilkinson | Alkis Simitsis
