Workflow Optimization in PAW

Many industrial applications, in domains such as telecommunications, web, and sales, require complex analytics to be performed across several data processing systems. Such analytics are usually expressed as workflows, and carrying them out is both labor-intensive and time-consuming. At the same time, as the volume of data to be analysed grows, optimizing analytics workflows becomes crucial for meeting business objectives. This paper focuses on optimizing workflows for time efficiency over multiple execution engines, such as a traditional DBMS, a MapReduce engine, and a scripting engine. This configuration is emerging as a common paradigm for combining the analysis of unstructured and structured data. We propose a novel optimization technique as part of our system, PAW (Platform for Analytics Workflows). The technique creates alternative workflow structures and their execution plans based on equivalent combinations and orderings of operators. It employs an exhaustive algorithm and a heuristic algorithm to efficiently search the space of equivalent workflow structures and select the one with the optimal execution plan. We present a thorough experimental study that showcases the efficiency of the proposed technique in a fully fledged multi-engine system, applied to three real-world applications and their data, as well as to a synthetic benchmark.
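To make the idea of searching equivalent workflow structures concrete, the following is a minimal sketch and not PAW's actual algorithm. It assumes three hypothetical operators that can be freely reordered, a toy cost model (per-row cost on each engine plus a per-row transfer cost when consecutive operators run on different engines), and made-up selectivities and costs. The exhaustive search enumerates every operator order and engine assignment; the heuristic applies the most selective operators first and picks the locally cheapest engine at each step, which is not guaranteed to find the optimum in general.

```python
from itertools import permutations, product

# Hypothetical operators: name -> (selectivity, per-row cost on each engine).
# All names and numbers are illustrative assumptions, not taken from the paper.
OPERATORS = {
    "filter_region": (0.5, {"dbms": 1.0, "mapreduce": 3.0}),
    "filter_date": (0.1, {"dbms": 1.0, "mapreduce": 3.0}),
    "extract_text": (1.0, {"dbms": 10.0, "mapreduce": 2.0}),
}
ENGINES = ("dbms", "mapreduce")
TRANSFER_COST_PER_ROW = 0.5  # assumed cost of moving one row between engines
INPUT_ROWS = 1000.0


def step_cost(rows, op, engine, prev_engine):
    """Cost of running `op` on `engine` over `rows` rows, including transfer."""
    cost = rows * OPERATORS[op][1][engine]
    if prev_engine is not None and engine != prev_engine:
        cost += rows * TRANSFER_COST_PER_ROW
    return cost


def plan_cost(order, engines):
    """Total estimated cost of one execution plan (operator order + engines)."""
    rows, total, prev = INPUT_ROWS, 0.0, None
    for op, engine in zip(order, engines):
        total += step_cost(rows, op, engine, prev)
        rows *= OPERATORS[op][0]  # the operator's selectivity shrinks the data
        prev = engine
    return total


def exhaustive_search():
    """Enumerate every operator order and engine assignment; keep the cheapest."""
    return min(
        (plan_cost(order, engines), order, engines)
        for order in permutations(OPERATORS)
        for engines in product(ENGINES, repeat=len(OPERATORS))
    )


def heuristic_search():
    """Most selective operators first, then the locally cheapest engine per step."""
    order = tuple(sorted(OPERATORS, key=lambda op: OPERATORS[op][0]))
    rows, prev, engines = INPUT_ROWS, None, []
    for op in order:
        engine = min(ENGINES, key=lambda e: step_cost(rows, op, e, prev))
        engines.append(engine)
        rows *= OPERATORS[op][0]
        prev = engine
    return plan_cost(order, engines), order, tuple(engines)


if __name__ == "__main__":
    print("exhaustive:", exhaustive_search())
    print("heuristic: ", heuristic_search())
```

In this toy instance both searches pick the same plan: the two filters run on the DBMS and the text extraction runs on MapReduce. The exhaustive search evaluates 3! × 2³ = 48 plans, whereas the heuristic makes one pass, which illustrates why a heuristic becomes attractive as the number of operators and engines grows.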
