Mix ‘n’ match multi-engine analytics

Current platforms fail to cope efficiently with the data and task heterogeneity of modern analytics workflows because they adhere to a single data and/or compute model. As a remedy, we present IReS, the Intelligent Resource Scheduler for complex analytics workflows executed over multi-engine environments. IReS optimizes a workflow with respect to a user-defined policy, relying on cost and performance models of the required tasks over the available platforms. This optimization consists of allocating distinct workflow parts to the most advantageous execution and/or storage engine among those available and deciding on the exact amount of resources to provision. Our current prototype supports 5 compute and 3 data engines, yet new ones can easily be added to IReS by virtue of its engine-agnostic mechanisms. Our extensive experimental evaluation confirms that IReS speeds up diverse and realistic workflows by up to 30% compared to their optimal single-engine plan by automatically scattering parts of them to different execution engines and datastores. Its optimizer adds only marginal overhead to workflow execution, discovering the optimal execution plan within a few seconds, even for large-scale workflow instances.
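To make the idea of allocating workflow parts to engines concrete, the following is a minimal sketch rather than the IReS planner itself. It assumes a linear three-step workflow, a policy of minimizing total execution time, and a flat penalty for moving intermediate data between engines. The step names, engine names, cost figures, and the `plan_chain` function are all hypothetical and chosen only for illustration.

```python
# Hypothetical cost estimates (seconds) for each workflow step on each engine.
# In a real system these would come from cost and performance models.
EXEC_COST = {
    "tokenize": {"spark": 40.0, "hadoop": 55.0, "python": 90.0},
    "tfidf": {"spark": 35.0, "hadoop": 60.0, "python": 10.0},
    "kmeans": {"spark": 25.0, "hadoop": 80.0, "python": 70.0},
}

# Hypothetical flat penalty for moving intermediate data between two engines.
MOVE_COST = 10.0


def plan_chain(steps, exec_cost, move_cost):
    """Pick one engine per step of a linear workflow, minimizing total time.

    Dynamic programming: best[e] holds the cheapest (cost, plan) among all
    plans that run the most recently processed step on engine e.
    """
    best = {e: (c, [e]) for e, c in exec_cost[steps[0]].items()}
    for step in steps[1:]:
        nxt = {}
        for engine, cost in exec_cost[step].items():
            # Extend every partial plan; pay move_cost only on an engine switch.
            candidates = (
                (prev_cost + (0.0 if prev == engine else move_cost) + cost,
                 plan + [engine])
                for prev, (prev_cost, plan) in best.items()
            )
            nxt[engine] = min(candidates, key=lambda t: t[0])
        best = nxt
    return min(best.values(), key=lambda t: t[0])


if __name__ == "__main__":
    total, plan = plan_chain(["tokenize", "tfidf", "kmeans"], EXEC_COST, MOVE_COST)
    print(total, plan)
```

With these made-up numbers, the cheapest single-engine plan runs everything on one engine for 100 seconds, while the mixed plan moves the middle step to a different engine and finishes in 95 seconds despite two data transfers. A different policy (for example, monetary cost) would only change the numbers fed into the same search.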
