D-SPACE4Cloud: Towards Quality-Aware Data Intensive Applications in the Cloud

Recent years have witnessed a steep rise in data generation worldwide and, consequently, the widespread adoption of software solutions claiming to support data-intensive applications. Competitiveness and innovation have strongly benefited from these new platforms and methodologies, and there is a great deal of interest in the new possibilities that Big Data analytics promises to make a reality. Many companies currently engage in data-intensive processes as part of their core business; however, fully embracing the data-driven paradigm is still cumbersome, and establishing a production-ready, fine-tuned deployment is time-consuming, expensive, and resource-intensive. This situation calls for novel models and techniques to streamline the process of deployment configuration for Big Data applications. In particular, this paper focuses on the rightsizing of Cloud-deployed clusters, which represent a cost-effective alternative to on-premises installation. We propose a novel tool, integrated in a wider DevOps-inspired approach, implementing a parallel and distributed simulation-optimization technique that efficiently and effectively explores the space of alternative resource configurations, seeking the minimum-cost deployment that satisfies predefined quality of service constraints. The validity and relevance of the proposed solution have been thoroughly assessed in a vast experimental campaign including different applications and Big Data platforms.
