A Pareto-based scheduler for exploring cost-performance trade-offs for MapReduce workloads

In recent years, we have observed an increased demand for processing large amounts of data. The MapReduce programming model has been adopted by major computing companies and integrated into novel cyber-physical systems (CPS) to perform large-scale data processing. However, efficiently scheduling MapReduce workloads in cluster environments such as Amazon’s EC2 is challenging because of the trade-off between performance and the corresponding monetary cost. The problem is exacerbated by the fact that cloud providers tend to charge users based on their I/O operations, which can dramatically increase the required budget. In this paper, we describe our approach for scheduling MapReduce workloads in cluster environments, taking the performance/budget trade-off into consideration. Our approach makes the following contributions: (i) we propose a novel Pareto-based scheduler that identifies near-optimal resource allocations for user workloads with respect to performance and monetary cost, and (ii) we develop an automatic configuration of basic task parameters that allows us to further reduce the user’s spending and the jobs’ execution times. Our detailed experimental evaluation, using both real and synthetic datasets, illustrates that our approach improves the performance of the workloads by as much as 50% compared to its competitors.
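
To make the notion of a Pareto-based choice concrete, the following is a minimal sketch of Pareto filtering over two objectives, execution time and monetary cost. It is not the scheduler described in this paper: the `Allocation` type, the function names, and all candidate figures are hypothetical and chosen only for illustration. An allocation is kept if no other allocation is at least as good on both objectives and strictly better on one.

```python
from typing import List, NamedTuple


class Allocation(NamedTuple):
    """Hypothetical candidate allocation with its estimated objectives."""
    name: str
    exec_time_s: float  # estimated job execution time, in seconds
    cost_usd: float     # estimated monetary cost, in US dollars


def dominates(a: Allocation, b: Allocation) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = a.exec_time_s <= b.exec_time_s and a.cost_usd <= b.cost_usd
    strictly_better = a.exec_time_s < b.exec_time_s or a.cost_usd < b.cost_usd
    return no_worse and strictly_better


def pareto_front(candidates: List[Allocation]) -> List[Allocation]:
    """Return the candidates that no other candidate dominates (O(n^2) scan)."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Illustrative, made-up estimates; real values would come from a performance model.
candidates = [
    Allocation("4 small VMs", 1200.0, 0.40),
    Allocation("8 small VMs", 700.0, 0.55),
    Allocation("4 large VMs", 650.0, 0.90),
    Allocation("8 large VMs", 400.0, 1.60),
    Allocation("6 medium VMs", 800.0, 0.95),  # slower and pricier than "8 small VMs"
]

front = pareto_front(candidates)
for alloc in front:
    print(f"{alloc.name}: {alloc.exec_time_s:.0f} s, ${alloc.cost_usd:.2f}")
```

In this sketch the "6 medium VMs" candidate is discarded because another candidate is both faster and cheaper; the remaining four represent different points on the cost-performance trade-off from which a user or scheduler could pick according to budget or deadline.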
