On Optimal Budget-Driven Scheduling Algorithms for MapReduce Jobs in the Heterogeneous Cloud

In this paper, we consider task-level scheduling algorithms with respect to budget and deadline constraints for a bag of MapReduce jobs on a set of provisioned heterogeneous (virtual) machines in cloud platforms. Heterogeneity is manifested in the "pay-as-you-go" charging model we use, where service machines with different performance have different service rates. We organize the bag of jobs as a κ-stage workflow and achieve, for specific optimization goals, the following results. First, given a total monetary budget B_j for a particular stage j, we propose a greedy algorithm for distributing the budget, with minimal stage execution time as our goal. Based on the structure of this problem, we further prove the optimality of our algorithm in terms of the budget used and the execution time achieved. We then combine this algorithm with dynamic programming techniques to propose an optimal scheduling algorithm that obtains a minimum scheduling length in O(κB). The algorithm is efficient if the total budget B is polynomially bounded by the number of tasks in the MapReduce jobs, which is usually the case in practice. Second, we consider the dual of this optimization problem to minimize the cost when the (time) deadline of the computation D is fixed. We convert this problem into the standard multiple-choice knapsack problem via a parallel transformation. Our empirical studies verify the proposed optimal algorithms.

Keywords: heterogeneous clouds, MapReduce optimization, optimal Hadoop scheduling algorithm, budget constraints
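To make the two-level structure described above concrete, the following is a minimal sketch and not the algorithm of the paper. It assumes that every task offers a list of hypothetical (time, price) options, one per machine type, that the tasks of a stage run in parallel so the stage time is the time of its slowest task, that stages run one after another, and that budgets are small non-negative integers. The names `pareto`, `stage_time`, and `min_schedule_length` are illustrative. The greedy step starts each task on its cheapest machine and keeps upgrading the currently slowest task; the dynamic program then tries every split of the total budget between the current stage and the earlier ones. This naive recurrence is simpler than, and not as fast as, the bound stated above.

```python
from typing import List, Optional, Tuple

Option = Tuple[int, int]   # (execution time, price); hypothetical integer units
Task = List[Option]        # one option per machine type
Stage = List[Task]         # tasks of a stage run in parallel (assumption)


def pareto(options: Task) -> Task:
    """Keep only useful options: price ascending, time strictly descending."""
    kept: Task = []
    for time, price in sorted(options, key=lambda o: (o[1], o[0])):
        if not kept or time < kept[-1][0]:
            kept.append((time, price))
    return kept


def stage_time(stage: Stage, budget: int) -> Optional[int]:
    """Greedy budget distribution for one non-empty stage.

    Returns the smallest achievable stage time, or None if even the
    cheapest assignment exceeds the budget.
    """
    opts = [pareto(task) for task in stage]
    choice = [0] * len(opts)                 # index of the chosen option per task
    spent = sum(o[0][1] for o in opts)       # every task on its cheapest machine
    if spent > budget:
        return None
    while True:
        # The stage time can only drop if the slowest task gets faster.
        i = max(range(len(opts)), key=lambda k: opts[k][choice[k]][0])
        if choice[i] + 1 == len(opts[i]):
            break                            # slowest task is already on its fastest machine
        extra = opts[i][choice[i] + 1][1] - opts[i][choice[i]][1]
        if spent + extra > budget:
            break                            # cannot afford the next upgrade
        choice[i] += 1
        spent += extra
    return max(opts[k][choice[k]][0] for k in range(len(opts)))


def min_schedule_length(stages: List[Stage], total_budget: int) -> Optional[int]:
    """Dynamic program over stages: best[b] is the minimum total time of the
    stages processed so far when at most b budget units are spent on them."""
    inf = float("inf")
    best = [0] * (total_budget + 1)          # zero stages cost nothing
    for stage in stages:
        per_budget = [stage_time(stage, b) for b in range(total_budget + 1)]
        new = [inf] * (total_budget + 1)
        for b in range(total_budget + 1):
            for share in range(b + 1):       # budget given to the current stage
                t = per_budget[share]
                if t is not None and best[b - share] + t < new[b]:
                    new[b] = best[b - share] + t
        best = new
    return None if best[total_budget] == inf else best[total_budget]


# Illustrative data: two stages, each machine type given as (time, price).
example = [
    [[(10, 1), (4, 3)], [(10, 1), (4, 3)]],  # stage 1: two parallel tasks
    [[(8, 1), (2, 4)]],                      # stage 2: one task
]
```

Calling `min_schedule_length(example, 6)` on this illustrative data spends the spare budget on the second stage, because upgrading only one of the two parallel tasks in the first stage would not shorten that stage.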
