Dependency-Aware and Resource-Efficient Scheduling for Heterogeneous Jobs in Clouds

Data analytics frameworks are shifting towards larger degrees of parallelism. Efficient scheduling of data-parallel jobs (tasks) is critical for improving job performance, such as response time, and resource utilization. This is an important challenge for large-scale data analytics frameworks, in which jobs are more complex and have diverse characteristics (e.g., diverse resource requirements). Prior scheduling work cannot achieve low response time and high resource utilization simultaneously: because it relies on sampling-based approaches (including sampling with late binding) for task placement, it cannot accurately estimate the durations of tasks in the queue of a worker machine, and thus fails to place tasks on the best possible worker machine. It also does not sufficiently consider the diverse resource requirements of jobs (tasks) when placing tasks on worker machines. To address this challenge, we propose Dependency-aware and Resource-efficient Scheduling (DRS), which aims to achieve low response time and high resource utilization. DRS takes task dependency into account and assigns tasks that are independent of each other to different worker machines. DRS also considers tasks' resource requirements and packs together tasks whose demands on multiple resources are complementary, in order to increase resource utilization. In addition, DRS uses mutual reinforcement learning to estimate a task's waiting time (the duration of tasks in the queue of a worker) and takes this waiting time into account when assigning tasks to workers, in order to reduce response time. Extensive experiments on a real cluster and on the real-world Amazon EC2 cloud service show that DRS achieves lower response time and higher resource utilization than previous strategies.
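The packing and waiting-time ideas above can be illustrated with a minimal sketch. It is not the DRS algorithm itself: the `Task` and `Worker` types, the two-resource model (CPU and memory as fractions of a worker's capacity), the imbalance-based score, and the `wait_weight` parameter are all assumptions introduced here for illustration. The estimated waiting time is taken as a given input rather than learned, and task dependencies are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    cpu: float  # demanded fraction of a worker's CPU, in [0, 1]
    mem: float  # demanded fraction of a worker's memory, in [0, 1]


@dataclass
class Worker:
    name: str
    est_wait: float        # estimated queue waiting time in seconds (assumed given)
    cpu_used: float = 0.0
    mem_used: float = 0.0
    tasks: list = field(default_factory=list)

    def fits(self, task: Task) -> bool:
        """True if the task's demands fit in the remaining capacity."""
        return (self.cpu_used + task.cpu <= 1.0
                and self.mem_used + task.mem <= 1.0)

    def imbalance_after(self, task: Task) -> float:
        """Gap between CPU and memory usage if the task were placed here.

        A small gap means the task's demands complement what is already
        running on this worker (hypothetical measure, not from the paper).
        """
        return abs((self.cpu_used + task.cpu) - (self.mem_used + task.mem))


def place(task: Task, workers: list, wait_weight: float = 0.01):
    """Place the task on the feasible worker with the lowest combined score.

    score = resource imbalance after placement + wait_weight * estimated wait.
    Returns the chosen worker, or None if the task fits nowhere.
    """
    candidates = [w for w in workers if w.fits(task)]
    if not candidates:
        return None
    best = min(
        candidates,
        key=lambda w: w.imbalance_after(task) + wait_weight * w.est_wait,
    )
    best.cpu_used += task.cpu
    best.mem_used += task.mem
    best.tasks.append(task.name)
    return best
```

With equal waiting times, a memory-heavy task is steered to a worker that already runs a CPU-heavy task, since the two demands complement each other. A long estimated wait on that worker can outweigh the packing benefit and send the task elsewhere; `wait_weight` sets that trade-off.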
