Multi-resource packing for cluster schedulers

Tasks in modern data-parallel clusters have highly diverse resource requirements along CPU, memory, disk, and network. Any of these resources may become a bottleneck, so the likelihood of wasting resources due to fragmentation is now larger. Today's schedulers do not explicitly reduce fragmentation. Worse, since they allocate only cores and memory, the resources they ignore (disk and network) can be over-allocated, leading to interference, failures, and hogging of cores or memory that other tasks could have used. We present Tetris, a cluster scheduler that packs, i.e., matches the multi-resource requirements of tasks with the resource availabilities of machines, so as to increase cluster efficiency (reduce makespan). Further, Tetris uses an analog of shortest-running-time-first to trade off cluster efficiency for speeding up individual jobs. Tetris' packing heuristics work seamlessly alongside a large class of fairness policies. Trace-driven simulations and a deployment of our prototype on a 250-node cluster show median gains of 30% in job completion time while achieving nearly perfect fairness.
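To make the packing and SRTF trade-off concrete, here is a minimal sketch of how a scheduler might score (task, machine) pairs; the dot-product alignment score, the remaining_work estimate, and the weight epsilon are illustrative assumptions, not details taken from the abstract above.

```python
# Hedged sketch of multi-resource packing in the spirit of Tetris.
# Assumption: tasks are scored by how well their demand vector aligns with a
# machine's free resources (a dot product), plus a bonus for jobs with little
# remaining work (an analog of shortest-running-time-first).
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence

RESOURCES = ("cpu", "mem", "disk", "net")  # order of dimensions in demand/free vectors

@dataclass
class Task:
    job_id: str
    demand: Sequence[float]   # per-resource demand, as a fraction of one machine

@dataclass
class Machine:
    free: List[float]         # per-resource capacity still unallocated

def fits(task: Task, machine: Machine) -> bool:
    # A task is placeable only if every resource demand fits in the free capacity.
    return all(d <= f for d, f in zip(task.demand, machine.free))

def alignment(task: Task, machine: Machine) -> float:
    # Dot product: large when big demands line up with abundant free resources,
    # which tends to leave less fragmented capacity behind.
    return sum(d * f for d, f in zip(task.demand, machine.free))

def pick_task(pending: List[Task], machine: Machine,
              remaining_work: Dict[str, float], epsilon: float = 0.5) -> Optional[Task]:
    # Combine the packing score with an SRTF-style bonus for jobs that are
    # nearly done; epsilon trades cluster efficiency against job completion time.
    best, best_score = None, float("-inf")
    for task in pending:
        if not fits(task, machine):
            continue
        srtf_bonus = 1.0 / (1.0 + remaining_work.get(task.job_id, 0.0))
        score = alignment(task, machine) + epsilon * srtf_bonus
        if score > best_score:
            best, best_score = task, score
    return best

def place(task: Task, machine: Machine) -> None:
    # Deduct the chosen task's demand from the machine's free resources.
    machine.free = [f - d for f, d in zip(machine.free, task.demand)]
```

In this sketch, setting epsilon to zero gives pure packing, while a large epsilon approaches pure shortest-remaining-work ordering, mirroring the trade-off between cluster efficiency and individual job speed-up described above.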
