Scalable System Scheduling for HPC and Big Data

Abstract: In the rapidly expanding field of parallel processing, job schedulers are the “operating systems” of modern big data architectures and supercomputing systems. Job schedulers allocate computing resources and control the execution of processes on those resources. Historically, job schedulers were the domain of supercomputers and were designed to run massive, long-running computations over days and weeks. More recently, big data workloads have created a need for a new class of workloads consisting of many short computations, taking seconds or minutes, that process enormous quantities of data. For both supercomputers and big data systems, the efficiency of the job scheduler represents a fundamental limit on the efficiency of the system. Detailed measurement and modeling of scheduler performance are therefore critical for maximizing the performance of a large-scale computing system. This paper presents a detailed feature analysis of 15 supercomputing and big data schedulers. For big data workloads, scheduler latency is the most important performance characteristic of the scheduler. A theoretical model of the latency of these schedulers is developed and used to design experiments targeted at measuring scheduler latency. Detailed benchmarking of four of the most popular schedulers (Slurm, Son of Grid Engine, Mesos, and Hadoop YARN) is conducted. The theoretical model is compared with the data and demonstrates that scheduler performance can be characterized by two key parameters: the marginal latency of the scheduler, t_s, and a nonlinear exponent, α_s. For all four schedulers that were tested, the utilization of the computing system decreases to 90%.
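As a rough illustration of how such a two-parameter latency model might be applied, the sketch below fits an assumed power-law form T(n) ≈ t_s · n^α_s to launch-time measurements. The functional form, the sample data, and the variable names (t_s, alpha_s, launch_time) are illustrative assumptions for this sketch, not the paper's actual model, code, or measured results.

```python
# Hedged sketch: fit a two-parameter scheduler-latency model of the assumed
# form T(n) ~ t_s * n**alpha_s, where n is the number of jobs launched,
# t_s is the marginal scheduler latency, and alpha_s is a nonlinear exponent.
# The data below are hypothetical, for illustration only.
import numpy as np
from scipy.optimize import curve_fit

def launch_time(n, t_s, alpha_s):
    """Assumed model: total time to launch n jobs through the scheduler."""
    return t_s * n**alpha_s

# Hypothetical measurements: (number of jobs launched, total launch time in seconds)
n_jobs = np.array([1, 10, 100, 1000, 10000], dtype=float)
t_meas = np.array([0.5, 6.0, 75.0, 900.0, 12000.0])

(t_s_fit, alpha_s_fit), _ = curve_fit(launch_time, n_jobs, t_meas, p0=(1.0, 1.0))
print(f"t_s = {t_s_fit:.3f} s/job, alpha_s = {alpha_s_fit:.3f}")
```

A fit of this kind yields the two parameters directly from benchmark runs, so schedulers can be compared by their marginal per-job latency and by how sharply that latency grows as the job count increases.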
