Towards Scalable Distributed Workload Manager with Monitoring-Based Weakly Consistent Resource Stealing

One way to utilize the coming exascale machines efficiently is to support a mixture of applications from various domains, such as traditional large-scale HPC, ensemble runs, and fine-grained many-task computing (MTC). The need to deliver high performance in resource allocation, scheduling, and launching for all of these job types has driven us to develop Slurm++, a distributed workload manager extended directly from the centralized Slurm production system. Slurm++ employs multiple controllers, each of which manages a partition of the compute nodes and participates in resource allocation through resource balancing techniques. In this paper, we propose a monitoring-based, weakly consistent resource stealing technique to achieve resource balancing in distributed HPC job launch, and we implement the technique in Slurm++. We compare Slurm++ with Slurm using micro-benchmark workloads with different job sizes. Slurm++ was 10X faster than Slurm in allocating resources and launching jobs. We expect the performance gap to grow as job sizes and system scales increase in future high-end computing systems.
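
The idea of controllers that each own a partition and steal resources from one another based on monitored, possibly stale state can be illustrated with a minimal sketch. This is not the Slurm++ implementation: the names `Controller`, `Monitor`, and `allocate`, the policy of visiting the peer with the most advertised free nodes first, and the rollback behavior are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Controller:
    """Hypothetical controller that owns one partition of compute nodes."""
    name: str
    free_nodes: set = field(default_factory=set)

    def take(self, count):
        """Remove and return up to `count` free nodes (may return fewer)."""
        taken = set(sorted(self.free_nodes)[:count])
        self.free_nodes -= taken
        return taken


class Monitor:
    """Holds a possibly stale snapshot of each controller's free-node count."""

    def __init__(self, controllers):
        self.controllers = controllers
        self.snapshot = {}

    def refresh(self):
        # A real system would refresh periodically; here it is called by hand.
        self.snapshot = {c.name: len(c.free_nodes) for c in self.controllers}


def allocate(job_size, local, monitor):
    """Allocate `job_size` nodes, stealing from peers if the local partition is short.

    Returns the set of allocated nodes, or None if the job cannot be satisfied.
    """
    allocated = local.take(job_size)
    by_name = {c.name: c for c in monitor.controllers}
    # Visit peers in order of the (possibly stale) snapshot, most free nodes first.
    candidates = sorted(
        (n for n in monitor.snapshot if n != local.name),
        key=lambda n: monitor.snapshot[n],
        reverse=True,
    )
    for name in candidates:
        if len(allocated) == job_size:
            break
        # The snapshot is only a hint: a peer may hand over fewer nodes than advertised.
        allocated |= by_name[name].take(job_size - len(allocated))
    if len(allocated) < job_size:
        # Simplified rollback (assumption): all gathered nodes go to the local controller.
        local.free_nodes |= allocated
        return None
    return allocated
```

In this sketch the snapshot is deliberately allowed to go stale between calls to `refresh`, so a steal attempt can come back short and the caller has to tolerate that. This is the sense in which the shared view is only weakly consistent here.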
