Improving Batch Scheduling on Blue Gene/Q by Relaxing Network Allocation Constraints

As systems scale toward exascale, many resources will become increasingly constrained. While some of these resources have historically been explicitly allocated, many others, such as network bandwidth, I/O bandwidth, and power, have not. As systems continue to evolve, we expect many such resources to become explicitly managed. This change will pose critical challenges to resource management and job scheduling. In this paper, we explore the potential of relaxing network allocation constraints on Blue Gene systems, whose partition-based interconnect architecture provides a unique opportunity to explicitly allocate network resources to jobs. Our objective is to improve batch scheduling performance. This paper makes three major contributions. The first is substantial benchmarking of parallel applications, focused on assessing application sensitivity to communication bandwidth at large scale. The second is three new scheduling schemes that use relaxed network allocation and aim to balance individual job performance with overall system performance. The third is a comparative study of our scheduling schemes against the existing scheduler on Mira, a 48-rack Blue Gene/Q system at Argonne National Laboratory. The study uses job traces collected from this production system.
