Improving Batch Scheduling on Blue Gene/Q by Relaxing 5D Torus Network Allocation Constraints

As systems scale toward exascale, many resources will become increasingly constrained. While some of these resources have historically been explicitly allocated, many, such as network bandwidth, I/O bandwidth, or power, have not. As systems continue to evolve, we expect many such resources to become explicitly managed. This change will pose critical challenges to resource management and job scheduling. In this paper, we explore the potential of relaxing network allocation constraints for Blue Gene systems. Our objective is to improve batch scheduling performance on these systems, where the partition-based interconnect architecture provides a unique opportunity to explicitly allocate network resources to jobs. This paper makes three major contributions. The first is substantial benchmarking of parallel applications, focusing on assessing application sensitivity to communication bandwidth at large scale. The second is two new scheduling schemes that use relaxed network allocation and are targeted at balancing individual job performance with overall system performance. The third is a comparative study of our scheduling schemes against the existing one under different workloads, using job traces collected from the 48-rack Mira, an IBM Blue Gene/Q system at Argonne National Laboratory.
