GPU Age-Aware Scheduling to Improve the Reliability of Leadership Jobs on Titan

In 2015, OLCF's Titan supercomputer experienced a significant increase in GPU-related job failures. The impact on jobs was severe enough that OLCF decided to replace ∼50% of the GPUs. Unfortunately, jobs using more than 20% of the machine (i.e., leadership jobs) continued to encounter elevated application failure rates, because their allocations spanned large numbers of both low-failure-rate and high-failure-rate GPUs. These failures hit leadership jobs especially hard due to their longer wait times, longer runtimes, and higher charge rates. In this work, we design techniques that increase the use of low-failure GPUs in leadership jobs through targeted resource allocation. We employ two complementary techniques, updating both the system ordering and the allocation mechanisms. In simulation, applying these techniques increased the low-failure GPU hours assigned to leadership jobs by 33%. Our GPU Age-Aware Scheduling has been in production on Titan since July 2017.
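The two complementary techniques the abstract names — reordering the system's node list and adjusting the allocation mechanism — can be illustrated with a minimal sketch. This is not the authors' implementation; all names, the node count, and the head-vs-tail allocation policy are illustrative assumptions, showing only the general idea of steering low-failure GPUs toward large jobs:

```python
# Hypothetical sketch of GPU age-aware scheduling (not the paper's code).
# Idea: place nodes with low-failure-rate GPUs at the head of the system
# ordering, then let leadership jobs (>20% of the machine) allocate from
# the head while small jobs allocate from the tail.

from dataclasses import dataclass

TOTAL_NODES = 10             # toy "machine" size (illustrative)
LEADERSHIP_FRACTION = 0.20   # jobs above this fraction are leadership jobs

@dataclass
class Node:
    node_id: int
    low_failure_gpu: bool    # True if this node carries a low-failure-rate GPU

def order_nodes(nodes):
    """System ordering: low-failure GPUs first, stable by node id."""
    return sorted(nodes, key=lambda n: (not n.low_failure_gpu, n.node_id))

def allocate(job_nodes, ordered_free):
    """Allocation mechanism: leadership jobs draw from the head of the
    ordered list (low-failure GPUs); smaller jobs draw from the tail."""
    if job_nodes > LEADERSHIP_FRACTION * TOTAL_NODES:
        return ordered_free[:job_nodes]
    return ordered_free[-job_nodes:]

# Even-numbered nodes stand in for the replaced, low-failure GPU batch.
nodes = [Node(i, low_failure_gpu=(i % 2 == 0)) for i in range(TOTAL_NODES)]
ordered = order_nodes(nodes)
lead = allocate(3, ordered)   # leadership job: 3 of 10 nodes (>20%)
small = allocate(1, ordered)  # small job: 1 node, taken from the tail
```

Under these assumptions, the 3-node leadership job receives only low-failure GPUs while the 1-node job lands on a high-failure node, matching the paper's goal of concentrating low-failure GPU hours in leadership jobs.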