Sustainable GPU Computing at Scale

General-purpose GPU (GPGPU) computing has produced the fastest supercomputers in the world. For continued sustainable progress, GPU computing at scale must also address two open issues: a) how to increase an application's mean time between failures (MTBF) as the supercomputer's component count grows, and b) how to minimize unnecessary energy consumption. Since energy consumption is determined by the number of components used, we consider a high performance computing (HPC) application sustainable if it can deliver better performance and reliability at the same time as computing or communication components are added. This paper reports a two-tier semantic statistical multiplexing framework for sustainable HPC at scale. The idea is to leverage the power of statistical multiplexing to tame the nagging HPC scalability challenges. We include the theoretical model, a sustainability analysis, and computational experiments with automatic system-level containment of multiple CPU/GPU failures. Our results show that, assuming a three-times slowdown in the statistical multiplexing layer, for an application using 1024 processors with 35% checkpoint overhead, the two-tier framework produces sustained time and energy savings for MTBFs below 6 hours. With 5% checkpoint overhead, a 1.5-hour MTBF would be the break-even point. These results suggest the practical feasibility of the proposed two-tier framework.

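To make the break-even reasoning concrete, the sketch below compares the expected runtime of a conventional checkpoint/restart run against a multiplexed run that pays a constant slowdown but, by assumption, absorbs component failures without global rollback. The Young/Daly-style waste model, the 10-hour solve time, the per-checkpoint costs, and all function names are illustrative assumptions rather than the paper's exact formulation; the break-even MTBF is simply the point where the two expected-runtime curves cross.

    # Hedged sketch: first-order break-even analysis between a conventional
    # checkpoint/restart (C/R) run and a statistically multiplexed run that
    # absorbs failures without global rollback.  All model choices below are
    # illustrative assumptions, not the paper's exact formulation.
    import math

    def cr_expected_time(t_solve_h, ckpt_cost_h, mtbf_h):
        """Expected wall time of a C/R run under the Young/Daly first-order
        waste model: optimal interval tau = sqrt(2 * delta * M), giving a
        waste fraction of roughly sqrt(2 * delta / M)."""
        waste = math.sqrt(2.0 * ckpt_cost_h / mtbf_h)
        waste = min(waste, 0.99)          # keep the model in its valid range
        return t_solve_h / (1.0 - waste)

    def multiplexed_time(t_solve_h, slowdown):
        """Multiplexed run: constant slowdown factor, no rollback or
        checkpoint cost when components fail (model assumption)."""
        return t_solve_h * slowdown

    def break_even_mtbf(t_solve_h, ckpt_cost_h, slowdown):
        """MTBF below which the multiplexed run is faster (bisection)."""
        lo, hi = 0.01, 1000.0
        for _ in range(60):
            mid = 0.5 * (lo + hi)
            if cr_expected_time(t_solve_h, ckpt_cost_h, mid) > multiplexed_time(t_solve_h, slowdown):
                lo = mid                  # C/R still slower: break-even is higher
            else:
                hi = mid
        return 0.5 * (lo + hi)

    if __name__ == "__main__":
        # Illustrative parameters only: a 10-hour solve, 3x multiplexing
        # slowdown, and two hypothetical per-checkpoint costs standing in
        # for the paper's 35% and 5% overhead scenarios.
        for ckpt_cost in (0.5, 0.05):
            m = break_even_mtbf(t_solve_h=10.0, ckpt_cost_h=ckpt_cost, slowdown=3.0)
            print(f"checkpoint cost {ckpt_cost:.2f} h -> break-even MTBF ~ {m:.2f} h")

Under these assumed parameters the script prints a break-even MTBF for each checkpoint-cost scenario; lowering the per-checkpoint cost pushes the break-even MTBF down, which is the same qualitative trend as the paper's 6-hour versus 1.5-hour figures.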