A survey of high-performance computing scaling challenges

Commodity clusters revolutionized high-performance computing when they first appeared two decades ago. As scale and complexity have grown, new challenges have emerged in reliability and systemic resilience, energy efficiency and optimization, and software complexity, suggesting the need to re-evaluate current approaches. This paper reviews the state of the art and reflects on some of the challenges likely to be faced when building trans-petascale computing systems, drawing on insights and perspectives from operational experience and community debates.
