Scaling the Power Wall: A Path to Exascale

Modern scientific discovery is driven by an insatiable demand for computing performance. The HPC community is targeting development of supercomputers able to sustain 1 ExaFlops by the year 2020 and power consumption is the primary obstacle to achieving this goal. A combination of architectural improvements, circuit design, and manufacturing technologies must provide over a 20× improvement in energy efficiency. In this paper, we present some of the progress NVIDIA Research is making toward the design of Exascale systems by tailoring features to address the scaling challenges of performance and energy efficiency. We evaluate several architectural concepts for a set of HPC applications demonstrating expected energy efficiency improvements resulting from circuit and packaging innovations such as low-voltage SRAM, low-energy signalling, and on-package memory. Finally, we discuss the scaling of these features with respect to future process technologies and provide power and performance projections for our Exascale research architecture.

[1]  Xi Chen,et al.  A 0.54 pJ/b 20 Gb/s Ground-Referenced Single-Ended Short-Reach Serial Link in 28 nm CMOS for Advanced Packaging Applications , 2013, IEEE Journal of Solid-State Circuits.

[2]  A. Burg,et al.  Towards generic low-power area-efficient standard cell based memory architectures , 2010, 2010 53rd IEEE International Midwest Symposium on Circuits and Systems.

[3]  John Shalf,et al.  Software Design Space Exploration for Exascale Combustion Co-design , 2013, ISC.

[4]  Alan B. Williams,et al.  Poster: mini-applications: vehicles for co-design , 2011, SC '11 Companion.

[5]  Sandia Report,et al.  Improving Performance via Mini-applications , 2009 .

[6]  Erik Lindholm,et al.  NVIDIA Tesla: A Unified Graphics and Computing Architecture , 2008, IEEE Micro.

[7]  Song Huang,et al.  On the energy efficiency of graphics processing units for scientific computing , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[8]  William J. Dally,et al.  Flattened butterfly: a cost-efficient topology for high-radix networks , 2007, ISCA '07.

[9]  M. Horowitz,et al.  Efficient on-chip global interconnects , 2003, 2003 Symposium on VLSI Circuits. Digest of Technical Papers (IEEE Cat. No.03CH37408).

[10]  Javier Zalamea,et al.  Two-level hierarchical register file organization for VLIW processors , 2000, MICRO 33.

[11]  Ibm Blue,et al.  Overview of the IBM Blue Gene/P Project , 2008, IBM J. Res. Dev..

[12]  Krste Asanovic,et al.  Convergence and scalarization for data-parallel architectures , 2013, Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).

[13]  R. Baker,et al.  An Sn algorithm for the massively parallel CM-200 computer , 1998 .

[14]  Anantha Chandrakasan,et al.  Application-Specific SRAM Design Using Output Prediction to Reduce Bit-Line Switching Activity and Statistically Gated Sense Amplifiers for Up to 1.9$\times$ Lower Energy/Access , 2013, IEEE Journal of Solid-State Circuits.

[15]  Margaret H. Wright,et al.  The opportunities and challenges of exascale computing , 2010 .

[16]  Martin Schulz,et al.  Exploring Traditional and Emerging Parallel Programming Models Using a Proxy Application , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[17]  Zhiyu Zeng,et al.  Tradeoff analysis and optimization of power delivery networks with on-chip voltage regulation , 2010, Design Automation Conference.

[18]  John R. Rice,et al.  //ELLPACK: a numerical simulation programming environment for parallel MIMD machines , 1990, ICS '90.

[19]  William J. Dally,et al.  GPUs and the Future of Parallel Computing , 2011, IEEE Micro.

[20]  Andrew A. Chien,et al.  Exascale workload characterization and architecture implications , 2013, 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[21]  Andrew Siegel,et al.  XSBENCH - THE DEVELOPMENT AND VERIFICATION OF A PERFORMANCE ABSTRACTION FOR MONTE CARLO REACTOR ANALYSIS , 2014 .

[22]  Benton H. Calhoun,et al.  A reverse write assist circuit for SRAM dynamic write VMIN tracking using canary SRAMs , 2014, Fifteenth International Symposium on Quality Electronic Design.

[23]  William J. Dally,et al.  Unifying Primary Cache, Scratch, and Register File Memories in a Throughput Processor , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[24]  W. Dally,et al.  Route packets, not wires: on-chip interconnection networks , 2001, Proceedings of the 38th Design Automation Conference (IEEE Cat. No.01CH37232).

[25]  Borivoje Nikolic,et al.  SRAM Assist Techniques for Operation in a Wide Voltage Range in 28-nm CMOS , 2012, IEEE Transactions on Circuits and Systems II: Express Briefs.

[26]  William J. Dally,et al.  A compile-time managed multi-level register file hierarchy , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[27]  Karthikeyan Sankaralingam,et al.  Dark Silicon and the End of Multicore Scaling , 2012, IEEE Micro.

[28]  Augustus K. Uht,et al.  Uniprocessor performance enhancement through adaptive clock frequency control , 2005, IEEE Transactions on Computers.

[29]  Benoît Meister,et al.  Runnemede: An architecture for Ubiquitous High-Performance Computing , 2013, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).

[30]  William J. Dally,et al.  Technology-Driven, Highly-Scalable Dragonfly Topology , 2008, 2008 International Symposium on Computer Architecture.