Software Design Space Exploration for Exascale Combustion Co-design

The design of hardware for next-generation exascale computing systems will require a deep understanding of how software optimizations impact hardware design trade-offs. To characterize how co-tuning hardware and software parameters affects the performance of combustion simulation codes, we created ExaSAT, a compiler-driven static analysis and performance modeling framework. Our framework can evaluate hundreds of hardware/software configurations in seconds, providing an essential speed advantage over simulators and dynamic analysis techniques during the co-design process. Our analytic performance model shows that advanced code transformations, such as cache blocking and loop fusion, can have a significant impact on choices for cache and memory architecture. Our modeling helped us identify tuned configurations that achieve a 90% reduction in memory traffic, which could significantly improve performance and reduce energy consumption. These techniques will also be useful for developing advanced programming models and runtimes, which must reason about these optimizations to deliver better performance and energy efficiency.
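To make the link between loop fusion and memory traffic concrete, the following is a minimal sketch of an analytic traffic estimate. It is not ExaSAT's model: the function names, the chain-of-streaming-loops workload, the 8-byte word size, and the write-allocate assumption are all hypothetical choices made for illustration. Each stage reads one array and writes one array; when the stages are fused, the intermediate arrays are assumed to stay in registers and never reach main memory.

```python
def stream_traffic_bytes(n_elems, arrays_read, arrays_written,
                         word_bytes=8, write_allocate=True):
    """Estimate main-memory traffic for one streaming loop (toy model)."""
    reads = arrays_read * n_elems * word_bytes
    writes = arrays_written * n_elems * word_bytes
    if write_allocate:
        # Assumption: each written cache line is fetched before it is modified.
        reads += writes
    return reads + writes


def unfused_chain_traffic(n_elems, stages, **kw):
    """Each stage is a separate loop: read one array, write one array."""
    return stages * stream_traffic_bytes(n_elems, 1, 1, **kw)


def fused_chain_traffic(n_elems, **kw):
    """All stages in one loop: only the first input and last output hit memory."""
    return stream_traffic_bytes(n_elems, 1, 1, **kw)


def traffic_reduction(n_elems, stages, **kw):
    """Fraction of traffic removed by fusing the whole chain."""
    unfused = unfused_chain_traffic(n_elems, stages, **kw)
    fused = fused_chain_traffic(n_elems, **kw)
    return 1.0 - fused / unfused


if __name__ == "__main__":
    # Sweep a few hypothetical software configurations.
    for stages in (2, 4):
        for write_allocate in (True, False):
            r = traffic_reduction(1_000_000, stages, write_allocate=write_allocate)
            print(f"stages={stages} write_allocate={write_allocate} reduction={r:.0%}")
```

Because every estimate is a closed-form expression, sweeping many configurations costs almost nothing, which is the kind of speed advantage an analytic model has over simulation. The numbers produced by this toy model are illustrative only and are not results from the paper.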
