Prometheus: Coherent Exploration of Hardware and Software Optimizations Using Aspen

The dramatic increase in scale expected for Exascale computing creates a pressing need to tune hardware configurations and software optimizations together, so that each complements the other. However, the expected growth in the number of tunable hardware parameters makes searching the design space for optimal hardware-software configurations far more challenging. To this end, we propose Prometheus, a composable hardware-software optimization framework. Prometheus combines analytical and machine-learning techniques to capture application characteristics and then determine the hardware-software configuration that yields near-optimal performance. We evaluate the efficacy of Prometheus using two widely used proxy applications, LULESH and CoMD. We demonstrate that Prometheus identifies near-optimal hardware-software configurations, and we verify these results against a brute-force search of the design space.
