Framework and Modular Infrastructure for Automation of Architectural Adaptation and Performance Optimization for HPC Systems

High performance systems have complex, diverse and rapidly evolving architectures. The span of applications, workloads, and resource use patterns is rapidly diversifying. Adapting applications for efficient execution on this spectrum of execution environments is effort intensive. There are many performance optimization tools which implement some or several aspects of the full performance optimization task but almost none are comprehensive across architectures, environments, applications, and workloads. This paper presents, illustrates, and applies a modular infrastructure which enables composition of multiple open-source tools and analyses into a set of workflows implementing comprehensive end-to-end optimization of a diverse spectrum of HPC applications on multiple architectures and for multiple resource types and parallel environments. It gives results from an implementation on the Stampede HPC system at the Texas Advanced Computing Center where a user can submit an application for optimization using only a single command line and get back an at least, partially optimized program without manual program modification for two different chips. Currently, only a subset of the possible optimizations is completely automated but this subset is rapidly growing. Case studies of applications of the workflow are presented. The implementations currently available for download as the PerfExpert tool version 4.0 supports both Sandy Bridge and Intel Phi chips.

[1]  Guojing Cong,et al.  A framework for automated performance bottleneck detection , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.

[2]  Rudolf Eigenmann Toward a methodology of optimizing programs for high-performance computers , 1993, ICS '93.

[3]  Dirk Schmidl,et al.  Score-P: A Unified Performance Measurement System for Petascale Applications , 2010, CHPC.

[4]  Jack J. Dongarra,et al.  A Portable Programming Interface for Performance Evaluation on Modern Processors , 2000, Int. J. High Perform. Comput. Appl..

[5]  Seetharami R. Seelam,et al.  A Productivity Centered Tools Framework for Application Performance Tuning , 2007 .

[6]  Vikram S. Adve,et al.  LLVM: a compilation framework for lifelong program analysis & transformation , 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004..

[7]  Markus Schordan,et al.  A Source-to-Source Architecture for User-Defined Optimizations , 2003, JMLC.

[8]  Renato J. O. Figueiredo,et al.  Towards an Integrated, Web-executable Parallel Programming Tool Environment , 2000, ACM/IEEE SC 2000 Conference (SC'00).

[9]  Allen D. Malony,et al.  Knowledge support and automation for performance analysis with PerfExplorer 2.0 , 2008 .

[10]  S. Donatelli,et al.  CSL^TA: an Expressive Logic for Continuous-Time Markov Chains , 2007, Fourth International Conference on the Quantitative Evaluation of Systems (QEST 2007).

[11]  Rudolf Eigenmann,et al.  Automatic program parallelization , 1993, Proc. IEEE.

[12]  Insung Park,et al.  A Performance Advisor Tool for Shared-Memory Parallel Programming , 2000, LCPC.

[13]  Nathan R. Tallent,et al.  HPCToolkit: performance tools for scientific computing , 2008 .

[14]  Ravi Sethi,et al.  Yacc: a parser generator , 1990 .

[15]  James C. Browne,et al.  Enhancing performance optimization of multicore chips and multichip nodes with data structure metrics , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[16]  Nicholas Nethercote,et al.  Valgrind: A Program Supervision Framework , 2003, RV@CAV.

[17]  Jesús Labarta,et al.  Tools for Power-Energy Modelling and Analysis of Parallel Scientific Applications , 2012, 2012 41st International Conference on Parallel Processing.

[18]  Allen D. Malony,et al.  The Tau Parallel Performance System , 2006, Int. J. High Perform. Comput. Appl..

[19]  Barton P. Miller,et al.  The Paradyn Parallel Performance Measurement Tool , 1995, Computer.

[20]  Tijs van der Storm,et al.  RASCAL: A Domain Specific Language for Source Code Analysis and Manipulation , 2009, 2009 Ninth IEEE International Working Conference on Source Code Analysis and Manipulation.

[21]  Brian Armstrong,et al.  On the Interaction of Tiling and Automatic Parallelization , 2005, IWOMP.

[22]  Chun Chen,et al.  A scalable auto-tuning framework for compiler optimization , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[23]  Harish Patil,et al.  Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.

[24]  Anna Sikora,et al.  AutoTune: A Plugin-Driven Approach to the Automatic Tuning of Parallel Applications , 2012, PARA.

[25]  Martin Schulz,et al.  Open | SpeedShop: An open source infrastructure for parallel performance analysis , 2008, Sci. Program..

[26]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[27]  Lars Koesterke,et al.  PerfExpert: An Easy-to-Use Performance Diagnosis Tool for HPC Applications , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.