The ANTAREX tool flow for monitoring and autotuning energy efficient HPC systems

Designing and optimizing HPC applications are difficult and complex tasks that require mastering specialized languages and tools for performance tuning. As this is incompatible with the current trend to open HPC infrastructures to a wider range of users, the availability of more sophisticated programming languages and tools to assist and automate the design stages is crucial to provide smooth migration paths towards novel heterogeneous HPC platforms. The ANTAREX project intends to address these issues by providing a tool flow, a Domain Specific Language (DSL) and APIs to make applications adaptive and to manage and autotune them at runtime on heterogeneous HPC systems. Our DSL provides a separation of concerns, where analysis, runtime adaptivity, performance tuning and energy strategies are specified separately from the application functionalities, with the goal of increasing productivity and significantly reducing time to solution, while making possible the deployment of substantially improved implementations. This paper presents the ANTAREX tool flow and shows the impact of optimization strategies in the context of one of the ANTAREX use cases related to personalized drug design. We show how simple strategies, not devised by typical compilers, can substantially speed up the execution and reduce energy consumption.
