MPI performance engineering with the MPI tool interface: The integration of MVAPICH and TAU

Abstract: The desire for high performance on scalable parallel systems is increasing the complexity and tunability of MPI implementations. The MPI Tools Information Interface (MPI_T), introduced as part of the MPI 3.0 standard, gives performance tools and other external software an opportunity to introspect MPI runtime behavior at a deeper level and detect scalability issues. The interface also provides a mechanism to fine-tune the performance of the MPI library dynamically at runtime. In this paper, we propose an infrastructure that extends existing components (TAU, MVAPICH2, and BEACON) to take advantage of the MPI_T interface and offer runtime introspection, online monitoring, recommendation generation, and autotuning capabilities. We validate our design by developing optimizations for a combination of production and synthetic applications. Using our infrastructure, we implement an autotuning policy for AmberMD (a molecular dynamics package) that monitors and reduces the internal memory footprint of the MVAPICH2 MPI library without affecting performance. For applications such as MiniAMR, whose collective communication is latency-sensitive, our infrastructure can generate a recommendation to enable the hardware offloading of collectives supported by MVAPICH2. Implementing this recommendation reduces the MPI time for MiniAMR at 224 processes by 15%.
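To make the mechanism concrete: MPI_T exposes performance variables (PVARs) for introspection and control variables (CVARs) for tuning. The sketch below uses only standard MPI 3.0 MPI_T calls and no TAU-, BEACON-, or MVAPICH2-specific names; which variables actually exist, and what they are called, is implementation-specific, so this is an illustrative enumeration rather than the paper's infrastructure.

/* mpi_t_introspect.c -- minimal sketch of MPI_T introspection using only
 * standard MPI 3.0 calls; the CVARs/PVARs exposed depend on the MPI
 * library (e.g., MVAPICH2 exports its own set).
 * Build (assumed): mpicc -o mpi_t_introspect mpi_t_introspect.c */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, num_cvars, num_pvars, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* MPI_T has its own initialization, separate from MPI_Init. */
    MPI_T_init_thread(MPI_THREAD_SINGLE, &provided);

    /* Enumerate control variables (CVARs): the knobs a tuning tool could
     * later modify via MPI_T_cvar_handle_alloc / MPI_T_cvar_write. */
    MPI_T_cvar_get_num(&num_cvars);
    if (rank == 0) {
        for (i = 0; i < num_cvars; i++) {
            char name[256], desc[1024];
            int name_len = sizeof(name), desc_len = sizeof(desc);
            int verbosity, bind, scope;
            MPI_Datatype dtype;
            MPI_T_enum enumtype;
            MPI_T_cvar_get_info(i, name, &name_len, &verbosity, &dtype,
                                &enumtype, desc, &desc_len, &bind, &scope);
            printf("CVAR %3d: %s -- %s\n", i, name, desc);
        }
    }

    /* Performance variables (PVARs) are read through a session. As an
     * illustration, read the first unbound unsigned-long-long variable. */
    MPI_T_pvar_get_num(&num_pvars);
    for (i = 0; i < num_pvars && rank == 0; i++) {
        char name[256], desc[1024];
        int name_len = sizeof(name), desc_len = sizeof(desc);
        int verbosity, var_class, bind, readonly, continuous, atomic, count;
        MPI_Datatype dtype;
        MPI_T_enum enumtype;
        MPI_T_pvar_session session;
        MPI_T_pvar_handle handle;
        unsigned long long value[64];

        MPI_T_pvar_get_info(i, name, &name_len, &verbosity, &var_class,
                            &dtype, &enumtype, desc, &desc_len, &bind,
                            &readonly, &continuous, &atomic);
        if (dtype != MPI_UNSIGNED_LONG_LONG || bind != MPI_T_BIND_NO_OBJECT)
            continue;

        MPI_T_pvar_session_create(&session);
        MPI_T_pvar_handle_alloc(session, i, NULL, &handle, &count);
        if (count <= 64) {
            if (!continuous)            /* continuous PVARs are always active */
                MPI_T_pvar_start(session, handle);
            MPI_T_pvar_read(session, handle, value);
            printf("PVAR %3d: %s = %llu\n", i, name, value[0]);
            if (!continuous)
                MPI_T_pvar_stop(session, handle);
        }
        MPI_T_pvar_handle_free(session, &handle);
        MPI_T_pvar_session_free(&session);
        break;                          /* one example variable is enough */
    }

    MPI_T_finalize();
    MPI_Finalize();
    return 0;
}

A monitoring tool would keep the PVAR session open for the lifetime of the application and sample it periodically, while an autotuning policy of the kind described above would additionally write new values to selected CVARs as runtime conditions change.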
