A visual performance analysis framework for task‐based parallel applications running on hybrid clusters

Programming paradigms in High‐Performance Computing have been shifting toward task‐based models that adapt readily to heterogeneous and scalable supercomputers. The performance of task‐based applications depends heavily on the runtime scheduling heuristics and on the runtime's ability to exploit computing and communication resources. Unfortunately, traditional performance analysis strategies are ill suited to task‐based runtime systems and applications: they expect regular behavior with distinct communication and computation phases, whereas task‐based applications exhibit no clear phases. Moreover, the finer granularity of task‐based applications typically induces stochastic behavior that leads to irregular structures that are difficult to analyze. Furthermore, combining application structure, scheduler, and hardware information is generally essential to understanding performance issues. This paper presents a flexible framework that combines several sources of information and supports custom visualization panels for understanding and pinpointing performance problems caused by bad scheduling decisions in task‐based applications. Three case studies using StarPU‐MPI, a task‐based multi‐node runtime system, show how our framework can be used to study the performance of the well‐known Cholesky factorization. Performance improvements include better task partitioning across multi‐(GPU, core) configurations to get closer to theoretical lower bounds, improved MPI pipelining in multi‐(node, core, GPU) configurations to reduce the slow start, and changes in the runtime system to increase MPI bandwidth, with gains of up to 13% in the total makespan.
