Exploiting system level heterogeneity to improve the performance of a GeoStatistics multi-phase task-based application

Heterogeneity is part of HPC infrastructures, not only at the intra-node level but also at the system level. Applications whose phases have distinct resource requirements can take advantage of this inter-node heterogeneity to improve performance and reduce resource idleness. One such application is ExaGeoStat, a task-based machine learning framework designed specifically for geostatistics data. This work presents strategies to efficiently distribute multi-phase applications over system-level heterogeneous resources. We (1) improve application phase overlap by optimizing runtime and scheduling decisions and (2) compute the optimal distribution for all phases using a linear program that leverages node heterogeneity while limiting communication overhead. Our phase-overlap improvements yield performance gains between 36% and 50% compared to the original synchronous, homogeneous execution. We also show that adding some slow nodes to a homogeneous set of fast nodes improves performance by a further 25% compared to a standard block-cyclic distribution, thereby making use of every available machine.
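
For readers unfamiliar with the baseline, the following is a minimal sketch of a standard 2D block-cyclic tile mapping, set beside a naive speed-proportional split. It is illustrative only and is neither ExaGeoStat code nor the linear program described above. The function names, the square tile grid, and the relative node speeds are all assumptions made for this example.

```python
from collections import Counter


def block_cyclic_owner(i, j, p, q):
    """Rank owning tile (i, j) on a p x q process grid (standard 2D block-cyclic)."""
    return (i % p) * q + (j % q)


def tiles_per_node(nt, p, q):
    """Count how many tiles of an nt x nt tile grid each rank owns."""
    counts = Counter()
    for i in range(nt):
        for j in range(nt):
            counts[block_cyclic_owner(i, j, p, q)] += 1
    return counts


def proportional_shares(total_tiles, speeds):
    """Hypothetical helper: split tiles in proportion to assumed relative node speeds.

    This is a simple stand-in for illustration, not the linear program from the paper.
    """
    total_speed = sum(speeds)
    shares = [total_tiles * s // total_speed for s in speeds]
    # Integer division may leave a few tiles unassigned; give them to the fastest nodes.
    leftover = total_tiles - sum(shares)
    fastest_first = sorted(range(len(speeds)), key=lambda k: -speeds[k])
    for k in fastest_first[:leftover]:
        shares[k] += 1
    return shares


if __name__ == "__main__":
    # Block-cyclic gives every rank the same load, regardless of node speed.
    print(dict(tiles_per_node(4, 2, 2)))
    # A speed-aware split gives faster nodes (speed 2) more tiles than slow ones (speed 1).
    print(proportional_shares(16, [2, 2, 1, 1]))
```

Block-cyclic assigns the same number of tiles to every rank, so on a mixed set of fast and slow nodes the slow ones dictate the pace. A speed-aware split shifts work toward the fast nodes, but unlike the approach described in the abstract, this sketch ignores communication cost and the differing needs of each application phase.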
