Performance modeling of a geophysics application to accelerate over‐decomposition parameter tuning through simulation

Finite-difference methods are commonplace in High Performance Computing applications. Despite their apparent regularity, they often exhibit load imbalance that damages their efficiency. We characterize the spatial and temporal load imbalance of Ondes3D, a typical finite-difference application dedicated to earthquake modeling. Our analysis reveals imbalance originating from the structure of the input data and from low-level CPU optimizations. Ondes3D was successfully ported to AMPI/CHARM++ using over-decomposition and MPI process migration techniques to dynamically rebalance the load. However, this approach requires careful selection of the over-decomposition level, the load balancing algorithm, and its activation frequency. These choices are usually tied to the application structure and the platform characteristics. In this article, we propose a workflow that leverages the capabilities of SimGrid to conduct such a study at low experimental cost. We rely on a combination of emulation, simulation, and application modeling that requires minimal code modification and captures both spatial and temporal load imbalance, so that the performance of dynamic load balancing can be predicted faithfully. We evaluate the quality of our simulation by comparing its results with the outcome of real executions, and we demonstrate how this approach can be used to quickly find the optimal load balancing configuration for a given application and hardware configuration.
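To make the role of the over-decomposition level concrete, the following is a minimal toy sketch, not the workflow of the article: it uses neither SimGrid nor AMPI/CHARM++. It assumes a hypothetical 1D workload in which the first quarter of the cells costs four times more than the rest (a stand-in for spatial imbalance), splits it into `n_procs * k` contiguous chunks, and places the chunks with a simple greedy heuristic. All function names, the cost values, and the greedy placement are illustrative assumptions; migration cost and load balancing frequency are ignored.

```python
import heapq


def chunk_costs(cell_costs, n_chunks):
    """Split per-cell costs into n_chunks contiguous chunks; return each chunk's total cost."""
    n = len(cell_costs)
    bounds = [i * n // n_chunks for i in range(n_chunks + 1)]
    return [sum(cell_costs[bounds[i]:bounds[i + 1]]) for i in range(n_chunks)]


def greedy_assign(costs, n_procs):
    """Greedy placement (illustrative): heaviest chunk first, onto the least-loaded processor."""
    heap = [(0.0, p) for p in range(n_procs)]
    heapq.heapify(heap)
    loads = [0.0] * n_procs
    for cost in sorted(costs, reverse=True):
        load, proc = heapq.heappop(heap)
        loads[proc] = load + cost
        heapq.heappush(heap, (loads[proc], proc))
    return loads


def imbalance(loads):
    """Ratio of the maximum load to the mean load (1.0 means perfectly balanced)."""
    return max(loads) / (sum(loads) / len(loads))


def sweep(cell_costs, n_procs, levels):
    """Imbalance after greedy placement for each over-decomposition level k."""
    return {
        k: imbalance(greedy_assign(chunk_costs(cell_costs, n_procs * k), n_procs))
        for k in levels
    }


# Hypothetical workload: 1024 cells, the first 256 cost 4x more than the others.
cells = [4.0] * 256 + [1.0] * 768
result = sweep(cells, n_procs=4, levels=[1, 2, 4, 8])

if __name__ == "__main__":
    for k, ratio in result.items():
        print(f"over-decomposition level {k}: max/mean load = {ratio:.3f}")
```

With one chunk per processor (`k = 1`) there is nothing to move, so the imbalance of the input carries over unchanged; finer chunks give the balancer more freedom. In a real execution, a higher level also adds scheduling, communication, and migration overhead, which is why the level, the algorithm, and the activation frequency have to be tuned together.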
