Extending τ-Lop to model concurrent MPI communications in multicore clusters

Achieving optimal performance of MPI applications on current multi-core architectures, composed of multiple shared communication channels and deep memory hierarchies, is not trivial. Formal analysis using parallel performance models allows one to depict the underlying behavior of the algorithms and their communication complexities, with the aim of estimating their cost and improving their performance. The LogGP model was initially conceived to predict the cost of algorithms in mono-processor clusters based on point-to-point transmissions, with parameters derived from network latency and bandwidth. It remains the representative model, with multiple extensions for handling high-performance networks and for covering particular contention cases, channel hierarchies, or protocol costs. These very specific branches have led LogGP to partially lose its initial abstract modeling purpose. The more recent log_nP model represents a point-to-point transmission as a sequence of implicit transfers or data movements. Nevertheless, like LogGP, it models an algorithm on a parallel architecture as a sequence of message transmissions, an approach insufficient for modeling algorithms more advanced than simple tree-based ones, as we show in this work. In this paper, the τ-Lop model is extended to multi-core clusters and compared to previous models. It demonstrates the ability to predict, with high accuracy, the cost of advanced algorithms and mechanisms used by mainstream MPI implementations such as MPICH or Open MPI. τ-Lop is based on the concept of concurrent transfers, and applies it to meaningfully represent the behavior of parallel algorithms on complex platforms with hierarchical shared communication channels, taking into account the effects of contention and of the deployment of processes on the processors. In addition, an exhaustive and reproducible methodology for measuring the parameters of the model is described.

Highlights:
- We present an extension of the τ-Lop performance model for multi-core clusters.
- The goal of τ-Lop is to help in the design and optimization of parallel algorithms.
- It is applied to collective algorithms in mainstream MPI implementations.
- The τ-Lop model is compared to other well-known and established models.
- A methodology is described for measuring the parameters of the model.
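To make the idea of costing an algorithm by its concurrent transfers concrete, the sketch below estimates the cost of a binomial-tree broadcast in a τ-Lop-like style. This is an illustrative simplification, not the paper's exact formulation: the function `L(m, tau)` stands for the time of an m-byte transfer performed concurrently with `tau - 1` others over a shared channel, and the parameter values (bandwidth, overhead `o`, linear contention, two transfers per shared-memory message) are hypothetical.

```python
import math

def L(m, tau, bandwidth=5e9, contention=1.0):
    """Hypothetical cost of one m-byte transfer when tau transfers share
    the channel; contention is assumed to grow linearly with tau."""
    return (m / bandwidth) * (1 + contention * (tau - 1))

def binomial_bcast_cost(P, m, o=1e-6, transfers_per_msg=2):
    """Estimated cost of a binomial-tree broadcast among P processes.
    At step k, 2**k processes transmit concurrently, so the per-step
    cost is charged at concurrency tau = 2**k rather than summed as
    independent point-to-point messages (the LogGP-style approach)."""
    steps = math.ceil(math.log2(P))
    total = 0.0
    for k in range(steps):
        tau = 2 ** k  # concurrent transmissions in this step
        total += o + transfers_per_msg * L(m, tau)
    return total
```

The key difference from a sequence-of-transmissions model is visible in the loop: each step is charged once at its concurrency level `tau`, so contention on the shared channel raises the predicted cost of the later, wider steps.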
