A Survey of Communication Performance Models for High-Performance Computing

This survey presents the state of the art in analytic communication performance models and describes particularly noteworthy efforts in detail. Modeling the cost of communication in computer clusters is an important and challenging problem. Such models provide insight into the design of communication patterns for parallel scientific applications and mathematical kernels, and they give a clear basis for optimizing the deployment of these applications on increasingly complex high-performance computing infrastructure. The survey gives background on how different performance models represent the underlying platform and traces the evolution of these models from early clusters of single-core processors to present-day multi-core and heterogeneous platforms. It concludes with prospective directions for future research in analytic communication performance modeling.
