A Survey of Methods for Collective Communication Optimization and Tuning

New developments in HPC technology, in terms of increasing computing power on multi/many-core processors, high-bandwidth memory/IO subsystems, and communication interconnects, have a direct impact on software and runtime system development. These advancements have enabled high-performance collective communication interfaces that integrate efficiently with a wide variety of platforms and environments. However, the number of optimization options that arrives with each new technology or software framework has produced a \emph{combinatorial explosion} in the feature space for tuning collective parameters, to the point that finding the optimal set has become a nearly impossible task. The applicability of the algorithmic choices available for optimizing collective communication depends largely on the scalability requirements of a particular use case. The problem is further exacerbated by any requirement to run collectives at very large scales, as in exascale computing, where brute-force tuning is impractical and may consume many months of resources. The application of statistical, data mining, and artificial intelligence techniques, or more general hybrid learning models, therefore seems essential for many collective parameter optimization problems. We explore current and cutting-edge methods for collective communication optimization and tuning, and conclude with possible future directions for this problem.
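As a minimal illustration of what learning-based tuning of collective parameters can look like, the sketch below trains a decision tree (one of the model families studied in the MPI collective algorithm selection literature) to map call-time features to an algorithm choice. It is not taken from any particular tuning framework; the feature set, algorithm labels, training data, and use of scikit-learn are all illustrative assumptions.

```python
# A minimal sketch of learning-based collective algorithm selection.
# The feature set (message size, communicator size), the algorithm labels,
# and the training data are hypothetical placeholders; a real system would
# train on measured runtimes collected across the target platform.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical offline benchmark results: (message size in bytes, number of ranks),
# labeled with the empirically fastest algorithm for that configuration.
X_train = np.array([
    [1_024,       8],
    [1_024,     512],
    [262_144,     8],
    [262_144,   512],
    [8_388_608,   8],
    [8_388_608, 512],
])
y_train = [
    "binomial_tree",       # small message, latency-bound
    "binomial_tree",
    "recursive_doubling",  # medium message
    "recursive_doubling",
    "ring",                # large message, bandwidth-bound: pipelined ring wins
    "ring",
]

# A compact model like this could replace hand-written, fixed switch-over
# thresholds inside an MPI runtime's collective selection logic.
model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)

# At call time the runtime would query the model with the actual parameters,
# e.g. a 64 KiB collective across 128 ranks.
print(model.predict([[65_536, 128]])[0])
```

In practice the feature space is far richer (segment size, topology, network offload capabilities, intra- vs. inter-node layout), which is precisely the combinatorial explosion that motivates replacing exhaustive search with statistical and learning-based models.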

[1]  Jim Euchner Design , 2014, Catalysis from A to Z.

[2]  Katherine A. Yelick,et al.  Titanium: A High-performance Java Dialect , 1998, Concurr. Pract. Exp..

[3]  A. Ng Feature selection, L1 vs. L2 regularization, and rotational invariance , 2004, Twenty-first international conference on Machine learning - ICML '04.

[4]  Martin Schulz,et al.  PNMPI tools: a whole lot greater than the sum of their parts , 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).

[5]  Jon Louis Bentley,et al.  Quad trees a data structure for retrieval on composite keys , 1974, Acta Informatica.

[6]  Steve Poole,et al.  ConnectX-2 InfiniBand Management Queues: First Investigation of the New Support for Network Offloaded Collective Operations , 2010, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing.

[7]  Steven G. Johnson,et al.  FFTW: an adaptive software architecture for the FFT , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[8]  Torsten Hoefler,et al.  PEMOGEN: Automatic adaptive performance modeling during program runtime , 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).

[9]  Ying Qian,et al.  Efficient shared memory and RDMA based collectives on multi-rail QsNetII SMP clusters , 2008, Cluster Computing.

[10]  Sathish S. Vadhiyar,et al.  Automatically Tuned Collective Communications , 2000, ACM/IEEE SC 2000 Conference (SC'00).

[11]  Jack J. Dongarra,et al.  Decision Trees and MPI Collective Algorithm Selection Problem , 2007, Euro-Par.

[12]  Matthew N. Anyanwu,et al.  Comparative Analysis of Serial Decision Tree Classification Algorithms , 2009 .

[13]  Joe D. Warren,et al.  The program dependence graph and its use in optimization , 1987, TOPL.

[14]  Armin R. Mikler,et al.  NetPIPE: A Network Protocol Independent Performance Evaluator , 1996 .

[15]  Jack J. Dongarra,et al.  Automatically Tuned Linear Algebra Software , 1998, Proceedings of the IEEE/ACM SC98 Conference.

[16]  Xin Yuan,et al.  CC--MPI: a compiled communication capable MPI prototype for ethernet switched clusters , 2003, PPoPP '03.

[17]  Torsten Hoefler,et al.  Exploiting Offload-Enabled Network Interfaces , 2015, IEEE Micro.

[18]  D. Panda,et al.  High Performance RDMA Based All-to-All Broadcast for InfiniBand Clusters , 2005, HiPC.

[19]  J.C. Sancho,et al.  Quantifying the Potential Benefit of Overlapping Communication and Computation in Large-Scale Scientific Applications , 2006, ACM/IEEE SC 2006 Conference (SC'06).

[20]  Yuefan Deng,et al.  New trends in high performance computing , 2001, Parallel Computing.

[21]  James Demmel,et al.  Statistical Models for Empirical Search-Based Performance Tuning , 2004, Int. J. High Perform. Comput. Appl..

[22]  Dan Bonachea GASNet Specification, v1.1 , 2002 .

[23]  Katherine Yelick,et al.  Introduction to UPC and Language Specification , 2000 .

[24]  Edgar Gabriel,et al.  A Tool for Optimizing Runtime Parameters of Open MPI , 2008, PVM/MPI.

[25]  Kewei Cheng,et al.  Feature Selection , 2016, ACM Comput. Surv..

[26]  D. Martin Swany,et al.  Gravel: A Communication Library to Fast Path MPI , 2008, PVM/MPI.

[27]  Patricia J. Teller,et al.  MPI Advisor: a Minimal Overhead Tool for MPI Library Performance Tuning , 2015, EuroMPI.

[28]  Jie Wang,et al.  Optimizing MPI Runtime Parameter Settings by Using Machine Learning , 2009, PVM/MPI.

[29]  Thomas G. Dietterich,et al.  Machine Learning Bias, Statistical Bias, and Statistical Variance of Decision Tree Algorithms , 2008 .

[30]  Robert Hecht-Nielsen,et al.  Theory of the backpropagation neural network , 1989, International 1989 Joint Conference on Neural Networks.

[31]  Vladimir Vapnik,et al.  Statistical learning theory , 1998 .

[32]  Roger W. Hockney,et al.  The Communication Challenge for MPP: Intel Paragon and Meiko CS-2 , 1994, Parallel Computing.

[33]  Katherine Yelick,et al.  Titanium: a high-performance Java dialect , 1998 .

[34]  Torsten Hoefler,et al.  Using Compiler Techniques to Improve Automatic Performance Modeling , 2015, 2015 International Conference on Parallel Architecture and Compilation (PACT).

[35]  Xin Yuan,et al.  Automatic generation and tuning of MPI collective communication routines , 2005, ICS '05.

[36]  Xin Yuan,et al.  STAR-MPI: self tuned adaptive routines for MPI collective operations , 2006, ICS '06.

[37]  Sushmitha P. Kini,et al.  Fast and Scalable Barrier Using RDMA and Multicast Mechanisms for InfiniBand-Based Clusters , 2003, PVM/MPI.

[38]  Lior Rokach,et al.  Pattern Classification Using Ensemble Methods , 2009, Series in Machine Perception and Artificial Intelligence.

[39]  Sayantan Sur,et al.  Design and Evaluation of Generalized Collective Communication Primitives with Overlap Using ConnectX-2 Offload Engine , 2010, 2010 18th IEEE Symposium on High Performance Interconnects.

[40]  Emilio Corchado,et al.  A survey of multiple classifier systems as hybrid systems , 2014, Inf. Fusion.

[41]  Manish Gupta,et al.  Compiler-controlled extraction of computation-communication overlap in MPI applications , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.

[42]  Torsten Hoefler,et al.  Principles for coordinated optimization of computation and communication in large-scale parallel systems , 2008 .

[43]  Jason Duell,et al.  The Lam/Mpi Checkpoint/Restart Framework: System-Initiated Checkpointing , 2005, Int. J. High Perform. Comput. Appl..

[44]  Rajeev Thakur,et al.  Improving the Performance of Collective Operations in MPICH , 2003, PVM/MPI.

[45]  Lori Pollock,et al.  Implementing an Open 64-based Tool for Improving the Performance of MPI Programs , 2008 .

[46]  D. Panda,et al.  Efficient Barrier and Allreduce on IBA clusters using hardware multicast and adaptive algorithms , 2004 .

[47]  Nectarios Koziris,et al.  A pipelined schedule to minimize completion time for loop tiling with computation and communication overlapping , 2003, J. Parallel Distributed Comput..

[48]  Martin Schulz,et al.  Formal analysis of MPI-based parallel programs , 2011, Commun. ACM.

[49]  Message P Forum,et al.  MPI: A Message-Passing Interface Standard , 1994 .

[50]  Darren J. Kerbyson,et al.  Improving the Performance of Multiple Conjugate Gradient Solvers by Exploiting Overlap , 2008, Euro-Par.

[51]  K. J. Ottenstein,et al.  Data-flow graphs as an intermediate program form. , 1978 .

[52]  D. Martin Swany,et al.  Transformations to Parallel Codes for Communication-Computation Overlap , 2005, ACM/IEEE SC 2005 Conference (SC'05).

[53]  D. Martin Swany,et al.  Photon: Remote Memory Access Middleware for High-Performance Runtime Systems , 2016, 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW).

[54]  Chris J. Scheiman,et al.  LogGP: incorporating long messages into the LogP model—one step closer towards a realistic model for parallel computation , 1995, SPAA '95.

[55]  Torsten Hoefler,et al.  Using automated performance modeling to find scalability bugs in complex codes , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[56]  J. Kruskal Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis , 1964 .

[57]  Message Passing Interface Forum MPI: A message - passing interface standard , 1994 .

[58]  D. Martin Swany,et al.  Automatic MPI application transformation with ASPhALT , 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.

[59]  Kees Verstoep,et al.  Fast Measurement of LogP Parameters for Message Passing Platforms , 2000, IPDPS Workshops.

[60]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[61]  Wei Chen,et al.  Message Strip-Mining Heuristics for High Speed Networks , 2004, VECPAR.

[62]  Jeffrey M. Squyres,et al.  The Component Architecture of Open MPI: Enabling Third-Party Collective Algorithms* , 2005 .

[63]  Jack J. Dongarra,et al.  Automated empirical optimizations of software and the ATLAS project , 2001, Parallel Comput..

[64]  D. Martin Swany,et al.  MPI-aware compiler optimizations for improving communication-computation overlap , 2009, ICS.

[65]  G. Fagg,et al.  Flexible collective communication tuning architecture applied to Open MPI , 2006 .

[66]  Luis Díaz de Cerio,et al.  A Method for Exploiting Communication/Computation Overlap in Hypercubes , 1998, Parallel Comput..

[67]  Greg Bronevetsky,et al.  Communication-Sensitive Static Dataflow for Parallel Message Passing Applications , 2009, 2009 International Symposium on Code Generation and Optimization.

[68]  Jack J. Dongarra,et al.  MPI Collective Algorithm Selection and Quadtree Encoding , 2006, PVM/MPI.

[69]  E. Smith Methods of Multivariate Analysis , 1997 .

[70]  D. Qainlant,et al.  ROSE: Compiler Support for Object-Oriented Frameworks , 1999 .

[71]  Torsten Hoefler,et al.  Design, Implementation, and Usage of LibNBC , 2006 .

[72]  Rajeev Thakur,et al.  Optimization of Collective Communication Operations in MPICH , 2005, Int. J. High Perform. Comput. Appl..

[73]  James Demmel,et al.  Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology , 1997, ICS '97.

[74]  Katherine A. Yelick,et al.  Optimizing bandwidth limited problems using one-sided communication and overlap , 2005, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.

[75]  Torsten Hoefler,et al.  Energy, Memory, and Runtime Tradeoffs for Implementing Collective Communication Operations , 2014, Supercomput. Front. Innov..

[76]  Dipl.-Inf. Torsten Hoefler,et al.  A Survey of Barrier Algorithms for Coarse Grained Supercomputers , 2005 .

[77]  PattersonDavid,et al.  LogP: towards a realistic model of parallel computation , 1993 .

[78]  Jelena Pjesivac-Grbovic,et al.  Towards Automatic and Adaptive Optimizations of MPI Collective Operations , 2007 .

[79]  David E. Culler,et al.  U-Net/SLE: A Java-based user-customizable virtual network interface , 1999, Sci. Program..

[80]  Torsten Hoefler,et al.  Optimizing a conjugate gradient solver with non-blocking collective operations , 2006, Parallel Comput..

[81]  Paul D. Hovland,et al.  Data-Flow Analysis for MPI Programs , 2006, 2006 International Conference on Parallel Processing (ICPP'06).

[82]  Torsten Hoefler,et al.  Leveraging non-blocking collective communication in high-performance applications , 2008, SPAA '08.

[83]  Rolf Rabenseifner,et al.  Automatic Profiling of MPI Applications with Hardware Performance Counters , 1999, PVM/MPI.

[84]  Torsten Hoefler,et al.  Automatic Performance Modeling of HPC Applications , 2016, Software for Exascale Computing.

[85]  Thomas L. Sterling,et al.  ParalleX An Advanced Parallel Execution Model for Scaling-Impaired Applications , 2009, 2009 International Conference on Parallel Processing Workshops.

[86]  Lori Pollock,et al.  Program Flow Graph Construction for Static Analysis of Explicitly Parallel Message-Passing Programs , 2000 .