Automatic translation of MPI source into a latency-tolerant, data-driven form
Lawrence Berkeley National Laboratory, Recent Work

Hiding communication behind useful computation is an important performance programming technique but remains an inscrutable programming exercise even for the expert. We present Bamboo, a code transformation framework that can realize communication overlap in applications written in MPI without the need to intrusively modify the source code. We reformulate MPI source into a task dependency graph representation, which partially orders the tasks, enabling the program to execute in a data-driven fashion under the control of an external runtime system. Experimental results demonstrate that Bamboo significantly reduces communication delays while requiring only modest amounts of programmer annotation for a variety of applications and platforms, including those employing co-processors and accelerators. Moreover, Bamboo's performance meets or exceeds that of labor-intensive hand coding. The translator is more than a means of hiding communication costs automatically; it demonstrates the utility of semantic-level optimization against a well-known library. © 2017 Elsevier Inc. All rights reserved.
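The idea of a task dependency graph that partially orders tasks and lets them run as their data arrives can be illustrated with a minimal sketch. This is not Bamboo's translator or runtime and it does not use MPI. The function `run_data_driven`, the task names, and the toy values are all hypothetical, chosen only to show how a task that needs no incoming message (`interior`) can run without waiting for the task that consumes one (`boundary`).

```python
from collections import deque


def run_data_driven(tasks, deps):
    """Run tasks in an order dictated only by data availability.

    tasks: dict mapping a task name to a callable that takes a dict
           {producer_name: produced_value}.
    deps:  dict mapping a task name to the list of producers it consumes.
    Returns (execution_order, results). Hypothetical helper, not a Bamboo API.
    """
    pending = {name: set(deps.get(name, [])) for name in tasks}
    consumers = {name: [] for name in tasks}
    for name, producers in pending.items():
        for producer in producers:
            if producer not in tasks:
                raise ValueError(f"unknown producer: {producer}")
            consumers[producer].append(name)

    # Tasks with no unmet inputs are ready immediately.
    ready = deque(name for name in tasks if not pending[name])
    results, order = {}, []
    while ready:
        name = ready.popleft()
        inputs = {p: results[p] for p in deps.get(name, [])}
        results[name] = tasks[name](inputs)
        order.append(name)
        # Delivering this output may make downstream tasks runnable.
        for consumer in consumers[name]:
            pending[consumer].discard(name)
            if not pending[consumer]:
                ready.append(consumer)
    if len(order) != len(tasks):
        raise ValueError("dependency cycle detected")
    return order, results


# Toy stencil-like step: "recv_halo" stands in for an arriving message.
tasks = {
    "recv_halo": lambda _: [0.0, 9.0],
    "interior": lambda _: sum([1.0, 2.0, 3.0]),
    "boundary": lambda inp: sum(inp["recv_halo"]),
    "combine": lambda inp: inp["interior"] + inp["boundary"],
}
deps = {"boundary": ["recv_halo"], "combine": ["interior", "boundary"]}

order, results = run_data_driven(tasks, deps)
print(order, results["combine"])
```

The graph only requires `boundary` to follow `recv_halo` and `combine` to follow both of its inputs, so a real runtime would be free to schedule `interior` while the message is still in flight. That freedom is what makes communication overlap possible in a data-driven execution model.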
