Paper information - Lawrence Berkeley National Laboratory Recent Work: Automatic translation of MPI source into a latency-tolerant, data-driven form


Hiding communication behind useful computation is an important performance programming technique but remains an inscrutable programming exercise even for the expert. We present Bamboo, a code transformation framework that can realize communication overlap in applications written in MPI without the need to intrusively modify the source code. We reformulate MPI source into a task dependency graph representation, which partially orders the tasks, enabling the program to execute in a data-driven fashion under the control of an external runtime system. Experimental results demonstrate that Bamboo significantly reduces communication delays while requiring only modest amounts of programmer annotation for a variety of applications and platforms, including those employing co-processors and accelerators. Moreover, Bamboo’s performance meets or exceeds that of labor-intensive hand coding. The translator is more than a means of hiding communication costs automatically; it demonstrates the utility of semantic level optimization against a well-known library. © 2017 Elsevier Inc. All rights reserved.
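To make the idea of data-driven execution over a task dependency graph concrete, here is a minimal sketch in Python. It is not Bamboo's translator or runtime, and it does not use MPI: the function `run_data_driven` and the task names (`post_halo_exchange`, `compute_interior`, `halo_arrived`, `compute_boundary`) are hypothetical, chosen only to illustrate how a partial order leaves independent computation free to run while a message is outstanding.

```python
from collections import deque


def run_data_driven(tasks, deps):
    """Fire each task once all of its predecessors have finished.

    tasks: dict mapping a task name to a zero-argument callable
    deps:  dict mapping a task name to the set of names it waits on
    Returns the task names in the order they fired.
    (Hypothetical helper for illustration; not part of Bamboo.)
    """
    # Outstanding predecessors for every task.
    remaining = {name: set(deps.get(name, ())) for name in tasks}
    # Reverse edges: which tasks are waiting on each task.
    successors = {name: [] for name in tasks}
    for name, needed in remaining.items():
        for d in needed:
            successors[d].append(name)

    # Tasks with no predecessors are ready immediately.
    ready = deque(name for name in tasks if not remaining[name])
    order = []
    while ready:
        name = ready.popleft()
        tasks[name]()
        order.append(name)
        # Completing a task may make its successors ready.
        for succ in successors[name]:
            remaining[succ].discard(name)
            if not remaining[succ]:
                ready.append(succ)
    if len(order) != len(tasks):
        raise ValueError("dependency cycle: some tasks never became ready")
    return order


log = []
tasks = {
    "post_halo_exchange": lambda: log.append("halo messages posted"),
    "compute_interior": lambda: log.append("interior updated"),
    "halo_arrived": lambda: log.append("halo received"),
    "compute_boundary": lambda: log.append("boundary updated"),
}
# Only the boundary update waits on communication; the interior update
# is unordered with respect to it, which is where overlap comes from.
deps = {
    "halo_arrived": {"post_halo_exchange"},
    "compute_boundary": {"halo_arrived"},
}
order = run_data_driven(tasks, deps)
print(order)
```

In this toy graph the interior update has no edge to the communication tasks, so a scheduler is free to run it between posting the exchange and consuming the received data. A real runtime would make that choice based on actual message arrival rather than on a simple queue.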

D. Quinlan | E. Bylaska | Pietro Cicotti | S. Baden | T. Nguyen
