Paper information - Automatic translation of MPI source into a latency-tolerant, data-driven form

Hiding communication behind useful computation is an important performance programming technique, but it remains an inscrutable programming exercise even for the expert. We present Bamboo, a code transformation framework that can realize communication overlap in applications written in MPI without the need to intrusively modify the source code. We reformulate MPI source into a task dependency graph representation, which partially orders the tasks, enabling the program to execute in a data-driven fashion under the control of an external runtime system. Experimental results demonstrate that Bamboo significantly reduces communication delays while requiring only modest amounts of programmer annotation for a variety of applications and platforms, including those employing co-processors and accelerators. Moreover, Bamboo's performance meets or exceeds that of labor-intensive hand coding. The translator is more than a means of hiding communication costs automatically; it demonstrates the utility of semantic-level optimization against a well-known library.

Highlights: Bamboo is a translator that can reformulate MPI source into a task graph form. Bamboo supports both point-to-point and collective communication. Bamboo supports GPUs, hiding communication among GPUs and between hosts and GPUs. Bamboo speeds up applications containing elaborate data and control structures.
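To make the idea of data-driven execution over a task dependency graph concrete, the following is a minimal sketch in Python. It is not Bamboo's translator, runtime, or API: the function names (`run_data_driven`, `halo_example`), the task names, and the simulated message arrivals are all hypothetical and chosen only for illustration. The sketch assumes a toy halo exchange in which an interior computation needs no remote data, while two boundary computations each wait for one incoming message.

```python
from collections import deque


def run_data_driven(tasks, deps, arrivals):
    """Run tasks as soon as their inputs exist (hypothetical toy scheduler).

    tasks:    dict mapping task name -> callable taking {input_name: value}
    deps:     dict mapping task name -> list of inputs (task or message names)
    arrivals: list of (message_name, value) in simulated arrival order
    Returns (available_values, execution_order).
    """
    available = {}          # values produced by tasks or delivered as messages
    pending = dict(deps)    # tasks that have not fired yet
    order = []              # order in which tasks actually executed
    incoming = deque(arrivals)

    def fire_ready():
        # Repeatedly fire any task whose inputs are all available.
        progressed = True
        while progressed:
            progressed = False
            for name in sorted(pending):
                if all(d in available for d in pending[name]):
                    inputs = {d: available[d] for d in pending[name]}
                    available[name] = tasks[name](inputs)
                    order.append(name)
                    del pending[name]
                    progressed = True
                    break  # restart the scan: pending has changed

    fire_ready()  # tasks with no remote inputs run before any message arrives
    while incoming:
        message, value = incoming.popleft()
        available[message] = value
        fire_ready()

    if pending:
        raise RuntimeError(f"tasks never became ready: {sorted(pending)}")
    return available, order


def halo_example():
    """Toy halo exchange: the interior overlaps with message delivery."""
    tasks = {
        "interior": lambda inputs: 10,
        "left": lambda inputs: inputs["halo_left"] + 1,
        "right": lambda inputs: inputs["halo_right"] + 2,
        "combine": lambda inputs: (
            inputs["interior"] + inputs["left"] + inputs["right"]
        ),
    }
    deps = {
        "interior": [],
        "left": ["halo_left"],
        "right": ["halo_right"],
        "combine": ["interior", "left", "right"],
    }
    # The right halo is assumed to arrive before the left one.
    arrivals = [("halo_right", 5), ("halo_left", 3)]
    return run_data_driven(tasks, deps, arrivals)


if __name__ == "__main__":
    results, order = halo_example()
    print(order)               # ['interior', 'right', 'left', 'combine']
    print(results["combine"])  # 21
```

The dependency graph only partially orders the tasks, so the execution order follows data availability instead of program text: the interior task runs without waiting for any message, and the boundary tasks fire in whatever order their halos arrive. A blocking formulation would instead stall at the first receive.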
