论文信息 - Improving communication performance in dense linear algebra via topology aware collectives

Improving communication performance in dense linear algebra via topology aware collectives

Recent results have shown that topology aware mapping reduces network contention in communication-intensive kernels on massively parallel machines. We demonstrate that on mesh interconnects, topology aware mapping also allows for the utilization of highly-efficient topology aware collectives. We map novel 2.5D dense linear algebra algorithms to exploit rectangular collectives on cuboid partitions allocated by a Blue Gene/P supercomputer. Our mappings allow the algorithms to exploit optimized line multicasts and reductions. Commonly used 2D algorithms cannot be mapped in this fashion. On 16,384 nodes (65,536 cores) of Blue Gene/P, 2.5D algorithms that exploit rectangular collectives are sig- nificantly faster than 2D matrix multiplication (MM) and LU factorization, up to 8.7x and 2.1x, respectively. These speed-ups are due to communication reduction (up to 95.6% for 2.5D MM with respect to 2D MM). We also derive LogP- based novel performance models for rectangular broadcasts and reductions. Using those, we model the performance of matrix multiplication and LU factorization on a hypothetical exascale architecture.

[1] S. Dosanjh,et al. Architectures and Technology for Extreme Scale Computing Report from the Workshop Node Architecture and Power Reduction Strategies , 2011 .

[2] S. Lennart Johnsson,et al. Optimum Broadcasting and Personalized Communication in Hypercubes , 1989, IEEE Trans. Computers.

[3] Philip K. McKinley,et al. Collective Communication in Wormhole-Routed Massively Parallel Computers , 1995, Computer.

[4] Robert A. van de Geijn,et al. SUMMA: scalable universal matrix multiplication algorithm , 1995, Concurr. Pract. Exp..

[5] James Demmel,et al. CALU: A Communication Optimal LU Factorization Algorithm , 2011, SIAM J. Matrix Anal. Appl..

[6] Rajeev Thakur,et al. Optimization of Collective Communication Operations in MPICH , 2005, Int. J. High Perform. Comput. Appl..

[7] Robert A. van de Geijn,et al. Broadcasting on Meshes with Wormhole Routing , 1996, J. Parallel Distributed Comput..

[8] Lynn Elliot Cannon,et al. A cellular computer to implement the kalman filter algorithm , 1969 .

[9] William J. Dally,et al. Technology-Driven, Highly-Scalable Dragonfly Topology , 2008, 2008 International Symposium on Computer Architecture.

[10] Josep Torrellas. Architectures for Extreme-Scale Computing , 2009, Computer.

[11] Amith R. Mamidala,et al. MPI Collective Communications on The Blue Gene/P Supercomputer: Algorithms and Optimizations , 2009, 2009 17th IEEE Symposium on High Performance Interconnects.

[12] James Demmel,et al. Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms , 2011, Euro-Par.

[13] W HockneyRoger. The communication challenge for MPP , 1994 .

[14] Dror Irony,et al. Trading Replication for Communication in Parallel Distributed-Memory Dense Solvers , 2002, Parallel Process. Lett..

[15] Roger W. Hockney,et al. The Communication Challenge for MPP: Intel Paragon and Meiko CS-2 , 1994, Parallel Computing.

[16] Jack J. Dongarra,et al. Performance analysis of MPI collective operations , 2005, 19th IEEE International Parallel and Distributed Processing Symposium.

[17] Jack Dongarra,et al. ScaLAPACK user's guide , 1997 .

[18] James Demmel,et al. Communication avoiding Gaussian elimination , 2008, HiPC 2008.

[19] P. Sadayappan,et al. Communication efficient matrix multiplication on hypercubes , 1994, SPAA '94.

[20] D. G. Payne,et al. Broadcasting on Meshes with Worm-hole Routing , 1996 .

[21] Ramesh Subramonian,et al. LogP: towards a realistic model of parallel computation , 1993, PPOPP '93.

[22] Chris J. Scheiman,et al. LogGP: incorporating long messages into the LogP model—one step closer towards a realistic model for parallel computation , 1995, SPAA '95.

[23] Ramesh C. Agarwal,et al. A three-dimensional approach to parallel matrix multiplication , 1995, IBM J. Res. Dev..

[24] P. Sadayappan,et al. Communication-Efficient Matrix Multiplication on Hypercubes , 1996, Parallel Comput..

[25] Robert A. van de Geijn,et al. SUMMA: Scalable Universal Matrix Multiplication Algorithm , 1995 .

[26] Alok Aggarwal,et al. Communication Complexity of PRAMs , 1990, Theor. Comput. Sci..

[27] Philip Heidelberger,et al. The deep computing messaging framework: generalized scalable message passing on the blue gene/P supercomputer , 2008, ICS '08.

[28] James Demmel,et al. Minimizing Communication in Linear Algebra , 2009, ArXiv.