COSTA: Communication-Optimal Shuffle and Transpose Algorithm with Process Relabeling

Communication-avoiding algorithms for linear algebra have become increasingly popular, in particular on distributed-memory architectures. In practice, these algorithms assume that the data is already distributed in a specific layout, which makes data reshuffling a prerequisite for using them. For performance reasons, a straightforward all-to-all exchange must be avoided. Here, we show that process relabeling (i.e. permuting the processes in the final layout) can be used to obtain communication optimality for data reshuffling, and that the optimal relabeling can be found efficiently by solving a Linear Assignment Problem (Maximum Weight Bipartite Perfect Matching). Based on this, we have developed the Communication-Optimal Shuffle and Transpose Algorithm (COSTA): this highly optimised algorithm implements A = α · op(B) + β · A, with op ∈ {transpose, conjugate-transpose, identity}, on distributed systems, where A and B are matrices with potentially different (distributed) layouts and α and β are scalars. COSTA can take advantage of the communication-optimal process relabeling even on heterogeneous network topologies, where latency and bandwidth differ among nodes. Moreover, our algorithm can easily be generalized to more generic problems, making it suitable for distributed machine learning applications. The implementation is not only multiple times faster than the best available ScaLAPACK redistribute and transpose routines, but also handles more general matrix layouts; in particular, it is not limited to block-cyclic layouts. Finally, we use COSTA to integrate a communication-optimal matrix multiplication algorithm into the CP2K quantum chemistry simulation package. In this way, we show that COSTA can unlock the full potential of recent linear algebra algorithms in applications by facilitating interoperability between algorithms with a wide range of data layouts, in addition to bringing significant redistribution speedups.
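The relabeling idea can be illustrated with a toy example. The following is a minimal sketch, not COSTA's implementation: it assumes a uniform cost per transferred element (no latency or topology terms), represents each layout as a hypothetical list mapping element index to owning process, and finds the best relabeling by brute force over all permutations. The names `comm_volume` and `best_relabeling` are invented for this illustration. A realistic solver would replace the brute-force search with a Linear Assignment Problem algorithm such as the Hungarian method.

```python
from itertools import permutations


def comm_volume(initial_owner, final_owner, num_procs):
    """volume[i][j] = number of elements that process i holds in the
    initial layout and that final-layout label j is supposed to hold."""
    volume = [[0] * num_procs for _ in range(num_procs)]
    for src, dst in zip(initial_owner, final_owner):
        volume[src][dst] += 1
    return volume


def best_relabeling(volume):
    """Return (sigma, kept_local), where sigma[j] is the physical process
    that takes over final-layout label j and kept_local is the number of
    elements that then need no communication. Brute force: O(n!) and only
    meant for tiny examples."""
    n = len(volume)
    best_sigma, best_local = None, -1
    for sigma in permutations(range(n)):
        local = sum(volume[sigma[j]][j] for j in range(n))
        if local > best_local:
            best_sigma, best_local = list(sigma), local
    return best_sigma, best_local


# Hypothetical layouts for 9 elements on 3 processes.
initial_owner = [0, 0, 0, 1, 1, 1, 2, 2, 2]
final_owner = [1, 1, 1, 2, 2, 0, 0, 0, 0]

volume = comm_volume(initial_owner, final_owner, 3)
sigma, kept_local = best_relabeling(volume)

total = len(initial_owner)
moved_without = total - sum(volume[p][p] for p in range(3))  # identity labels
moved_with = total - kept_local

print("relabeling:", sigma)                    # [2, 0, 1]
print("moved without relabeling:", moved_without)  # 9
print("moved with relabeling:", moved_with)        # 1
```

In this toy case every element would have to be sent without relabeling, whereas letting process 2 take label 0, process 0 take label 1, and process 1 take label 2 leaves only one element to communicate. Maximizing the locally kept volume is the maximum-weight perfect matching formulation mentioned above; weighting the entries by link-specific costs would be one way to reflect heterogeneous networks, though that extension is not shown here.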
