Parallel Memory-Independent Communication Bounds for SYRK

In this paper, we focus on the parallel communication cost of multiplying a matrix by its transpose, known as a symmetric rank-k update (SYRK). SYRK requires half the computation of general matrix multiplication because of the symmetry of the output matrix. Recent work (Beaumont et al., SPAA '22) has demonstrated that the sequential I/O complexity of SYRK is also a constant factor smaller than that of general matrix multiplication. Inspired by this progress, we establish memory-independent parallel communication lower bounds for SYRK with smaller constants than those of general matrix multiplication, and we show that these constants are tight by presenting communication-optimal algorithms. The crux of the lower bound proof lies in extending a key geometric inequality to symmetric computations and analytically solving a constrained nonlinear optimization problem. The optimal algorithms use a triangular blocking scheme to distribute the symmetric output matrix, and the corresponding computation, across processors.
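
The "key geometric inequality" here is presumably the Loomis-Whitney inequality [21], the classical tool behind matrix multiplication I/O lower bounds. As a concrete illustration of the symmetry savings, the sketch below (Python/NumPy; the blocking scheme, block size, and function name are our own illustrative choices, not the paper's algorithm) forms only the blocks of C = A·Aᵀ on or below the diagonal, performing roughly half the multiply-adds of a full product; the upper triangle follows by symmetry.

```python
import numpy as np

def syrk_lower(A, b=64):
    """Illustrative blocked SYRK: lower triangle of C = A @ A.T only.

    Forming just the blocks with j <= i does roughly half the
    multiply-adds of a full product; the upper triangle is implied
    by symmetry (C[j, i] = C[i, j]).
    """
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, b):
        for j in range(0, i + b, b):  # j <= i: lower triangle only
            C[i:i+b, j:j+b] = A[i:i+b, :] @ A[j:j+b, :].T
    return C

# Sanity check against a full product.
A = np.random.rand(256, 96)
assert np.allclose(np.tril(syrk_lower(A)), np.tril(A @ A.T))
```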

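The triangular blocking mentioned above distributes the lower triangle of the symmetric output across processors. Here is a minimal sketch of one natural such layout, assuming one processor per block of the lower triangle of a q x q block grid, so P = q(q+1)/2; the row-by-row ranking function and the helper name are hypothetical, not the paper's exact scheme.

```python
def triangular_owner(i, j):
    """Rank of the processor owning lower-triangular block (i, j), i >= j.

    Blocks are ranked row by row: (0,0), (1,0), (1,1), (2,0), ...
    A q x q block grid thus uses P = q*(q+1)//2 processors.
    """
    assert i >= j >= 0
    return i * (i + 1) // 2 + j

# Example: q = 4 block rows of C give P = 10 processors. The owner of
# block (i, j) needs block rows i and j of the input A, so each block
# row of A is shared by exactly q processors.
q = 4
for i in range(q):
    print([triangular_owner(i, j) for j in range(i + 1)])
```

This sketch only fixes ownership of the output; the paper's communication-optimal algorithms additionally choose how the inputs are replicated and how partial sums are combined, and this sketch is not that algorithm.
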
[1] Olivier Beaumont et al. Symmetric Block-Cyclic Distribution: Fewer Communications Leads to Faster Dense Cholesky Factorization. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, 2022.

[2] Hussam Al Daas et al. Brief Announcement: Tight Memory-Independent Parallel Matrix Multiplication Communication Lower Bounds. SPAA, 2022.

[3] Olivier Beaumont et al. I/O-Optimal Algorithms for Symmetric Linear Algebra Kernels. SPAA, 2022.

[4] Alexandros Nikolaos Ziogas et al. On the Parallel I/O Optimality of Linear Algebra Kernels: Near-Optimal Matrix Factorizations. SC21: International Conference for High Performance Computing, Networking, Storage and Analysis, 2021.

[5] P. Sadayappan et al. IOOpt: Automatic Derivation of I/O Complexity Bounds for Affine Programs. PLDI, 2021.

[6] Julien Langou et al. Automated Derivation of Parametric Data Movement Lower Bounds for Affine Programs. PLDI, 2020.

[7] R. van de Geijn et al. A Tight I/O Lower Bound for Matrix Multiplication, 2017.

[8] James Demmel et al. Communication Lower Bounds and Optimal Algorithms for Numerical Linear Algebra. Acta Numerica, 2014.

[9] James Demmel et al. Communication-Optimal Parallel Recursive Rectangular Matrix Multiplication. IEEE 27th International Symposium on Parallel and Distributed Processing (IPDPS), 2013.

[10] Robert A. van de Geijn et al. Elemental: A New Framework for Distributed Memory Dense Matrix Computations. ACM Transactions on Mathematical Software (TOMS), 2013.

[11] James Demmel et al. Brief Announcement: Strong Scaling of Matrix Multiplication Algorithms and Memory-Independent Communication Lower Bounds. SPAA, 2012.

[12] James Demmel et al. Minimizing Communication in Numerical Linear Algebra. SIAM Journal on Matrix Analysis and Applications, 2011.

[13] Robert A. van de Geijn et al. Collective Communication: Theory, Practice, and Experience. Concurrency and Computation: Practice and Experience, 2007.

[14] Yves Robert et al. Revisiting Matrix Product on Master-Worker Platforms. IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2007.

[15] Rajeev Thakur et al. Optimization of Collective Communication Operations in MPICH. International Journal of High Performance Computing Applications, 2005.

[16] Dror Irony et al. Communication Lower Bounds for Distributed-Memory Matrix Multiplication. Journal of Parallel and Distributed Computing, 2004.

[17] Stephen P. Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[18] Eli Upfal et al. Efficient Algorithms for All-to-All Communications in Multiport Message-Passing Systems. IEEE Transactions on Parallel and Distributed Systems, 1997.

[19] Alok Aggarwal et al. Communication Complexity of PRAMs. Theoretical Computer Science, 1990.

[20] Jia-Wei Hong and H. T. Kung. I/O Complexity: The Red-Blue Pebble Game. STOC, 1981.

[21] L. H. Loomis and H. Whitney. An Inequality Related to the Isoperimetric Inequality. Bulletin of the American Mathematical Society, 1949.

[22] Jack Dongarra et al. ScaLAPACK Users' Guide. SIAM, 1997.