Poster: Beating MKL and ScaLAPACK at Rectangular Matrix Multiplication Using the BFS/DFS Approach

- We implement CARMA, a Communication-Avoiding Recursive Matrix multiplication Algorithm [1].
- CARMA is the first communication-optimal parallel matrix multiplication algorithm for all matrix dimensions.
- The shared-memory version of CARMA is only ~50 lines of code, much simpler than 3D SUMMA or the rectangular version of 2.5D [2] (a minimal recursive sketch appears below).
- CARMA is faster than MKL and ScaLAPACK in practice:
  o Faster for "skinny" matrices in which k is the largest dimension: up to 7X speedup single-node, 141X speedup distributed.
  o Faster for large square matrices: up to 1.2X speedup single-node, 3X speedup distributed.
  o Comparable performance for other matrix dimensions.
  o The speedup is mainly due to reduced communication (see the bar charts in "Performance Results").
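To make the BFS/DFS recursion concrete, here is a minimal shared-memory sketch in C++ with OpenMP tasks. It is illustrative only, not the poster's implementation: the function name rec_mm, the cutoff of 64, and the naive triple loop at the base case are assumptions (CARMA proper calls a tuned sequential kernel, and chooses between BFS steps, which run both subproblems in parallel, and DFS steps, which run them sequentially to bound memory).

```cpp
// carma_sketch.cpp -- illustrative sketch only; compile with: g++ -O2 -fopenmp carma_sketch.cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Recursively multiply an m x k block of A by a k x n block of B, accumulating
// into an m x n block of C (row-major; lda/ldb/ldc are the row strides of the
// enclosing matrices). At every level the LARGEST of m, n, k is split in half,
// which is what adapts the recursion to rectangular shapes.
void rec_mm(const double* A, const double* B, double* C,
            std::size_t m, std::size_t n, std::size_t k,
            std::size_t lda, std::size_t ldb, std::size_t ldc) {
    const std::size_t cutoff = 64;  // hypothetical base-case size; a real code would call a tuned kernel here
    if (m <= cutoff && n <= cutoff && k <= cutoff) {
        for (std::size_t i = 0; i < m; ++i)
            for (std::size_t p = 0; p < k; ++p)
                for (std::size_t j = 0; j < n; ++j)
                    C[i * ldc + j] += A[i * lda + p] * B[p * ldb + j];
        return;
    }
    if (m >= n && m >= k) {          // split rows of A and C: halves are independent (BFS-style, run in parallel)
        std::size_t h = m / 2;
        #pragma omp task
        rec_mm(A, B, C, h, n, k, lda, ldb, ldc);
        rec_mm(A + h * lda, B, C + h * ldc, m - h, n, k, lda, ldb, ldc);
        #pragma omp taskwait
    } else if (n >= k) {             // split columns of B and C: also independent
        std::size_t h = n / 2;
        #pragma omp task
        rec_mm(A, B, C, m, h, k, lda, ldb, ldc);
        rec_mm(A, B + h, C + h, m, n - h, k, lda, ldb, ldc);
        #pragma omp taskwait
    } else {                         // split k: both halves accumulate into the same C, so run sequentially (DFS-style)
        std::size_t h = k / 2;
        rec_mm(A, B, C, m, n, h, lda, ldb, ldc);
        rec_mm(A + h, B + h * ldb, C, m, n, k - h, lda, ldb, ldc);
    }
}

int main() {
    std::size_t m = 128, n = 128, k = 4096;  // a "skinny" shape with k as the largest dimension
    std::vector<double> A(m * k, 1.0), B(k * n, 1.0), C(m * n, 0.0);
    #pragma omp parallel
    #pragma omp single
    rec_mm(A.data(), B.data(), C.data(), m, n, k, k, n, n);
    std::printf("C[0] = %f (expect %f)\n", C[0], (double)k);
    return 0;
}
```

The two halves of an m- or n-split write disjoint parts of C, so they can safely run as parallel tasks; a k-split accumulates into the same C and is run sequentially in this sketch, which is the memory/parallelism trade-off the BFS/DFS framework navigates.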

[1] James Demmel et al. Communication-Optimal Parallel Recursive Rectangular Matrix Multiplication. 2013 IEEE 27th International Symposium on Parallel and Distributed Processing (IPDPS), 2013.

[2] James Demmel et al. Communication-Optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms. Euro-Par, 2011.

[3] John Shalf et al. SEJITS: Getting Productivity and Performance With Selective Embedded JIT Specialization. 2010.

[4] James Demmel et al. Brief announcement: strong scaling of matrix multiplication algorithms and memory-independent communication lower bounds. SPAA '12, 2012.

[5] Matteo Frigo et al. Cache-oblivious algorithms. 40th Annual Symposium on Foundations of Computer Science (FOCS), 1999.

[6] Dror Irony et al. Communication lower bounds for distributed-memory matrix multiplication. J. Parallel Distributed Comput., 2004.

[7] James Demmel et al. Communication-optimal parallel algorithm for Strassen's matrix multiplication. SPAA '12, 2012.

[8] James Demmel et al. Communication-Avoiding Parallel Strassen: Implementation and performance. 2012 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2012.