A Paradigm for Parallel Matrix Algorithms

A style for programming problems from matrix algebra is developed with a familiar example and new tools, yielding high performance with a couple of surprising exceptions. The underlying philosophy is to use block recursion as the exclusive control structure, at least down to a $2^p \times 2^p$ base case, where hardware favors iterative style to fill its pipeline. Use of Morton-ordered matrices yields excellent locality within the memory hierarchy, including block sharing among distributed computers. The recursion generalizes nicely to an SPMD program where such sharing is the only communication. Cholesky factorization of an $n \times n$ symmetric positive definite (SPD) matrix is used as a simple nontrivial example to expose the paradigm. The program amounts to four functions, two of which are finalizers for the other two. This insight allows final blocks to be shared with inter-node communication in $\Theta(n^2)$ for an algorithm that performs $\Theta(n^3)$ flops.
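The combination of block recursion and Morton order described above can be illustrated with a minimal sequential sketch. It is not the paper's four-function program: the names `morton`, `mms`, `trsm`, `chol`, and `cholesky_lower` are hypothetical, the recursion runs all the way down to 1 × 1 blocks instead of switching to an iterative $2^p \times 2^p$ base case, and the matrix order is assumed to be a power of two. What it does show is that, in a Morton-ordered array, each quadrant of a block is a contiguous run of memory, so the recursion only passes offsets.

```python
import math

def morton(i, j):
    """Interleave bits: row bits land on odd positions, column bits on even ones."""
    z, bit = 0, 0
    while (i >> bit) or (j >> bit):
        z |= ((j >> bit) & 1) << (2 * bit)
        z |= ((i >> bit) & 1) << (2 * bit + 1)
        bit += 1
    return z

def mms(M, c, x, y, n):
    """C -= X * Y^T for n-by-n blocks stored at offsets c, x, y of M."""
    if n == 1:
        M[c] -= M[x] * M[y]
        return
    h, q = n // 2, (n // 2) ** 2          # q = size of one quadrant
    for I in (0, 1):
        for J in (0, 1):
            for K in (0, 1):
                # Quadrant (I, J) of a block sits at offset (2*I + J) * q.
                mms(M, c + (2 * I + J) * q, x + (2 * I + K) * q,
                    y + (2 * J + K) * q, h)

def trsm(M, b, l, n):
    """B := B * L^{-T}, where L is the lower-triangular block at offset l."""
    if n == 1:
        M[b] /= M[l]
        return
    h, q = n // 2, (n // 2) ** 2
    b11, b12, b21, b22 = b, b + q, b + 2 * q, b + 3 * q
    l11, l21, l22 = l, l + 2 * q, l + 3 * q   # the NE quadrant of L is never read
    trsm(M, b11, l11, h)
    trsm(M, b21, l11, h)
    mms(M, b12, b11, l21, h)
    mms(M, b22, b21, l21, h)
    trsm(M, b12, l22, h)
    trsm(M, b22, l22, h)

def chol(M, a, n):
    """Overwrite the lower triangle of the block at offset a with its Cholesky factor."""
    if n == 1:
        M[a] = math.sqrt(M[a])
        return
    h, q = n // 2, (n // 2) ** 2
    nw, sw, se = a, a + 2 * q, a + 3 * q
    chol(M, nw, h)            # L11
    trsm(M, sw, nw, h)        # L21 = A21 * L11^{-T}
    mms(M, se, sw, sw, h)     # A22 -= L21 * L21^T (full block, for simplicity)
    chol(M, se, h)            # L22

def cholesky_lower(rows):
    """Factor an SPD matrix given as a list of rows; the order must be a power of two."""
    n = len(rows)
    assert n > 0 and (n & (n - 1)) == 0, "sketch assumes a power-of-two order"
    M = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            M[morton(i, j)] = float(rows[i][j])
    chol(M, 0, n)
    return [[M[morton(i, j)] if j <= i else 0.0 for j in range(n)]
            for i in range(n)]
```

Calling `cholesky_lower([[4, 2], [2, 3]])` returns the lower-triangular factor as a list of rows. The sketch updates the full trailing block in `mms` even though only its lower triangle is needed, which wastes flops; a tuned version would exploit symmetry and would replace the 1 × 1 base cases with iterative kernels on larger blocks.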
