QR factorization with Morton-ordered quadtree matrices for memory re-use and parallelism

Quadtree matrices using Morton-order storage provide natural blocking at every level of a memory hierarchy. Writing the natural recursive algorithms to exploit this blocking yields code that honors the memory hierarchy without any code transformation. Furthermore, the divide-and-conquer structure breaks problems into independent computations, which can be dispatched concurrently for straightforward parallel processing. Proof of concept is given by an algorithm for QR factorization based on Givens rotations for quadtree matrices in Morton-order storage. The resulting algorithms deliver positive results, competing with, and in some cases outperforming, the LAPACK equivalent.
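As a rough illustration of the storage scheme (a sketch, not code from the paper), the following C fragment computes the Morton-order offset of element (row, col) by interleaving the bits of the two indices. Elements of any aligned power-of-two quadrant then occupy a contiguous run of the linear array, which is what gives blocking at every level of the memory hierarchy and lets the four quadrants of a divide-and-conquer step be handled independently. The bit convention and names are illustrative assumptions.

    /* Hypothetical sketch: Morton (Z-order) offset of element (row, col) in a
     * 2^n x 2^n matrix, obtained by interleaving the bits of the two indices.
     * The row/column bit convention here is illustrative; the paper's exact
     * layout may differ. */
    #include <stdint.h>
    #include <stdio.h>

    static uint64_t morton_index(uint32_t row, uint32_t col)
    {
        uint64_t z = 0;
        for (int b = 0; b < 32; b++) {
            z |= (uint64_t)((col >> b) & 1u) << (2 * b);     /* even bits from the column */
            z |= (uint64_t)((row >> b) & 1u) << (2 * b + 1); /* odd bits from the row */
        }
        return z;
    }

    int main(void)
    {
        /* For a 4x4 matrix, offsets 0..3 form the NW quadrant, 4..7 the NE,
         * 8..11 the SW, and 12..15 the SE: each quadrant is contiguous, so a
         * recursive algorithm can hand each one to a separate task. */
        for (uint32_t r = 0; r < 4; r++) {
            for (uint32_t c = 0; c < 4; c++)
                printf("%2llu ", (unsigned long long)morton_index(r, c));
            printf("\n");
        }
        return 0;
    }

Because each quadrant is itself stored in Morton order, the same indexing applies recursively, so the blockwise factorization never needs explicit tile-size tuning for the cache levels.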
