Matrix factorization using a block-recursive structure and block-recursive algorithms
暂无分享,去创建一个
The divide-and-conquer paradigm yields algorithms that parallelize easily, a very important consideration in high-performance computing. However, high-performance computing also relies on local reuse of data in a memory hierarchy of registers, caches, main memory, and swapping disks.
I have worked with a sequential representation of quadtree matrices that is a divide-and-conquer data structure. This representation uses an indexing scheme, Morton ordering, that automatically blocks the elements of the matrix and promotes memory locality in recursive algorithms over the quadtree. This work focuses on using this representation for two important matrix algorithms, matrix multiplication and QR factorization.
Techniques for generating independent and local code were discovered while implementing a recursive matrix-matrix multiplication over the sequential quadtree matrix. For good memory locality the basic multiplication routine was written as two dual, mutually recursive functions with the recursive calls in each version precisely ordered to reuse data already in cache.
To avoid unnecessary work, minimal decorations annotate each submatrix of the matrix; these decorations are used to ignore zero blocks and to ellide the zero tests on blocks known to be dense. Finally, a strategy was developed for dispatching parallel processes in the multiplication algorithm.
These lessons were then used to develop an efficient QR factorization algorithm for quadtree matrices. It also uses recursive functions that localize data in blocks and that balance parallel processing.
The sequential quadtree matrix and multiplication functions are implemented in C because of its optimizing compilers. Since C optimizers favor iterative code and are deficient for recursive code, some base optimizations were done by hand. These hand optimizations could and should be done by some future optimizing compiler.
Results are mostly favorable; while the quadtree-matrix algorithms do not always perform at the same level as decades-old routines (i.e., BLAS and LAPACK), which are fine-tuned for traditional storage, the quadtree algorithms are competitive and even expose some flaws in these old routines.