Architecture-Cognizant Divide and Conquer Algorithms

Divide and conquer programs can achieve good performance on parallel computers and computers with deep memory hierarchies. We introduce architecture-cognizant divide and conquer algorithms, and explore how they can achieve even better performance. An architecture-cognizant algorithm has functionally equivalent variants of the divide and/or combine functions, and a variant policy that specifies which variant to use at each level of recursion. An optimal variant policy is chosen for each target computer via experimentation. With $h$ levels of recursion, an exhaustive search requires $\Theta(v^h)$ experiments (where $v$ is the number of variants). We present a method based on dynamic programming that reduces this to $\Theta(v^c)$ experiments (where $c$ is typically a small constant) for a class of architecture-cognizant programs. We verify our technique on two kernels (matrix multiply and 2-D Point Jacobi) using three architectures. Our technique improves performance by up to a factor of two compared to architecture-oblivious divide and conquer implementations. Further, our dynamic programming approach succeeds in selecting the optimal variant policy.
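To make the notion of a variant policy concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a toy kernel (summing a list) with two hypothetical, functionally equivalent divide variants, `split_halves` and `split_interleaved`; a policy is a tuple naming one variant per recursion level. The exhaustive search times every policy, which is where the $v^h$ experiment count comes from. The dynamic programming method itself is not reproduced here, since the abstract does not give its details.

```python
import itertools
import time

# Two functionally equivalent divide variants (hypothetical, for illustration).
def split_halves(xs):
    mid = len(xs) // 2
    return xs[:mid], xs[mid:]          # contiguous halves

def split_interleaved(xs):
    return xs[0::2], xs[1::2]          # even- and odd-indexed elements

VARIANTS = {"halves": split_halves, "interleaved": split_interleaved}

def dc_sum(xs, policy, level=0):
    """Sum xs by divide and conquer; policy[level] names the divide variant."""
    if level == len(policy) or len(xs) <= 1:
        return sum(xs)                 # base case: stop recursing
    left, right = VARIANTS[policy[level]](xs)
    return dc_sum(left, policy, level + 1) + dc_sum(right, policy, level + 1)

def all_policies(h):
    """Every assignment of a variant to each of h levels: v**h policies."""
    return list(itertools.product(VARIANTS, repeat=h))

def exhaustive_search(xs, h):
    """Time each policy once and return the fastest (one experiment per policy)."""
    best_policy, best_time = None, float("inf")
    for policy in all_policies(h):
        start = time.perf_counter()
        dc_sum(xs, policy)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_policy, best_time = policy, elapsed
    return best_policy, best_time
```

With $v = 2$ variants and $h = 3$ levels, `all_policies(3)` yields 8 policies; the count grows exponentially in $h$, which is what motivates a search whose cost does not depend on the recursion depth in the same way.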
