Paper Information - Architecture-Cognizant Divide and Conquer Algorithms

Divide and conquer programs can achieve good performance on parallel computers and computers with deep memory hierarchies. We introduce architecture-cognizant divide and conquer algorithms, and explore how they can achieve even better performance. An architecture-cognizant algorithm has functionally-equivalent variants of the divide and/or combine functions, and a variant policy that specifies which variant to use at each level of recursion. An optimal variant policy is chosen for each target computer via experimentation. With $h$ levels of recursion, an exhaustive search requires $\Theta(v^h)$ experiments (where $v$ is the number of variants). We present a method based on dynamic programming that reduces this to $\Theta(v^c)$ experiments (where $c$ is typically a small constant) for a class of architecture-cognizant programs. We verify our technique on two kernels (matrix multiply and 2-D Point Jacobi) using three architectures. Our technique improves performance by up to a factor of two, compared to architecture-oblivious divide and conquer implementations. Further, our dynamic programming approach succeeds in selecting the optimal variant policy.
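To make the idea of a variant policy and the cost of searching for one concrete, here is a minimal Python sketch. It is not the paper's dynamic programming method, which the abstract does not spell out. Everything in it is an assumption made for illustration: the `COST` table is invented, `run_experiment` stands in for timing a real run, and the cost model is assumed to be additive, so each level's cost depends only on the variant chosen at that level. Under that assumption a simple level-by-level search finds the same policy as exhaustive search with far fewer experiments.

```python
from itertools import product

# Hypothetical cost table (made-up numbers): COST[level][variant] stands in
# for the time contributed by using that variant at that recursion level.
COST = [
    [5.0, 3.0, 4.0],  # level 0 (top of the recursion)
    [2.0, 2.5, 1.0],  # level 1
    [4.0, 1.5, 3.0],  # level 2
    [1.0, 2.0, 0.5],  # level 3 (leaves)
]
H = len(COST)      # h: number of recursion levels
V = len(COST[0])   # v: number of variants per level


def run_experiment(policy):
    """Stand-in for timing one full run under a variant policy.

    A policy is a sequence with one variant index per recursion level.
    """
    return sum(COST[level][variant] for level, variant in enumerate(policy))


def exhaustive_search():
    """Time every possible policy: v**h experiments."""
    experiments = 0
    best_policy, best_time = None, float("inf")
    for policy in product(range(V), repeat=H):
        experiments += 1
        elapsed = run_experiment(policy)
        if elapsed < best_time:
            best_policy, best_time = policy, elapsed
    return best_policy, best_time, experiments


def level_by_level_search():
    """Fix the best variant one level at a time, leaves first: v*h experiments.

    This is only guaranteed to find the optimum under the additive
    cost assumption stated above.
    """
    policy = [0] * H
    experiments = 0
    for level in reversed(range(H)):
        best_variant, best_time = 0, float("inf")
        for variant in range(V):
            policy[level] = variant
            experiments += 1
            elapsed = run_experiment(policy)
            if elapsed < best_time:
                best_variant, best_time = variant, elapsed
        policy[level] = best_variant
    return tuple(policy), run_experiment(policy), experiments


if __name__ == "__main__":
    print(exhaustive_search())
    print(level_by_level_search())
```

With this toy table, both searches select the policy `(1, 2, 1, 2)` with a total cost of 6.0. The exhaustive search uses 3^4 = 81 experiments and the level-by-level search uses 3 × 4 = 12. On real hardware the levels interact through the memory hierarchy, which is presumably why the paper restricts its method to a class of programs and reports $\Theta(v^c)$ experiments rather than a count linear in $v$.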

Larry Carter | Kang Su Gatlin
