Paper information - The cache-oblivious Gaussian elimination paradigm: theoretical framework, parallelization and experimental evaluation - 字舞流文

The cache-oblivious Gaussian elimination paradigm: theoretical framework, parallelization and experimental evaluation

The Gaussian Elimination Paradigm (GEP) was introduced by the authors in [6] to represent the triply-nested loop computation that occurs in several important algorithms including Gaussian elimination without pivoting and Floyd-Warshall's all-pairs shortest paths algorithm. An efficient cache-oblivious algorithm for these instances of GEP was presented in [6]. In this paper we establish several important properties of this cache-oblivious framework, and extend the framework to solve GEP in its full generality within the same time and I/O bounds. We then analyze a parallel implementation of the framework and its caching performance for both shared and distributed caches. We present extensive experimental results for both in-core and out-of-core performance of our algorithms. We consider both sequential and parallel implementations of our algorithms, and compare them with finely-tuned cache-aware BLAS code for matrix multiplication and Gaussian elimination without pivoting. Our results indicate that cache-oblivious GEP offers an attractive tradeoff between efficiency and portability.
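The triply-nested loop that the abstract refers to can be sketched as follows. This is a minimal illustration of the plain iterative form only, not the cache-oblivious recursive algorithm the paper analyzes. The names `gep`, `f`, and `in_sigma` are hypothetical and chosen for this sketch: `f` is the per-cell update function and `in_sigma` selects which index triples `(i, j, k)` are updated.

```python
from typing import Callable, List

Matrix = List[List[float]]


def gep(c: Matrix,
        f: Callable[[float, float, float, float], float],
        in_sigma: Callable[[int, int, int], bool]) -> Matrix:
    """Iterative triply-nested GEP loop (illustrative sketch, updates c in place)."""
    n = len(c)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if in_sigma(i, j, k):
                    # Each update reads c[i][j], c[i][k], c[k][j] and c[k][k].
                    c[i][j] = f(c[i][j], c[i][k], c[k][j], c[k][k])
    return c


def floyd_warshall(dist: Matrix) -> Matrix:
    """All-pairs shortest paths as a GEP instance: every triple is updated."""
    return gep([row[:] for row in dist],
               lambda x, u, v, w: min(x, u + v),
               lambda i, j, k: True)


def gaussian_elimination(a: Matrix) -> Matrix:
    """Gaussian elimination without pivoting as a GEP instance.

    Only cells with i > k and j > k are updated, so entries below the
    diagonal keep their earlier values instead of being zeroed.
    Assumes every pivot c[k][k] that is used is nonzero.
    """
    return gep([row[:] for row in a],
               lambda x, u, v, w: x - u * v / w,
               lambda i, j, k: i > k and j > k)
```

Both instances share the same loop skeleton and differ only in the update function and the set of updated triples, which is what lets a single framework cover them.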

Vijaya Ramachandran | Rezaul Alam Chowdhury

[1] Guy E. Blelloch, et al. The data locality of work stealing, 2000, SPAA.

[2] Alexandru Nicolau, et al. R-Kleene: A High-Performance Divide-and-Conquer Algorithm for the All-Pair Shortest Path for Densely Connected Networks, 2007, Algorithmica.

[3] Guy E. Blelloch, et al. Provably good multicore cache performance for divide-and-conquer algorithms, 2008, SODA '08.

[4] R. Ladner, et al. Cache efficient simple dynamic programming, 2005.

[5] Matteo Frigo, et al. Cache-oblivious algorithms, 1999, 40th Annual Symposium on Foundations of Computer Science (Cat. No.99CB37039).

[6] Yuefan Deng, et al. New trends in high performance computing, 2001, Parallel Computing.

[7] Matteo Frigo, et al. The implementation of the Cilk-5 multithreaded language, 1998, PLDI.

[8] Richard E. Ladner, et al. Algorithms to Take Advantage of Hardware Prefetching, 2007, ALENEX.

[9] Matteo Frigo, et al. An analysis of dag-consistent distributed shared-memory algorithms, 1996, SPAA '96.

[10] Volker Strumpen, et al. The Cache Complexity of Multithreaded Cache Oblivious Algorithms, 2009, SPAA '06.

[11] Monica S. Lam, et al. A data locality optimizing algorithm, 1991, PLDI '91.

[12] Kenneth E. Iverson, et al. A programming language, 1962, AIEE-IRE '62 (Spring).

[13] Vijaya Ramachandran, et al. Cache-efficient dynamic programming algorithms for multicores, 2008, SPAA '08.

[14] Vijaya Ramachandran, et al. The cache-oblivious Gaussian elimination paradigm: theoretical framework and experimental evaluation, 2006, SPAA '06.

[15] Stephen Warshall, et al. A Theorem on Boolean Matrices, 1962, JACM.

[16] Viktor K. Prasanna, et al. Optimizing graph algorithms for improved cache performance, 2004, Proceedings 16th International Parallel and Distributed Processing Symposium.

[17] Vijaya Ramachandran, et al. Cache-oblivious dynamic programming, 2006, SODA '06.

[18] David S. Greenberg, et al. Beyond core: Making parallel computer I/O practical, 1993.

[19] Alfred V. Aho, et al. The Design and Analysis of Computer Algorithms, 1974.

[20] Guy E. Blelloch, et al. Effectively sharing a cache among threads, 2004, SPAA '04.

[21] Donald E. Knuth. Two notes on notation, 1992.

[22] Sivan Toledo. Locality of Reference in LU Decomposition with Partial Pivoting, 1997, SIAM J. Matrix Anal. Appl.

[23] Steven S. Muchnick, et al. Advanced Compiler Design and Implementation, 1997.

[24] Robert A. van de Geijn, et al. Supermatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures, 2007, SPAA '07.

[25] Josef Weidendorfer, et al. Valgrind 3.3 - Advanced Debugging and Profiling for Gnu/Linux Applications, 2008.

[26] Roman Dementiev, et al. STXXL: standard template library for XXL data sets, 2008.

[27] Mithuna Thottethodi, et al. Recursive array layouts and fast parallel matrix multiplication, 1999, SPAA '99.

[28] Robert A. van de Geijn, et al. FLAME: Formal Linear Algebra Methods Environment, 2001, TOMS.

[29] Keshav Pingali, et al. An experimental comparison of cache-oblivious and cache-conscious programs, 2007, SPAA '07.

[30] Alok Aggarwal, et al. The input/output complexity of sorting and related problems, 1988, CACM.

[31] Ronald L. Rivest, et al. Introduction to Algorithms, 1990.

[32] H. T. Kung, et al. I/O complexity: The red-blue pebble game, 1981, STOC '81.