The cache-oblivious Gaussian elimination paradigm: theoretical framework, parallelization and experimental evaluation

The Gaussian Elimination Paradigm (GEP) was introduced by the authors in [6] to represent the triply-nested loop computation that occurs in several important algorithms including Gaussian elimination without pivoting and Floyd-Warshall's all-pairs shortest paths algorithm. An efficient cache-oblivious algorithm for these instances of GEP was presented in [6]. In this paper we establish several important properties of this cache-oblivious framework, and extend the framework to solve GEP in its full generality within the same time and I/O bounds. We then analyze a parallel implementation of the framework and its caching performance for both shared and distributed caches. We present extensive experimental results for both in-core and out-of-core performance of our algorithms. We consider both sequential and parallel implementations of our algorithms, and compare them with finely-tuned cache-aware BLAS code for matrix multiplication and Gaussian elimination without pivoting. Our results indicate that cache-oblivious GEP offers an attractive tradeoff between efficiency and portability.
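The triply-nested loop computation that GEP represents can be sketched as follows. This is a minimal, cache-unaware illustration of the loop form only, not the cache-oblivious algorithm of the paper. The names `gep`, `f`, and `in_sigma`, and the two wrapper functions, are assumptions made for this example: `f` is the per-entry update function and `in_sigma` selects which index triples are updated.

```python
import math
from typing import Callable, List

Matrix = List[List[float]]


def gep(c: Matrix,
        f: Callable[[float, float, float, float], float],
        in_sigma: Callable[[int, int, int], bool]) -> Matrix:
    """Run the generic triply nested GEP-style loop in place on a square matrix.

    For every (i, j, k) accepted by in_sigma (hypothetical name), the entry
    c[i][j] is replaced by f(c[i][j], c[i][k], c[k][j], c[k][k]).
    """
    n = len(c)
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if in_sigma(i, j, k):
                    c[i][j] = f(c[i][j], c[i][k], c[k][j], c[k][k])
    return c


def floyd_warshall(dist: Matrix) -> Matrix:
    """All-pairs shortest paths as an instance: update every triple with min-plus."""
    return gep([row[:] for row in dist],
               lambda x, u, v, w: min(x, u + v),
               lambda i, j, k: True)


def eliminate_no_pivot(a: Matrix) -> Matrix:
    """Gaussian elimination without pivoting as an instance.

    Only entries below and to the right of the pivot (i > k, j > k) are
    updated, so the upper triangle of the result holds the factor U.
    Assumes every pivot c[k][k] is nonzero.
    """
    return gep([row[:] for row in a],
               lambda x, u, v, w: x - (u / w) * v,
               lambda i, j, k: i > k and j > k)
```

Both instances share the same loop skeleton and differ only in the update function and the set of updated triples, which is what allows a single framework to cover them.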

[1]  Guy E. Blelloch, et al. The data locality of work stealing, 2000, SPAA.

[2]  Alexandru Nicolau, et al. R-Kleene: A High-Performance Divide-and-Conquer Algorithm for the All-Pair Shortest Path for Densely Connected Networks, 2007, Algorithmica.

[3]  Guy E. Blelloch, et al. Provably good multicore cache performance for divide-and-conquer algorithms, 2008, SODA '08.

[4]  R. Ladner, et al. Cache efficient simple dynamic programming, 2005.

[5]  Matteo Frigo, et al. Cache-oblivious algorithms, 1999, 40th Annual Symposium on Foundations of Computer Science.

[6]  Yuefan Deng, et al. New trends in high performance computing, 2001, Parallel Computing.

[7]  Matteo Frigo, et al. The implementation of the Cilk-5 multithreaded language, 1998, PLDI.

[8]  Richard E. Ladner, et al. Algorithms to Take Advantage of Hardware Prefetching, 2007, ALENEX.

[9]  Matteo Frigo, et al. An analysis of dag-consistent distributed shared-memory algorithms, 1996, SPAA '96.

[10]  Volker Strumpen, et al. The Cache Complexity of Multithreaded Cache Oblivious Algorithms, 2006, SPAA '06.

[11]  Monica S. Lam, et al. A data locality optimizing algorithm, 1991, PLDI '91.

[12]  Kenneth E. Iverson, et al. A programming language, 1962, AIEE-IRE '62 (Spring).

[13]  Vijaya Ramachandran, et al. Cache-efficient dynamic programming algorithms for multicores, 2008, SPAA '08.

[14]  Vijaya Ramachandran, et al. The cache-oblivious Gaussian elimination paradigm: theoretical framework and experimental evaluation, 2006, SPAA '06.

[15]  Stephen Warshall, et al. A Theorem on Boolean Matrices, 1962, JACM.

[16]  Viktor K. Prasanna, et al. Optimizing graph algorithms for improved cache performance, 2004, Proceedings 16th International Parallel and Distributed Processing Symposium.

[17]  Vijaya Ramachandran, et al. Cache-oblivious dynamic programming, 2006, SODA '06.

[18]  David S. Greenberg, et al. Beyond core: Making parallel computer I/O practical, 1993.

[19]  Alfred V. Aho, et al. The Design and Analysis of Computer Algorithms, 1974.

[20]  Guy E. Blelloch, et al. Effectively sharing a cache among threads, 2004, SPAA '04.

[21]  Donald E. Knuth. Two notes on notation, 1992.

[22]  Sivan Toledo. Locality of Reference in LU Decomposition with Partial Pivoting, 1997, SIAM J. Matrix Anal. Appl.

[23]  Steven S. Muchnick, et al. Advanced Compiler Design and Implementation, 1997.

[24]  Robert A. van de Geijn, et al. Supermatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures, 2007, SPAA '07.

[25]  Josef Weidendorfer, et al. Valgrind 3.3 - Advanced Debugging and Profiling for Gnu/Linux Applications, 2008.

[26]  Roman Dementiev, et al. STXXL: standard template library for XXL data sets, 2008.

[27]  Mithuna Thottethodi, et al. Recursive array layouts and fast parallel matrix multiplication, 1999, SPAA '99.

[28]  Robert A. van de Geijn, et al. FLAME: Formal Linear Algebra Methods Environment, 2001, TOMS.

[29]  Keshav Pingali, et al. An experimental comparison of cache-oblivious and cache-conscious programs, 2007, SPAA '07.

[30]  Alok Aggarwal, et al. The input/output complexity of sorting and related problems, 1988, CACM.

[31]  Ronald L. Rivest, et al. Introduction to Algorithms, 1990.

[32]  H. T. Kung, et al. I/O complexity: The red-blue pebble game, 1981, STOC '81.