Optimizing graph algorithms for improved cache performance

We develop algorithmic optimizations to improve the cache performance of four fundamental graph algorithms. First, we present a cache-oblivious implementation of the Floyd-Warshall algorithm for the all-pairs shortest paths problem, obtained by relaxing some dependencies in the iterative version. We show that this implementation achieves the lower bound on processor-memory traffic of $\Omega(N^3/\sqrt{C})$, where N and C are the problem size and cache size, respectively. Experimental results show that the cache-oblivious implementation improves real execution time by more than a factor of six over the iterative implementation with the usual row-major data layout, on three state-of-the-art architectures. Second, we address Dijkstra's algorithm for the single-source shortest paths problem and Prim's algorithm for the minimum spanning tree problem. For these algorithms, we demonstrate up to a twofold improvement in real execution time by using a simple cache-friendly graph representation, namely adjacency arrays. Finally, we address the matching algorithm for bipartite graphs. We show a two- to threefold improvement in real execution time by first running the algorithm on subproblems to generate a suboptimal solution and then solving the whole problem with that suboptimal solution as the starting point. Experimental results are shown for the Pentium III, UltraSPARC III, Alpha 21264, and MIPS R12000 machines.
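The recursive structure behind a cache-oblivious Floyd-Warshall can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the implementation evaluated in the paper: the function names, the index-triple parameterization, the restriction to a power-of-two vertex count, and the 1x1 base case are all choices made here for brevity. A tuned version would presumably stop the recursion at a tile that fits in cache and might use a different data layout.

```python
import math

INF = math.inf


def fw_iterative(dist):
    """Textbook triple-loop Floyd-Warshall; returns a new distance matrix."""
    n = len(dist)
    d = [row[:] for row in dist]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via = d[i][k] + d[k][j]
                if via < d[i][j]:
                    d[i][j] = via
    return d


def fw_recursive(dist):
    """Recursive Floyd-Warshall sketch (assumes len(dist) is a power of two).

    rec(i0, j0, k0, size) updates the block A = d[i0.., j0..] using
    B = d[i0.., k0..] and C = d[k0.., j0..], i.e. A = min(A, B + C).
    """
    n = len(dist)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes a power-of-two vertex count")
    d = [row[:] for row in dist]

    def rec(i0, j0, k0, size):
        if size == 1:
            # Base case: a single relaxation step.
            via = d[i0][k0] + d[k0][j0]
            if via < d[i0][j0]:
                d[i0][j0] = via
            return
        h = size // 2
        # First half of the intermediate vertices (k0 .. k0+h-1).
        rec(i0, j0, k0, h)
        rec(i0, j0 + h, k0, h)
        rec(i0 + h, j0, k0, h)
        rec(i0 + h, j0 + h, k0, h)
        # Second half (k0+h .. k0+size-1), visiting quadrants in reverse order.
        rec(i0 + h, j0 + h, k0 + h, h)
        rec(i0 + h, j0, k0 + h, h)
        rec(i0, j0 + h, k0 + h, h)
        rec(i0, j0, k0 + h, h)

    rec(0, 0, 0, n)
    return d


# Hypothetical 4-vertex example; INF marks a missing edge.
example = [
    [0, 3, INF, 7],
    [8, 0, 2, INF],
    [5, INF, 0, 1],
    [2, INF, INF, 0],
]

if __name__ == "__main__":
    assert fw_recursive(example) == fw_iterative(example)
    for row in fw_recursive(example):
        print(row)
```

The recursion touches the same (i, j, k) triples as the iterative loop but in a different order, which is where the relaxed dependencies come in. Each call works on three sub-blocks of half the size, so at some recursion depth the working set fits in cache regardless of the cache size.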
