Optimal Cache-Oblivious Mesh Layouts

A mesh is a graph that divides physical space into regularly shaped regions. Mesh computations form the basis of many applications, including finite-element methods, image rendering, collision detection, and N-body simulations. In one important mesh primitive, called a mesh update, each mesh vertex stores a value and repeatedly updates this value based on the values stored in all neighboring vertices. The performance of a mesh update depends on the layout of the mesh in memory. Informally, if the mesh layout has good data locality (most edges connect a pair of nodes that are stored near each other in memory), then a mesh update runs quickly.

This paper shows how to find a memory layout that guarantees that the mesh update has asymptotically optimal memory performance for any set of memory parameters. Specifically, the cost of the mesh update is roughly the cost of a sequential memory scan. Such a memory layout is called cache-oblivious. Formally, for a $d$-dimensional mesh $G$, block size $B$, and cache size $M$ (where $M = \Omega(B^d)$), the mesh update of $G$ uses $O(1 + |G|/B)$ memory transfers. The paper also shows how the mesh-update performance degrades for smaller caches, where $M = o(B^d)$.

The paper then gives two algorithms for finding cache-oblivious mesh layouts. The first layout algorithm runs in time $O(|G| \log^2 |G|)$ both in expectation and with high probability on a RAM. It uses $O(1 + |G| \log^2(|G|/M)/B)$ memory transfers in expectation and $O(1 + (|G|/B)(\log^2(|G|/M) + \log |G|))$ memory transfers with high probability in the cache-oblivious and disk-access machine (DAM) models. The layout is obtained by finding a fully balanced decomposition tree of $G$ and then performing an in-order traversal of the leaves of the tree.

The second algorithm computes a cache-oblivious layout on a RAM in time $O(|G| \log |G| \log\log |G|)$ both in expectation and with high probability. In the DAM and cache-oblivious models, the second layout algorithm uses $O(1 + (|G|/B) \log(|G|/M) \min\{\log\log |G|, \log(|G|/M)\})$ memory transfers in expectation and $O(1 + (|G|/B)(\log(|G|/M) \min\{\log\log |G|, \log(|G|/M)\} + \log |G|))$ memory transfers with high probability. The algorithm is based on a new type of decomposition tree, here called a relax-balanced decomposition tree. Again, the layout is obtained by performing an in-order traversal of the leaves of the decomposition tree.
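The idea of reading a layout off the leaves of a decomposition tree can be illustrated with a small sketch. It is not the paper's algorithm: it assumes a square 2D grid mesh and uses plain geometric bisection along the longer side in place of the fully balanced or relax-balanced decomposition trees described above. The function names (`bisect_layout`, `cross_block_edges`) and the locality measure (the number of mesh edges whose endpoints fall in different memory blocks) are illustrative choices, not taken from the paper.

```python
from typing import List, Tuple

Vertex = Tuple[int, int]  # (row, column) of a grid-mesh vertex


def grid_vertices(n: int) -> List[Vertex]:
    """Vertices of an n-by-n grid mesh in row-major order."""
    return [(r, c) for r in range(n) for c in range(n)]


def bisect_layout(cells: List[Vertex]) -> List[Vertex]:
    """Leaves of a geometric bisection tree, read left to right.

    Assumption: each tree node splits its cells in half along the
    longer side of their bounding box. The paper instead builds its
    trees from mesh separators; this is only a stand-in.
    """
    if len(cells) <= 1:
        return list(cells)
    rows = [r for r, _ in cells]
    cols = [c for _, c in cells]
    axis = 0 if max(rows) - min(rows) >= max(cols) - min(cols) else 1
    ordered = sorted(cells, key=lambda v: (v[axis], v[1 - axis]))
    mid = len(ordered) // 2
    # Concatenating the two recursive results is the in-order
    # traversal of the leaves of the implicit decomposition tree.
    return bisect_layout(ordered[:mid]) + bisect_layout(ordered[mid:])


def cross_block_edges(order: List[Vertex], block: int) -> int:
    """Count grid edges whose endpoints lie in different blocks."""
    pos = {v: i for i, v in enumerate(order)}
    count = 0
    for (r, c) in order:
        for nb in ((r + 1, c), (r, c + 1)):  # each edge counted once
            if nb in pos and pos[(r, c)] // block != pos[nb] // block:
                count += 1
    return count


if __name__ == "__main__":
    n, block = 16, 16
    row_major = grid_vertices(n)
    bisected = bisect_layout(row_major)
    print("row-major cross-block edges:", cross_block_edges(row_major, block))
    print("bisection cross-block edges:", cross_block_edges(bisected, block))
```

On this grid the bisection layout groups each block into a compact tile, so fewer edges leave a block than in the row-major layout, where every vertical edge does. This only hints at the locality property; the memory-transfer bounds stated in the abstract depend on the paper's own tree constructions.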
