Automatically enhancing locality for tree traversals with traversal splicing

Generally applicable techniques for improving temporal locality in irregular programs, which operate over pointer-based data structures such as trees and graphs, are scarce. Focusing on a subset of irregular programs, namely tree traversal algorithms such as Barnes-Hut and nearest neighbor, previous work has proposed point blocking, a technique analogous to loop tiling in regular programs, to improve locality. However, point blocking is highly dependent on point sorting, a technique that reorders points so that consecutive points have similar traversals. Performing this a priori sort requires an understanding of the semantics of the algorithm, and hence requires highly application-specific techniques. In this work, we propose traversal splicing, a new, general, automatic locality optimization for irregular tree traversal codes that is less sensitive to point order and hence can deliver substantially better performance, even in the absence of semantic information. For six benchmark algorithms, we show that traversal splicing can deliver single-thread speedups of up to 9.147 (geometric mean: 3.095) over baseline implementations, and up to 4.752 (geometric mean: 2.079) over point-blocked implementations. Further, we show that in many cases, automatically applying traversal splicing to a baseline implementation yields better performance than carefully hand-optimized implementations.
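To make the loop-tiling analogy concrete, the following is a minimal sketch of point blocking only (the prior technique the abstract refers to), not of traversal splicing. It assumes a toy 1-D range-count query over a binary tree; the names `Node`, `build`, `can_skip`, `baseline`, and `blocked` are hypothetical and chosen for illustration. The baseline walks the tree once per point, while the blocked version walks it once per block of points, so each visited node is reused by every point in the block that still needs it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    lo: float                        # smallest value stored under this node
    hi: float                        # largest value stored under this node
    value: Optional[float] = None    # set only on leaves
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build(values: List[float]) -> Node:
    """Build a balanced binary tree over a non-empty list of values."""
    values = sorted(values)
    if len(values) == 1:
        return Node(values[0], values[0], value=values[0])
    mid = len(values) // 2
    return Node(values[0], values[-1],
                left=build(values[:mid]), right=build(values[mid:]))


def can_skip(node: Node, p: float, r: float) -> bool:
    """Truncation test: nothing under `node` lies within distance r of p."""
    return p + r < node.lo or p - r > node.hi


def baseline(root: Node, points: List[float], r: float) -> List[int]:
    """One full traversal per point (point loop outside, tree recursion inside)."""
    counts = [0] * len(points)

    def recurse(node: Node, i: int) -> None:
        if can_skip(node, points[i], r):
            return
        if node.value is not None:
            counts[i] += 1
            return
        recurse(node.left, i)
        recurse(node.right, i)

    for i in range(len(points)):
        recurse(root, i)
    return counts


def blocked(root: Node, points: List[float], r: float,
            block_size: int = 4) -> List[int]:
    """One traversal per block: each node is visited by all live points at once."""
    counts = [0] * len(points)

    def recurse(node: Node, block: List[int]) -> None:
        # Keep only the points whose traversal is not truncated at this node.
        live = [i for i in block if not can_skip(node, points[i], r)]
        if not live:
            return
        if node.value is not None:
            for i in live:
                counts[i] += 1
            return
        recurse(node.left, live)
        recurse(node.right, live)

    for start in range(0, len(points), block_size):
        stop = min(start + block_size, len(points))
        recurse(root, list(range(start, stop)))
    return counts
```

In this sketch, a block whose points follow very different paths shrinks quickly as the traversal descends, which illustrates why the benefit of point blocking depends on consecutive points having similar traversals.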
