Automatic vectorization of tree traversals

Repeated tree traversals are ubiquitous in domains such as scientific simulation, data mining, and graphics. Modern commodity processors support SIMD instructions, and using these instructions to process multiple traversals at once has the potential to provide substantial performance improvements. Unfortunately, these algorithms often feature highly divergent traversals, which inhibit efficient SIMD utilization to the point that other, less profitable sources of vectorization must be exploited instead. Previous work has proposed traversal splicing, a locality transformation for tree traversals that dynamically reorders traversals according to their previous behavior, on the insight that traversals which have behaved similarly so far are likely to behave similarly in the future. In this work, we cast this dynamic reordering as a scheduling transformation for efficient SIMD execution, and show that it can dramatically improve the SIMD utilization of divergent traversals, bringing it close to ideal. For five irregular tree traversal algorithms, our techniques deliver an average speedup of 2.78x over baseline implementations. Furthermore, our techniques can effectively SIMDize algorithms that prior manual vectorization attempts could not.
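
The effect of reordering on SIMD utilization can be illustrated with a small model. The sketch below is not the paper's implementation: it assumes each traversal is fully described by a string of 'L'/'R' branch decisions, that consecutive traversals are packed into groups of a fixed SIMD width, and that a group issues one full-width operation for every distinct node any of its members visits. The names simd_utilization and reorder_by_prefix are hypothetical, and the reordering here is a one-time sort on the first few decisions rather than the dynamic reordering that traversal splicing performs.

```python
from collections import defaultdict

def simd_utilization(paths, width):
    """Fraction of issued SIMD lanes that do useful work.

    Assumption: consecutive traversals are packed into groups of `width`,
    and a group issues one full-width operation per distinct node that
    any member visits. A node is identified by its path prefix.
    """
    useful = issued = 0
    for start in range(0, len(paths), width):
        group = paths[start:start + width]
        for level in range(1, len(group[0]) + 1):
            lanes_per_node = defaultdict(int)
            for path in group:
                lanes_per_node[path[:level]] += 1
            for active in lanes_per_node.values():
                useful += active   # lanes that need this node
                issued += width    # lanes the vector operation occupies
    return useful / issued

def reorder_by_prefix(paths, k):
    """Group traversals that made the same first k decisions (stable sort)."""
    return sorted(paths, key=lambda p: p[:k])

# Eight traversals whose input order interleaves two very different paths.
paths = ["LLL", "RRR", "LLR", "RRL", "LLL", "RRR", "LLR", "RRL"]

before = simd_utilization(paths, width=4)
after = simd_utilization(reorder_by_prefix(paths, k=1), width=4)
print(before, after)
```

In this toy input, sorting on only the first decision is enough to place traversals that continue to agree into the same group, which is the intuition behind scheduling by past behavior. Real traversals truncate at different depths and visit data-dependent subsets of the tree, so the model above only indicates the direction of the effect.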
