SIMD parallelization of applications that traverse irregular data structures

Fine-grained data parallelism is increasingly common in mainstream processors in the form of longer vectors and on-chip GPUs. This paper develops support for exploiting such data parallelism for a class of non-numeric, non-graphics applications that perform computations while traversing many independent, irregular data structures. While the traversal of any one irregular data structure does not offer an opportunity for parallelization, traversing a set of them does. However, mapping such parallelism to SIMD units is nontrivial and has not been addressed in prior work. We address this problem by developing an intermediate language for specifying such traversals, together with a run-time scheduler that maps traversals to SIMD units. A key idea in our run-time scheme is converting branches to arithmetic operations, which then allows us to use SIMD hardware. To make our approach fast, we demonstrate several optimizations, including a stream compaction method that aids with control flow in SIMD, a set of layouts that reduce memory latency, and a tiling approach that enables more effective prefetching. Using our approach, we demonstrate significant increases in single-core performance over optimized baselines for two applications.
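
To illustrate the idea of converting branches to arithmetic, here is a minimal sketch in Python. It is not the paper's intermediate language or scheduler. It assumes complete binary decision trees stored in an implicit array layout (node i has children 2*i + 1 and 2*i + 2), and every name in it (`traverse_branchy`, `traverse_lockstep`, `feature`, `threshold`) is hypothetical. The list comprehension stands in for SIMD lanes: each independent tree occupies one lane, and all lanes execute the same arithmetic step with no data-dependent branch.

```python
def traverse_branchy(feature, threshold, depth, x):
    """Reference traversal of one tree using an ordinary if/else branch."""
    i = 0
    for _ in range(depth):
        if x[feature[i]] > threshold[i]:
            i = 2 * i + 2  # go right
        else:
            i = 2 * i + 1  # go left
    return i


def traverse_lockstep(trees, depth, x):
    """Traverse many independent trees in lock step, one 'lane' per tree.

    The branch is replaced by arithmetic: the comparison result (0 or 1)
    is added to the left-child index, so every lane runs the same
    instruction sequence. A real SIMD version would use vector compare
    and gather instructions; that mapping is assumed, not shown.
    """
    idx = [0] * len(trees)
    for _ in range(depth):
        idx = [
            2 * i + 1 + int(x[feat[i]] > thr[i])
            for i, (feat, thr) in zip(idx, trees)
        ]
    return idx


# Two depth-2 trees, each given as (feature indices, thresholds)
# for its three internal nodes. The values are made up for illustration.
trees = [
    ([0, 1, 1], [0.5, 0.2, 0.8]),
    ([1, 0, 0], [0.3, 0.1, 0.9]),
]
x = [0.7, 0.25]
leaves = traverse_lockstep(trees, 2, x)
```

Here `leaves` holds the leaf index reached in each tree, and it matches what `traverse_branchy` returns for each tree individually. The sketch sidesteps trees of unequal depth, where some lanes finish early; that is the situation in which a technique such as stream compaction becomes relevant.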
