Traversal Caches: A Framework for FPGA Acceleration of Pointer Data Structures

Field-programmable gate arrays (FPGAs) and other reconfigurable computing (RC) devices have been widely shown to offer numerous advantages over microprocessors, including order-of-magnitude performance and power improvements for some applications. Unfortunately, FPGA usage has largely been limited to applications exhibiting sequential memory access patterns, which prevents acceleration of important applications with irregular access patterns (e.g., pointer-based data structures). In this paper, we present a design pattern for RC application development that serializes irregular data structure traversals online into a traversal cache, allowing the corresponding data to be streamed efficiently to the FPGA. We present a generalized framework for applications with repeated traversals and show that it achieves speedups between 7x and 29x over pointer-based software. For applications whose traversals are not strictly repeated but are highly similar, we present application-specialized extensions that exploit this similarity to improve memory bandwidth and to execute multiple traversals in parallel. We show that these extensions achieve speedups between 11x and 70x on a Virtex4 LX100 for Barnes-Hut n-body simulation.
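The core idea of serializing a pointer traversal once and then streaming it on repeated traversals can be illustrated with a small software analogy. The sketch below is not the paper's implementation: the names `Node`, `TraversalCache`, `stream`, `invalidate`, `kernel`, and `build_list` are hypothetical, a singly linked list stands in for the pointer-based structure, and an ordinary Python function stands in for the FPGA kernel that would consume the stream.

```python
from array import array


class Node:
    """One element of a pointer-based (linked) structure."""

    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node


class TraversalCache:
    """Hypothetical software model of a traversal cache.

    The first traversal from a given root chases pointers and records the
    visited values in a contiguous buffer. Later traversals from the same
    root reuse that buffer, so the data can be read sequentially.
    """

    def __init__(self):
        self._buffers = {}       # id(root) -> contiguous array of values
        self.pointer_walks = 0   # how many times pointers were actually chased

    def stream(self, root):
        key = id(root)           # assumption: root stays alive while cached
        buf = self._buffers.get(key)
        if buf is None:
            # Cache miss: serialize the irregular traversal into a buffer.
            buf = array("d")
            node = root
            while node is not None:
                buf.append(node.value)
                node = node.next
            self._buffers[key] = buf
            self.pointer_walks += 1
        return buf

    def invalidate(self, root):
        """Drop the cached traversal, e.g. after the structure is modified."""
        self._buffers.pop(id(root), None)


def kernel(stream):
    """Stand-in for the accelerator: consumes a sequential stream of values."""
    return sum(x * x for x in stream)


def build_list(values):
    """Build a singly linked list holding the given values in order."""
    head = None
    for v in reversed(values):
        head = Node(v, head)
    return head
```

In this model, calling `kernel(cache.stream(head))` repeatedly walks the pointers only on the first call; later calls read the contiguous buffer, which is the access pattern that suits streaming. The cache must be invalidated whenever the underlying structure changes, since the buffer would otherwise be stale.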
