Paper Information - A Study of Pointer-Chasing Performance on Shared-Memory Processor-FPGA Systems

A Study of Pointer-Chasing Performance on Shared-Memory Processor-FPGA Systems

The advent of FPGA acceleration platforms with direct coherent access to processor memory creates an opportunity to accelerate applications with irregular parallelism governed by large in-memory pointer-based data structures. This paper uses the simple reference behavior of a linked-list traversal as a proxy to study the performance potential of accelerating these applications on shared-memory processor-FPGA systems. The linked-list traversal is parameterized by node layout in memory, per-node data payload size, payload dependence, and traversal concurrency to capture the main performance effects of different pointer-based data structures and algorithms. The paper explores the trade-offs over a wide range of implementation options available on shared-memory processor-FPGA architectures, including the use of tightly coupled processor assistance. We report the key effects observed on currently available systems, including the Xilinx Zynq, the Intel QuickAssist QPI FPGA Platform, and the Convey HC-2. The key results show: (1) the FPGA fabric is least efficient when traversing a single list with non-sequential node layout and a small payload size; (2) processor assistance can help alleviate this shortcoming; and (3) when appropriate, a fabric-only approach that interleaves multiple linked-list traversals is an effective way to maximize traversal performance.
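
The traversal parameters named in the abstract can be illustrated with a small software model. The sketch below is not the paper's implementation and measures nothing about FPGA performance; it only shows what "node layout", "payload", and "interleaving multiple traversals" mean for a linked list stored in an array of slots. All names (`build_list`, `traverse`, `traverse_interleaved`) are hypothetical, and lists are assumed to have at least one node.

```python
import random


def build_list(num_nodes, sequential=True, seed=0):
    """Build one linked list over slots 0..num_nodes-1 (assumes num_nodes >= 1).

    Returns (head, next_idx), where next_idx[i] is the slot that follows
    slot i, or -1 at the tail. With sequential=False the visiting order is
    a random permutation, modelling a non-sequential node layout.
    """
    order = list(range(num_nodes))
    if not sequential:
        random.Random(seed).shuffle(order)
    next_idx = [-1] * num_nodes
    for cur, nxt in zip(order, order[1:]):
        next_idx[cur] = nxt
    return order[0], next_idx


def traverse(head, next_idx, payload):
    """Single traversal: every step needs the previous node's next pointer."""
    total, node = 0, head
    while node != -1:
        total += payload[node]   # stand-in for per-node payload work
        node = next_idx[node]    # dependent load: the pointer chase
    return total


def traverse_interleaved(lists):
    """Advance several independent lists round-robin, one node per turn.

    `lists` holds (head, next_idx, payload) triples. In hardware the
    independent chases could overlap their memory latencies; here the
    interleaving is only modelled as an access order.
    """
    cursors = [head for head, _, _ in lists]
    totals = [0] * len(lists)
    active = len(lists)
    while active:
        active = 0
        for i, (_, next_idx, payload) in enumerate(lists):
            node = cursors[i]
            if node == -1:
                continue
            totals[i] += payload[node]
            cursors[i] = next_idx[node]
            active += 1
    return totals
```

In this model, a single traversal of a shuffled list is the case the abstract identifies as least efficient for the fabric, because each access must wait for the previous one. Passing several lists to `traverse_interleaved` corresponds to the fabric-only interleaving strategy, where independent traversals supply other requests to issue while one is outstanding.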
