Soft vector processors with streaming pipelines

Soft vector processors (SVPs) achieve significant performance gains through the use of parallel ALUs. However, because the ALUs are used in a time-multiplexed fashion, this approach does not exploit a key strength of FPGAs: pipeline parallelism. This paper shows how streaming pipelines can be integrated into the datapath of an SVP to achieve dramatic speedups. The SVP plays an important role in supplying the pipeline with high-bandwidth input data and storing its results using on-chip memory. The SVP also performs the housekeeping tasks necessary to keep the pipeline busy. In particular, it orchestrates data movement between on-chip memory and external DRAM, it pre- or post-processes the data using its own ALUs, and it controls the overall sequence of execution. Since the SVP is programmed in C, these tasks are easier to develop and debug than with a traditional HDL approach. Using the N-body problem as a case study, this paper illustrates how custom streaming pipelines are integrated into the SVP datapath and presents multiple techniques for generating them. Using a custom pipeline, we demonstrate speedups of over 7,000 times and performance-per-ALM over 100 times better than Nios II/f. The custom pipeline is also 50 times faster than a naive implementation on an Intel Core i7 processor.
