A quantitative analysis of the speedup factors of FPGAs over processors

The speedup over a microprocessor that can be achieved by implementing some programs on an FPGA has been extensively reported. This paper presents a quantitative and qualitative analysis, at the architecture level, of the components of this speedup. The spatial parallelism that can be exploited on the FPGA is a major component, but by itself it does not account for the whole speedup. In this paper we experimentally analyze the remaining components. We compare the performance of image processing application programs executing in hardware on a Xilinx Virtex E2000 FPGA to their performance on three general-purpose processor platforms: MIPS, Pentium III and VLIW. The question we set out to answer is: what is the inherent advantage of a hardware implementation over a von Neumann platform? On the one hand, the clock frequency of general-purpose processors is about 20 times that of typical FPGA implementations. On the other hand, the iteration-level parallelism on the FPGA is one to two orders of magnitude greater than that on the CPUs. In addition to these two factors, we identify the efficiency advantage of FPGAs as an important factor and show that it ranges from 6 to 47 on our test benchmarks. We also identify some of the components of this factor: the streaming of data from memory, the overlap of control and data flow, and the elimination of some instructions on the FPGA. The results provide a deeper understanding of the tradeoff between system complexity and performance when designing configurable SoCs (CSoCs), as well as when designing software for CSoCs. They also help explain the one to two orders of magnitude in speedup of FPGAs over CPUs after accounting for clock frequencies.
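As a rough illustration of how the three factors named above might interact, the sketch below assumes they combine multiplicatively: the FPGA gains from iteration-level parallelism and from its efficiency advantage, and loses by the clock-frequency ratio. This multiplicative model, the function name `estimated_speedup`, and the sample inputs are assumptions made for illustration only. The abstract does not state this formula, and the sample inputs are not measurements from the paper.

```python
def estimated_speedup(clock_ratio, iteration_parallelism, efficiency_advantage):
    """Estimate FPGA-over-CPU speedup under an ASSUMED multiplicative model.

    clock_ratio: CPU clock frequency divided by FPGA clock frequency
                 (the abstract cites a value of about 20).
    iteration_parallelism: FPGA iteration-level parallelism relative to the CPU
                 (the abstract cites one to two orders of magnitude).
    efficiency_advantage: FPGA efficiency factor
                 (the abstract reports 6 to 47 on its benchmarks).
    """
    if clock_ratio <= 0:
        raise ValueError("clock_ratio must be positive")
    # Parallelism and efficiency favor the FPGA; the clock ratio favors the CPU.
    return iteration_parallelism * efficiency_advantage / clock_ratio


if __name__ == "__main__":
    # Hypothetical corner cases built from the ranges quoted in the abstract.
    low = estimated_speedup(clock_ratio=20, iteration_parallelism=10, efficiency_advantage=6)
    high = estimated_speedup(clock_ratio=20, iteration_parallelism=100, efficiency_advantage=47)
    print(f"low-end estimate:  {low:.1f}x")
    print(f"high-end estimate: {high:.1f}x")
```

Under these assumptions, the clock-frequency deficit cancels only part of the combined parallelism and efficiency gain, which is one way to see why spatial parallelism alone does not account for the whole speedup.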
