Warp Processors

We describe a new processing architecture, the warp processor, that uses a field-programmable gate array (FPGA) to improve the speed and energy consumption of a software binary executing on a microprocessor. Previous approaches that improve software with an FPGA rely on a special compiler; a warp processor instead achieves these improvements completely transparently, operating from a standard binary. A warp processor dynamically detects the binary's critical regions, reimplements those regions as a custom hardware circuit in the FPGA, and replaces each software region with a call to its new hardware implementation. Not all benchmarks can be improved by warp processing, but many can, and the improvements are dramatically better than those achievable through more traditional architecture improvements. The hardest part of warp processing is dynamically reimplementing code regions on an FPGA, which requires partitioning, decompilation, synthesis, placement, and routing tools that all execute with minimal computation time and data memory so that they can coexist on chip with the main processor. We describe the results of developing our warp processor. We developed a custom FPGA fabric specifically designed to enable lean place-and-route tools, and we developed extremely fast and efficient versions of partitioning, decompilation, synthesis, technology mapping, placement, and routing. Warp processors achieve overall application speedups of 6.3X with energy savings of 66% across a set of embedded benchmark applications. We further show that our tools use acceptably small amounts of computation and memory, far less than traditional tools. Our work illustrates the feasibility and potential of warp processing, and we can foresee the possibility of warp processing becoming a feature in a variety of computing domains, including desktop, server, and embedded applications.
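The first step described above, dynamically detecting a binary's critical regions, can be illustrated with a small software sketch. This is not the on-chip profiler used in the work; it is a minimal illustration that assumes critical regions are loops, that a loop can be recognized by a taken branch to a lower or equal address, and that a trace of taken branches is available. The function name `find_critical_regions`, the trace format, and the addresses are all hypothetical.

```python
from collections import Counter


def find_critical_regions(branch_trace, top_n=1):
    """Return the most frequently taken backward branches.

    branch_trace: iterable of (source_address, target_address) pairs,
    one per taken branch (a hypothetical trace format).
    Result: list of (loop_start, loop_end, iteration_count) tuples,
    most frequent first.
    """
    counts = Counter()
    for source, target in branch_trace:
        # Assumption: a taken branch to a lower-or-equal address
        # closes one iteration of a loop spanning [target, source].
        if target <= source:
            counts[(target, source)] += 1
    return [(start, end, n) for (start, end), n in counts.most_common(top_n)]


# Hypothetical trace: one hot loop, one cold loop, and a forward branch.
trace = (
    [(0x120, 0x100)] * 1000   # hot loop body at 0x100..0x120
    + [(0x240, 0x200)] * 10   # cold loop body at 0x200..0x240
    + [(0x130, 0x300)] * 5    # forward branch, not a loop
)
hot_regions = find_critical_regions(trace, top_n=2)
```

In this sketch the region with the highest count would be the first candidate for reimplementation in hardware. A real on-chip detector would need to work non-intrusively and with bounded memory, which an unbounded counter table like this one does not attempt.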
