An Evaluation of Selective Depipelining for FPGA-Based Energy-Reducing Irregular Code Coprocessors

As the complexity of FPGA-based systems scales, handling irregular code efficiently becomes increasingly important. Recent work has proposed Irregular Code Energy Reducers (ICERs), a high-level synthesis approach for FPGAs that offers significant energy reduction for irregular code compared to a soft core processor. ICERs target the hot spots of programs and are seamlessly connected, via a shared L1 cache, to a soft core processor that executes the cold code. This paper evaluates the application of the selective depipelining (SDP) technique to ICERs, which greatly reduces both the execution time and the energy of irregular computations. SDP enables irregular computations to be expressed as large, fast, low-power combinational blocks. It maintains high memory bandwidth by scheduling the many potentially dependent memory operations within these blocks onto a high-frequency, highly multiplexed coherent memory, while scheduling combinational operations at a much lower frequency. SDP is thus a key enabler for improving the execution properties of irregular computations that are difficult to parallelize. We show that applying SDP to ICERs reduces the energy-delay product by 2.62× relative to ICERs without SDP. Across a variety of irregular applications, ICERs with SDP are up to 2.38× faster than a soft core processor and reduce energy consumption by up to 15.83×.
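The energy-delay figure quoted above can be read as a ratio of energy-delay products, that is, energy multiplied by execution time. The following is a minimal sketch of that arithmetic, assuming this standard definition of the metric; the function names and the sample numbers are hypothetical and are not measurements from the paper. Note that the "up to" speedup and energy figures are per-application maxima, so they should not simply be multiplied together to obtain an aggregate energy-delay result.

```python
def energy_delay_product(energy_joules: float, delay_seconds: float) -> float:
    """Energy-delay product (EDP): energy consumed times execution time."""
    return energy_joules * delay_seconds


def edp_improvement(base_energy: float, base_delay: float,
                    new_energy: float, new_delay: float) -> float:
    """Factor by which EDP shrinks going from the baseline to the new design.

    A value of 2.0 means the new design's EDP is half the baseline's.
    """
    base = energy_delay_product(base_energy, base_delay)
    new = energy_delay_product(new_energy, new_delay)
    return base / new


# Hypothetical illustration only (not data from the paper): a design that
# halves both energy and execution time improves EDP by a factor of 4.
example = edp_improvement(base_energy=2.0, base_delay=1.0,
                          new_energy=1.0, new_delay=0.5)
print(example)
```

Because the metric is a product, a technique that lowers energy while also shortening execution time is rewarded twice, which is why it is a common way to summarize designs that trade clock frequency against power.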
