Comparing FPGA vs. Custom CMOS and the Impact on Processor Microarchitecture

As soft processors are increasingly used in diverse applications, there is a need to evolve their microarchitectures in a way that suits the FPGA implementation substrate. This paper compares the delay and area of a comprehensive set of processor building block circuits when implemented on custom CMOS and FPGA substrates. We then use the results of these comparisons to infer how the microarchitecture of soft processors on FPGAs should differ from that of hard processors on custom CMOS. We find that the ratios of the area required by an FPGA to that of custom CMOS vary significantly more across building blocks than the speed ratios do. As area is often a key design constraint in FPGA circuits, area ratios have the most impact on microarchitecture choices. Complete processor cores have area ratios of 17-27x and delay ratios of 18-26x. Building blocks that have dedicated hardware support on FPGAs, such as SRAMs, adders, and multipliers, are particularly area-efficient (2-7x area ratio), while multiplexers and CAMs are particularly area-inefficient (>100x area ratio). This leads to cheaper ALUs, larger caches of low associativity, and more expensive bypass networks than on similar hard processors. We also find that a low delay ratio for pipeline latches (12-19x) suggests soft processors should have pipeline depths 20% greater than hard processors of similar complexity.
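The area comparison above can be restated as a small worked example. The sketch below is illustrative only: it takes the area ratios quoted in the abstract and labels each building block as cheaper or more expensive on an FPGA relative to the whole-core range. The function name `relative_cost`, the dictionary layout, and the three labels are assumptions made for this example and do not come from the paper.

```python
# FPGA-to-custom-CMOS area ratios quoted in the abstract, as (low, high) ranges.
# A high value of None means the abstract gives only a lower bound (>100x).
CORE_AREA = (17, 27)

AREA_RATIOS = {
    "SRAMs, adders, multipliers": (2, 7),
    "multiplexers, CAMs": (100, None),
}


def relative_cost(block, core=CORE_AREA):
    """Label a block's area ratio relative to the whole-core range.

    Hypothetical helper: "cheaper" if the block's range lies entirely below
    the core range, "more expensive" if entirely above, else "comparable".
    """
    low, high = block
    if high is not None and high < core[0]:
        return "cheaper"
    if low > core[1]:
        return "more expensive"
    return "comparable"


if __name__ == "__main__":
    for name, ratio in AREA_RATIOS.items():
        print(f"{name}: {relative_cost(ratio)} on FPGA relative to a full core")
```

Blocks labeled cheaper are the ones the abstract suggests using more freely in a soft processor (ALUs, larger low-associativity caches), while blocks labeled more expensive (multiplexer-heavy bypass networks, CAM-based structures) are the ones to economize on.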
