A High-Speed, Energy-Efficient Two-Cycle Multiply-Accumulate (MAC) Architecture and Its Application to a Double-Throughput MAC Unit

We propose a high-speed and energy-efficient two-cycle multiply-accumulate (MAC) architecture that supports two's complement numbers and includes accumulation guard bits and saturation circuitry. The first MAC pipeline stage contains only the partial-product generation circuitry and a reduction tree, while the second stage, thanks to a special sign-extension solution, implements all other functionality. Place-and-route evaluations using a 65-nm 1.1-V cell library show that the proposed architecture offers a 31% improvement in speed and a 32% reduction in energy per operation, averaged across operand sizes of 16, 32, 48, and 64 bits, over a reference two-cycle MAC architecture that employs a multiplier in the first stage and an accumulator in the second. When the proposed architecture is operated at the lower frequency of the reference architecture, the available timing slack can be used to downsize gates, resulting in a 52% reduction in energy compared to the reference. We extend the new architecture to create a versatile double-throughput MAC (DTMAC) unit that efficiently performs either multiply-accumulate or multiply operations for N-bit, 1 × N/2-bit, or 2 × N/2-bit operands. Compared to a fixed-function 32-bit MAC unit, a 32-bit DTMAC unit executes 16-bit multiply-accumulate operations with 67% higher energy efficiency.
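To make the roles of the guard bits and the saturation circuitry concrete, the following is a minimal behavioral sketch in Python, not the proposed two-stage hardware. The class name MacModel, the helper names, and the default of 8 guard bits are assumptions chosen for illustration; the abstract does not state how many guard bits are used. The accumulator is modeled as 2N plus guard bits wide, so intermediate sums may exceed the 2N-bit range without wrapping, and the result is clamped only when it is read out.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of `value` as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def saturate(value, bits):
    """Clamp `value` to the range of a `bits`-wide two's complement number."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


class MacModel:
    """Behavioral model of an N-bit MAC (hypothetical, for illustration only)."""

    def __init__(self, n=16, guard_bits=8):
        self.n = n
        # Accumulator width: full 2N-bit product plus guard bits (assumed value).
        self.acc_bits = 2 * n + guard_bits
        self.acc = 0

    def mac(self, a, b):
        """Accumulate a * b, with a and b given as N-bit two's complement words."""
        product = to_signed(a, self.n) * to_signed(b, self.n)
        # The accumulator itself wraps at its own width, as a register would.
        self.acc = to_signed(self.acc + product, self.acc_bits)
        return self.acc

    def result(self):
        """Read out the accumulator, saturated to the 2N-bit result range."""
        return saturate(self.acc, 2 * self.n)


if __name__ == "__main__":
    m = MacModel(n=16, guard_bits=8)
    m.mac(0x8000, 0x8000)  # (-32768) * (-32768) = 2**30
    m.mac(0x8000, 0x8000)  # accumulator now holds 2**31, kept by the guard bits
    print(m.acc, m.result())
```

In this sketch the guard bits let the accumulator hold 2**31 exactly, while result() clamps it to the largest 32-bit two's complement value.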
