Algorithm and architecture for a high density, low power scalar product macrocell

The authors present a design approach for an arithmetic macrocell that computes the scalar product of two vectors, an operation that is ubiquitous in communications and digital signal processing problems. The core of the proposed architecture is a fully combinational design containing a partial product generator, a partial product accumulator and a vector accumulator. The design addresses the competing optimisation goals of VLSI area, power dissipation and latency in the deep submicron regime. Compared with conventional merged arithmetic architectures, the proposed macrocell offers a substantially improved VLSI layout, with little area wastage, a high degree of regularity and good scalability across different vector lengths and operand widths. A theoretical analysis shows that a 16-bit scalar product multiplier for input vectors with 16 elements, compared with a traditionally designed architecture, achieves a saving of 38.6% in silicon area, an increase of up to 73% in area usage efficiency and a saving of 29.4% in interconnect delay. Post-layout simulations of the proposed circuit, based on a 0.18 μm CMOS process, show an average power dissipation of 64.96 mW and a latency of 6.92 ns at a standard supply voltage of 1.8 V, a superior performance for a single-cycle instruction in a high-speed, low-voltage 16-bit digital signal processor operating at 144 MHz. The use of shorter interconnects and more equalised interconnect delays substantially reduces the power dissipation and delay incurred by the interconnects. Post-layout simulation of the proposed circuit at supply voltages ranging from 0.7 to 3.3 V shows a significant power reduction of 6 to 13% over the pre-layout simulation results of the conventional design.
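As a rough behavioural illustration of the three stages named above (partial product generation, partial product accumulation and vector accumulation), the following Python sketch computes a scalar product by pooling the partial-product bits of all element pairs by bit weight before a single final summation. It is not the authors' circuit: the function names, the restriction to unsigned operands and the per-column counting scheme are assumptions made purely for illustration. The last lines check that the reported 6.92 ns latency fits within one clock period at 144 MHz.

```python
def partial_products(a, b, width):
    """Yield (weight, bit) pairs for the unsigned product a * b.

    Each pair is one AND-gate output a_i & b_j with weight 2**(i + j).
    Unsigned operands are an assumption of this sketch.
    """
    for i in range(width):
        for j in range(width):
            yield i + j, ((a >> i) & 1) & ((b >> j) & 1)


def scalar_product_merged(xs, ys, width=16):
    """Scalar product of xs and ys using one shared set of bit columns.

    All partial-product bits from every element pair are counted per
    column (bit weight) first; the carries are resolved only once, in
    the final weighted sum. This mimics the idea of merged arithmetic
    at a behavioural level, not the gate-level structure.
    """
    if len(xs) != len(ys):
        raise ValueError("vectors must have the same length")
    columns = [0] * (2 * width - 1)  # highest weight is 2*(width-1)
    for a, b in zip(xs, ys):
        if not (0 <= a < (1 << width) and 0 <= b < (1 << width)):
            raise ValueError("operand does not fit in the given width")
        for weight, bit in partial_products(a, b, width):
            columns[weight] += bit
    # Final accumulation: weighted sum of the column counts.
    return sum(count << weight for weight, count in enumerate(columns))


xs = [3, 5, 7]
ys = [2, 4, 6]
result = scalar_product_merged(xs, ys, width=4)
reference = sum(a * b for a, b in zip(xs, ys))

# Clock period at 144 MHz in nanoseconds, compared with the
# 6.92 ns latency quoted in the abstract.
clock_period_ns = 1e3 / 144
fits_single_cycle = 6.92 < clock_period_ns
```

Deferring carry resolution to one final addition is the general motivation behind merged arithmetic; how the columns are actually compressed and laid out in silicon is the subject of the paper and is not modelled here.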
