A fully pipelined single-precision floating-point unit in the synergistic processor element of a CELL processor

The floating-point unit (FPU) in the synergistic processor element (SPE) of a CELL processor is a fully pipelined 4-way single-instruction multiple-data (SIMD) unit designed to accelerate media and data streaming with 128-bit operands. It supports 32-bit single-precision floating-point and 16-bit integer operands with two different latencies, six cycles and seven cycles, at 11 FO4 delay per stage. The FPU optimizes the performance of critical single-precision multiply-add operations. Since exact rounding, exceptions, and denormal number handling are not important to multimedia applications, IEEE correctness on single-precision floating-point numbers is sacrificed for performance and design simplicity. The unit employs fine-grained clock gating for power saving. The design has 768K transistors in 1.3 mm², fabricated in 90-nm SOI technology. Correct operation has been observed up to 5.6 GHz at 1.4 V and 56 °C, delivering 44.8 GFlops. Architecture, logic, circuits, and integration are co-designed to meet the performance, power, and area goals.
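The headline figures above can be cross-checked with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and it assumes that a multiply-add counts as two floating-point operations and that one 4-way SIMD multiply-add issues every cycle. Under those assumptions, 4 lanes × 2 flops × 5.6 GHz reproduces the quoted 44.8 GFlops. It also derives the clock period and the FO4 delay implied by 11 FO4 per stage, on the further assumption that the 11 FO4 figure covers the whole cycle.

```python
# Back-of-the-envelope check of the figures quoted in the abstract.
# All names are hypothetical; the constants come from the abstract
# except where marked as an assumption.

SIMD_LANES = 4        # 4-way SIMD: 128-bit operands / 32-bit elements
FLOPS_PER_MADD = 2    # assumption: one multiply-add = 1 multiply + 1 add
CLOCK_GHZ = 5.6       # highest frequency with correct operation observed
FO4_PER_STAGE = 11    # logic depth per pipeline stage


def peak_gflops(lanes=SIMD_LANES, flops_per_op=FLOPS_PER_MADD,
                clock_ghz=CLOCK_GHZ):
    """Peak throughput, assuming one SIMD multiply-add issues per cycle."""
    return lanes * flops_per_op * clock_ghz


def cycle_time_ps(clock_ghz=CLOCK_GHZ):
    """Clock period in picoseconds for a frequency given in GHz."""
    return 1000.0 / clock_ghz


def implied_fo4_ps(clock_ghz=CLOCK_GHZ, fo4_per_stage=FO4_PER_STAGE):
    """FO4 delay implied if the full cycle is budgeted as fo4_per_stage FO4."""
    return cycle_time_ps(clock_ghz) / fo4_per_stage


if __name__ == "__main__":
    print(f"peak throughput: {peak_gflops():.1f} GFlops")
    print(f"cycle time:      {cycle_time_ps():.1f} ps")
    print(f"implied FO4:     {implied_fo4_ps():.1f} ps")
```

The implied FO4 value is specific to the 5.6 GHz, 1.4 V operating point and should be read as a rough consistency check rather than a measured device parameter.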
