Paper information - A fully pipelined single-precision floating-point unit in the synergistic processor element of a CELL processor

The floating-point unit (FPU) in the synergistic processor element (SPE) of a CELL processor is a fully pipelined 4-way single-instruction multiple-data (SIMD) unit designed to accelerate media and data streaming with 128-bit operands. It supports 32-bit single-precision floating-point and 16-bit integer operands with two latencies, six and seven cycles, at 11 FO4 delay per stage. The FPU optimizes the performance of critical single-precision multiply-add operations. Since exact rounding, exceptions, and denormalized-number handling are not important to multimedia applications, IEEE correctness for single-precision floating-point numbers is sacrificed for performance and design simplicity. The FPU employs fine-grained clock gating to save power. The design has 768K transistors in 1.3 mm², fabricated in 90-nm SOI technology. Correct operation has been observed up to 5.6 GHz at 1.4 V and 56 °C, delivering 44.8 GFlops. Architecture, logic, circuits, and integration are codesigned to meet the performance, power, and area goals.
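The quoted peak throughput can be cross-checked from the other figures in the abstract. The sketch below is a back-of-the-envelope calculation, not taken from the paper: it assumes each of the four SIMD lanes retires one multiply-add per cycle and that a multiply-add counts as two floating-point operations. The function and constant names are hypothetical. The implied FO4 delay further assumes that the whole clock period equals the 11 FO4 stage delay at the reported 1.4 V operating point.

```python
def peak_gflops(lanes: int, flops_per_lane_per_cycle: int, clock_ghz: float) -> float:
    """Peak throughput in GFlops for a fully pipelined SIMD unit."""
    return lanes * flops_per_lane_per_cycle * clock_ghz


def cycle_time_ps(clock_ghz: float) -> float:
    """Clock period in picoseconds for a clock frequency given in GHz."""
    return 1000.0 / clock_ghz


# Figures stated in the abstract.
LANES = 4            # 4-way SIMD over 128-bit operands (4 x 32-bit)
CLOCK_GHZ = 5.6      # highest frequency with correct operation observed
FO4_PER_STAGE = 11   # stage delay in FO4 inverter delays

# Assumption (not stated in the abstract): one multiply-add per lane per
# cycle, counted as two floating-point operations.
FLOPS_PER_MULTIPLY_ADD = 2

peak = peak_gflops(LANES, FLOPS_PER_MULTIPLY_ADD, CLOCK_GHZ)
period_ps = cycle_time_ps(CLOCK_GHZ)

# Assumption: the full clock period corresponds to the 11 FO4 stage delay.
implied_fo4_ps = period_ps / FO4_PER_STAGE

print(f"peak throughput: {peak:.1f} GFlops")
print(f"clock period:    {period_ps:.1f} ps")
print(f"implied FO4:     {implied_fo4_ps:.1f} ps")
```

Under these assumptions the product 4 × 2 × 5.6 reproduces the 44.8 GFlops stated in the abstract, which suggests the figure is a peak rate for back-to-back multiply-add instructions rather than a sustained application rate.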
