Design and implementation of a floating point unit for Rigel, a massively parallel accelerator

Scientific applications rely heavily on floating point data types. Floating point operations require hardware that is both area- and power-intensive. The emergence of massively parallel architectures like Rigel creates new challenges and poses new questions with respect to floating point support. The massively parallel nature of Rigel places great emphasis on area-efficient, low-power designs. At the same time, Rigel is a general-purpose accelerator and must provide high performance for a wide class of applications. This thesis analyzes various floating point unit (FPU) components in the context of Rigel and presents a candidate FPU design that balances performance, area, and power, making it suitable for massively parallel architectures like Rigel.
