Algorithm, Architecture, and Floating-Point Unit Codesign of a Matrix Factorization Accelerator

This paper examines the mapping of algorithms encountered when solving dense linear systems and linear least-squares problems onto a custom Linear Algebra Processor. Specifically, the focus is on Cholesky, LU (with partial pivoting), and QR factorizations and their blocked algorithms. As part of the study, we expose the benefits of redesigning floating-point units and their surrounding datapaths to support these complicated operations. We show how adding moderate complexity to the architecture greatly alleviates complexities in the algorithm. We study design tradeoffs and the effectiveness of architectural modifications to demonstrate that we can improve power and performance efficiency to a level that can otherwise only be expected of full-custom ASIC designs. A feasibility study of inner kernels is extended to the blocked level and shows that, at block level, the Linear Algebra Core (LAC) can achieve high efficiencies, with up to 45 GFLOPS/W for both Cholesky and LU factorization and over 35 GFLOPS/W for QR factorization. While maintaining such efficiencies, our extensions to the MAC units achieve up to 10, 12, and 20 percent speedup for the blocked algorithms of Cholesky, LU, and QR factorization, respectively.
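To illustrate the kind of blocked algorithm the abstract refers to, the following is a minimal sketch of a right-looking blocked Cholesky factorization in NumPy. The function name, the block size `nb`, and the use of `numpy.linalg` routines for the diagonal-block factorization and triangular solve are illustrative assumptions, not the paper's implementation; the point is only the structure that makes blocking attractive on an accelerator, namely that most of the work lands in the GEMM-like trailing update.

```python
import numpy as np

def blocked_cholesky(A, nb=2):
    """Right-looking blocked Cholesky sketch: returns lower-triangular L
    with A ~= L @ L.T. nb is the block size (tuned, in practice, to the
    accelerator's local store)."""
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        kb = min(nb, n - k)
        # Factor the kb x kb diagonal block (unblocked Cholesky).
        A[k:k+kb, k:k+kb] = np.linalg.cholesky(A[k:k+kb, k:k+kb])
        L11 = A[k:k+kb, k:k+kb]
        if k + kb < n:
            # Triangular solve for the panel below the diagonal block:
            # L21 = A21 * L11^{-T}.
            A[k+kb:, k:k+kb] = np.linalg.solve(L11, A[k+kb:, k:k+kb].T).T
            L21 = A[k+kb:, k:k+kb]
            # Symmetric rank-kb update of the trailing submatrix; this
            # GEMM-rich step dominates the flop count for large n.
            A[k+kb:, k+kb:] -= L21 @ L21.T
    return np.tril(A)
```

The blocked LU and QR factorizations studied in the paper follow the same pattern: a small panel factorization with awkward scalar operations (square roots, reciprocals, pivot comparisons), followed by a large matrix-matrix update that the LAC executes at high efficiency.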