Algorithm, Architecture, and Floating-Point Unit Codesign of a Matrix Factorization Accelerator
Robert A. van de Geijn | Andreas Gerstlauer | Ardavan Pedram
[1] Kinji Kimura,et al. Accelerating the Singular Value Decomposition of Rectangular Matrices with the CSX600 and the Integrable SVD , 2007, PaCT.
[2] A. Alvandpour,et al. A 6.2-GFlops Floating-Point Multiply-Accumulator With Conditional Normalization , 2006, IEEE Journal of Solid-State Circuits.
[3] Jason N. Dale,et al. Cell Broadband Engine Architecture and its first implementation - A performance view , 2007, IBM J. Res. Dev..
[4] Dinesh Manocha,et al. LU-GPU: Efficient Algorithms for Solving Dense Linear Systems on Graphics Hardware , 2005, ACM/IEEE SC 2005 Conference (SC'05).
[5] Robert A. van de Geijn,et al. Floating Point Architecture Extensions for Optimized Matrix Factorization , 2013, 2013 IEEE 21st Symposium on Computer Arithmetic.
[6] Michael J. Schulte,et al. A combined two's complement and floating-point comparator , 2005, 2005 IEEE International Symposium on Circuits and Systems.
[7] Stuart F. Oberman,et al. Floating point division and square root algorithms and implementation in the AMD-K7™ microprocessor , 1999, Proceedings 14th IEEE Symposium on Computer Arithmetic (Cat. No.99CB36336).
[8] Mark Horowitz,et al. Energy-Efficient Floating-Point Unit Design , 2011, IEEE Transactions on Computers.
[9] Charles L. Lawson,et al. Basic Linear Algebra Subprograms for Fortran Usage , 1979, TOMS.
[10] Robert A. van de Geijn,et al. Codesign Tradeoffs for High-Performance, Low-Power Linear Algebra Architectures , 2012, IEEE Transactions on Computers.
[11] Emmanuel Agullo,et al. QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators , 2011, 2011 IEEE International Parallel & Distributed Processing Symposium.
[12] Norman P. Jouppi,et al. Architecting Efficient Interconnects for Large Caches with CACTI 6.0 , 2008, IEEE Micro.
[13] Robert A. van de Geijn,et al. On the Efficiency of Register File versus Broadcast Interconnect for Collective Communications in Data-Parallel Hardware Accelerators , 2012, 2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing.
[14] Mark A. Richards,et al. QR decomposition on GPUs , 2009, GPGPU-2.
[15] J. L. Blue,et al. A Portable Fortran Program to Find the Euclidean Norm of a Vector , 1978, TOMS.
[16] Tze Meng Low,et al. Accumulating Householder transformations, revisited , 2006, TOMS.
[17] Javier D. Bruguera,et al. High-Speed Double-Precision Computation of Reciprocal, Division, Square Root and Inverse Square Root , 2002, IEEE Trans. Computers.
[18] Gene H. Golub,et al. An analysis of the total least squares problem , 1980, Milestones in Matrix Computation.
[19] Javier D. Bruguera,et al. High-speed function approximation using a minimax quadratic interpolator , 2005, IEEE Transactions on Computers.
[20] Sriram R. Vangal,et al. A 90mW/GFlop 3.4GHz Reconfigurable Fused/Continuous Multiply-Accumulator for Floating-Point and Integer Operands in 65nm , 2010, 2010 23rd International Conference on VLSI Design.
[21] Arnaud Tisserand,et al. Reciprocation, square root, inverse square root, and some elementary functions using small multipliers , 1998, Optics & Photonics.
[22] Erdal Oruklu,et al. Realization of area efficient QR factorization using unified division, square root, and inverse square root hardware , 2009, 2009 IEEE International Conference on Electro/Information Technology.
[23] James Demmel,et al. Benchmarking GPUs to tune dense linear algebra , 2008, HiPC 2008.
[24] Robert A. van de Geijn,et al. A Linear Algebra Core Design for Efficient Level-3 BLAS , 2012, 2012 IEEE 23rd International Conference on Application-Specific Systems, Architectures and Processors.
[25] Jack J. Dongarra,et al. Solving Systems of Linear Equations on the CELL Processor Using Cholesky Factorization , 2008, IEEE Transactions on Parallel and Distributed Systems.
[26] Mei Han An,et al. Accuracy and Stability of Numerical Algorithms , 1991 .
[27] Kleanthis Psarris,et al. Synthesizing Tiled Matrix Decomposition on FPGAs , 2011, 2011 21st International Conference on Field Programmable Logic and Applications.
[28] Rafael C Núñez,et al. LAPACKrc: Fast linear algebra kernels/solvers for FPGA accelerators , 2009 .
[29] Jack J. Dongarra,et al. QR factorization for the Cell Broadband Engine , 2009, Sci. Program..
[30] Yong Dou,et al. A High Performance and Memory Efficient LU Decomposer on FPGAs , 2012, IEEE Transactions on Computers.
[31] Robert A. van de Geijn,et al. A high-performance, low-power linear algebra core , 2011, ASAP 2011 - 22nd IEEE International Conference on Application-specific Systems, Architectures and Processors.
[32] Miriam Leeser,et al. Area and performance tradeoffs in floating-point divide and square-root implementations , 1996, CSUR.