A high-performance, low-power linear algebra core

Achieving high performance while reducing power consumption is a key concern as technology scaling reaches its limits. It is well accepted that application-specific custom hardware can achieve orders-of-magnitude improvements in efficiency. The question is whether such efficiency can be maintained while providing enough flexibility to implement a broad class of operations. In this paper, we aim to answer this question for the domain of matrix computations. We propose the design of a novel linear algebra core and demonstrate that it achieves orders-of-magnitude improvements in efficiency for matrix-matrix multiplication, an operation that is representative of a broad class of matrix computations. A feasibility study shows that 47 double-precision and 104 single-precision GFLOPS/W can be achieved at 19.5 and 15.6 GFLOPS/mm², respectively, with current components and standard 45 nm technology.
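For reference, the matrix-matrix multiplication targeted by the core is the GEMM operation C := A·B + C. The C sketch below (our own illustration, with assumed row-major layout and naming) shows only the arithmetic being accelerated; it does not reflect the blocked, array-based data movement of the proposed core.

```c
#include <stddef.h>

/* Reference GEMM: C := A*B + C, with row-major m x k A, k x n B, m x n C.
 * Plain triple loop for illustration only; the proposed core performs the
 * same 2*m*n*k floating-point operations on a specialized processing array. */
void gemm_ref(size_t m, size_t n, size_t k,
              const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            double acc = C[i * n + j];
            for (size_t p = 0; p < k; p++)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = acc;
        }
    }
}
```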
