A high-performance, low-power linear algebra core

Achieving high performance while reducing power consumption is a key concern as technology scaling reaches its limits. It is well accepted that application-specific custom hardware can achieve orders-of-magnitude improvements in efficiency. The question is whether such efficiency can be maintained while providing enough flexibility to implement a broad class of operations. In this paper, we aim to answer this question for the domain of matrix computations. We propose the design of a novel linear algebra core and demonstrate that it achieves orders-of-magnitude improvements in efficiency for matrix-matrix multiplication, an operation that is representative of a broad class of matrix computations. A feasibility study shows that 47 double-precision and 104 single-precision GFLOPS/W can be achieved at 19.5 and 15.6 GFLOPS/mm², respectively, with current components and standard 45 nm technology.
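For reference, the matrix-matrix multiplication targeted by the core is the GEMM operation C := A·B + C. The C sketch below (our own illustration, with assumed row-major layout and naming) shows only the arithmetic being accelerated; it does not reflect the blocked, array-based data movement of the proposed core.

```c
#include <stddef.h>

/* Reference GEMM: C := A*B + C, with row-major m x k A, k x n B, m x n C.
 * Plain triple loop for illustration only; the proposed core performs the
 * same 2*m*n*k floating-point operations on a specialized processing array. */
void gemm_ref(size_t m, size_t n, size_t k,
              const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            double acc = C[i * n + j];
            for (size_t p = 0; p < k; p++)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = acc;
        }
    }
}
```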
