Paper Information - A Linear Algebra Core Design for Efficient Level-3 BLAS

Reducing power consumption and increasing efficiency are key concerns for many applications. It is well accepted that specialization and heterogeneity are crucial strategies for improving both power and performance. Yet how to design highly efficient processing elements while maintaining enough flexibility within a domain of applications remains a fundamental question. In this paper, we present the design of a specialized Linear Algebra Core (LAC) for an important class of computational kernels, the level-3 Basic Linear Algebra Subprograms (BLAS). We demonstrate a detailed algorithm/architecture co-design for mapping a number of level-3 BLAS operations onto the LAC. Results show that our prototype LAC achieves around 64 GFLOPS (double precision) for these operations while consuming less than 1.3 W in standard 45 nm CMOS technology. This is on par with a full-custom design and up to 50× and 10× more power-efficient than CPUs and GPUs, respectively.
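The headline figures can be turned into a power-efficiency number with simple arithmetic. The sketch below is only an illustration: the helper name `gflops_per_watt` is hypothetical, and because the abstract gives "around 64 GFLOPS" and "less than 1.3 Watts", the result should be read as an approximate lower bound rather than a measured value.

```python
def gflops_per_watt(gflops: float, watts: float) -> float:
    """Return power efficiency in GFLOPS per watt (hypothetical helper)."""
    if watts <= 0:
        raise ValueError("watts must be positive")
    return gflops / watts


# Figures quoted in the abstract: about 64 GFLOPS (double precision)
# at under 1.3 W. Using the 1.3 W upper limit gives a lower bound on
# efficiency for the quoted throughput.
lac_efficiency = gflops_per_watt(64.0, 1.3)
print(f"LAC: at least about {lac_efficiency:.1f} GFLOPS/W")
```

Under these assumptions the LAC lands at roughly 49 GFLOPS/W in double precision, which is the quantity the CPU and GPU comparison in the abstract refers to.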
