Accelerating BLAS on Custom Architecture through Algorithm-Architecture Co-design

Basic Linear Algebra Subprograms (BLAS) play a key role in high-performance and scientific computing applications. Experimentally, recent multicore processors and General Purpose Graphics Processing Units (GPGPUs) achieve only up to 15% and 57% of theoretical peak performance, at 65W and 240W respectively, for compute-bound operations such as Double/Single Precision General Matrix Multiplication (XGEMM). For bandwidth-bound operations such as Single/Double Precision Matrix-Vector Multiplication (XGEMV), the achieved performance is merely 5% and 7% of theoretical peak in multicores and GPGPUs respectively. Attaining high performance in BLAS therefore requires moving away from conventional general-purpose designs and toward a customized accelerator tailored for BLAS through algorithm-architecture co-design. In this paper, we present acceleration of Level-1 (vector operations), Level-2 (matrix-vector operations), and Level-3 (matrix-matrix operations) BLAS through algorithm-architecture co-design on a Coarse-Grained Reconfigurable Architecture (CGRA). We choose the REDEFINE CGRA as the platform for our experiments, since REDEFINE can be adapted to a domain of interest through tailor-made Custom Function Units (CFUs). For efficient sequential realization of BLAS, we present the design of a Processing Element (PE) and perform micro-architectural enhancements in the PE to achieve up to 74% of the PE's theoretical peak performance in DGEMM, 40% in DGEMV, and 20% in double-precision inner product (DDOT). We attach this PE to the REDEFINE CGRA as a CFU and show the scalability of our solution. Finally, we show a performance improvement of 3-140x for the PE over commercially available Intel micro-architectures, the ClearSpeed CSX700, FPGAs, and Nvidia GPGPUs.

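The abstract's contrast between compute-bound and bandwidth-bound kernels follows from the arithmetic intensity of each BLAS level. As a minimal sketch of that reasoning (not the paper's PE implementation), the naive C kernels below annotate each level with its flop-to-memory-traffic ratio; the row-major layout, square-matrix restriction, and function names are illustrative assumptions.

#include <stdio.h>

/* Illustrative reference kernels for the three BLAS levels.
 * Production BLAS uses blocked, vectorized code; these naive loops
 * only exhibit each level's flops-per-word ratio. Matrices are
 * row-major and n x n for simplicity. */

/* Level-1: DDOT -- 2n flops over 2n loaded words: O(1) flops/word,
 * so performance is capped by memory bandwidth. */
double ddot(int n, const double *x, const double *y) {
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        acc += x[i] * y[i];
    return acc;
}

/* Level-2: DGEMV, y = A*x -- 2n^2 flops over ~n^2 loaded words:
 * still O(1) flops/word, hence also bandwidth bound. */
void dgemv(int n, const double *A, const double *x, double *y) {
    for (int i = 0; i < n; i++) {
        double acc = 0.0;
        for (int j = 0; j < n; j++)
            acc += A[i * n + j] * x[j];
        y[i] = acc;
    }
}

/* Level-3: DGEMM, C = A*B -- 2n^3 flops over ~3n^2 words of data:
 * O(n) flops/word, so with blocking for data reuse the kernel is
 * compute bound and can approach the peak of the floating-point units. */
void dgemm(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double acc = 0.0;
            for (int k = 0; k < n; k++)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = acc;
        }
}

int main(void) {
    double x[3] = {1, 2, 3}, y[3] = {4, 5, 6};
    printf("ddot = %f\n", ddot(3, x, y));  /* prints 32.000000 */
    return 0;
}

These intensity ratios, O(n) flops per word for DGEMM versus O(1) for DGEMV and DDOT, are what separate the 74% of peak reported for DGEMM from the 40% and 20% figures for the bandwidth-bound kernels.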