Algorithm-oriented design of efficient many-core architectures applied to dense matrix multiplication

Abstract: Recent integrated circuit technologies have made it possible to design parallel architectures with hundreds of cores on a single chip. The design space of these architectures is huge, with many architectural options. Exploring the design space becomes even more difficult if, beyond performance and area, we also consider metrics such as performance efficiency and area efficiency, where the designer seeks the architecture with the best sustainable fraction of peak performance and the best performance per unit of chip area. In this paper we present an algorithm-oriented approach to the design of a many-core architecture. Instead of exploring the design space based on the experimental execution results of a particular benchmark of algorithms, our approach is to formally analyze the algorithms with respect to the main architectural aspects and to determine how each architectural aspect relates to the performance of the architecture when running an algorithm or a set of algorithms. The architectural aspects considered include the number of cores, the local memory available in each core, the communication bandwidth between the many-core architecture and the external memory, and the memory hierarchy. To exemplify the approach, we carried out a theoretical analysis of a dense matrix multiplication algorithm and derived an equation that relates the number of execution cycles to the architectural parameters. Based on this equation, a many-core architecture was designed. The results indicate that a 100 mm² integrated circuit implementing the proposed architecture in a 65 nm technology achieves 464 GFLOPS (double-precision floating-point) with a memory bandwidth of 16 GB/s, which corresponds to a performance efficiency of 71%. In a 45 nm technology, a 100 mm² chip attains 833 GFLOPS, corresponding to 84% of peak performance. These figures surpass those of previous many-core architectures, including state-of-the-art designs targeted specifically at high performance for matrix multiplication, except for area efficiency, which is limited by the lower memory bandwidth considered.
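The paper's exact cycle-count equation is not reproduced in this abstract. As a rough illustration of the kind of analytical model the approach relies on, the following Python sketch estimates the sustained performance of a blocked dense matrix multiplication from the number of cores, the per-core local memory, and the external memory bandwidth, using a simple roofline-style bound. All parameter names, values, and the model itself are illustrative assumptions, not the authors' actual equation.

import math

def sustained_gflops(num_cores, flops_per_core_per_cycle, clock_ghz,
                     local_mem_bytes_per_core, bandwidth_gb_s):
    # Assumption: cores collectively hold three b x b double-precision
    # blocks (A, B, and the C accumulator), i.e. 3 * b^2 * 8 bytes on chip.
    total_mem = num_cores * local_mem_bytes_per_core
    b = int(math.sqrt(total_mem / (3 * 8)))
    # Classic blocking analysis: roughly 2*n^3/b words of external traffic
    # for 2*n^3 flops, giving an arithmetic intensity of ~b/8 flops per byte.
    intensity = b / 8.0
    peak = num_cores * flops_per_core_per_cycle * clock_ghz   # GFLOPS
    memory_bound = bandwidth_gb_s * intensity                 # GFLOPS
    return min(peak, memory_bound), peak

# Hypothetical configuration (all numbers assumed for illustration only).
sustained, peak = sustained_gflops(num_cores=256,
                                   flops_per_core_per_cycle=2,  # one FMA/cycle
                                   clock_ghz=1.25,
                                   local_mem_bytes_per_core=4096,
                                   bandwidth_gb_s=16)
print("sustained ~%.0f GFLOPS (%.0f%% of %.0f GFLOPS peak)"
      % (sustained, 100 * sustained / peak, peak))

Under a model of this shape, enlarging the on-chip blocks raises the arithmetic intensity until the design becomes compute-bound, which is why local memory size and external bandwidth appear alongside the number of cores as first-class design parameters in the paper's exploration.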
