Analyzing the energy-efficiency of sparse matrix multiplication on heterogeneous systems: A comparative study of GPU, Xeon Phi and FPGA

Hardware accelerators have become the most prominent means of meeting the demanding performance and energy-efficiency constraints of modern computer systems. The prevalent type of hardware accelerator in the high-performance computing domain is the PCIe-attached co-processor, to which the CPU can offload compute-intensive tasks. In this paper, we analyze the performance, power, and energy efficiency of such accelerators for sparse matrix multiplication kernels. Improving the efficiency of sparse matrix operations is highly important because they are at the core of graph analytics algorithms, which are in turn key to many big data knowledge discovery workloads. Our study involves GPU, Xeon Phi, and FPGA co-processors in order to cover the vast majority of hardware accelerators used in modern HPC systems. To compare the devices at the same level of implementation quality, we use vendor-optimized libraries for which published results exist. From our experiments we deduce that none of the compared devices generally dominates in terms of energy efficiency, and that the optimal solution depends on the actual sparse matrix data, the data transfer requirements, and the applied efficiency metric. We also show that the combined use of multiple accelerators can further improve the system's performance and efficiency by up to 11% and 18%, respectively.
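To make the dependence on the applied efficiency metric concrete, the following is a minimal sketch, not the setup used in the study. It assumes sparse matrix-vector multiplication (SpMV) in CSR format as a representative kernel, counts 2 floating-point operations per non-zero, and uses two commonly applied metrics (performance per watt and energy-delay product). The abstract does not name the metrics used in the paper, and the function names `spmv_csr` and `efficiency_metrics` as well as all numbers in the usage note are hypothetical.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR format.

    values  : non-zero entries, row by row
    col_idx : column index of each non-zero
    row_ptr : row_ptr[i]..row_ptr[i+1] delimits the non-zeros of row i
    """
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]
    return y


def efficiency_metrics(nnz, runtime_s, avg_power_w):
    """Derive illustrative efficiency metrics for one SpMV run.

    Assumption: one multiply and one add per non-zero (2 * nnz flops).
    """
    flops = 2.0 * nnz
    gflops = flops / runtime_s / 1e9
    energy_j = avg_power_w * runtime_s
    return {
        "gflops": gflops,
        "gflops_per_watt": gflops / avg_power_w,
        "energy_j": energy_j,
        "edp_js": energy_j * runtime_s,  # energy-delay product
    }
```

As a usage example with made-up numbers: a device that needs 1 ms at 100 W and a device that needs 3 ms at 25 W, for the same matrix, can be compared by calling `efficiency_metrics` on both. The second device is ahead on performance per watt, while the first is ahead on energy-delay product, which is one way the ranking of devices can change with the chosen metric.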
