CPU versus GPU: which can perform matrix computation faster—performance comparison for basic linear algebra subprograms

Abstract Matrix computation is a core component of machine learning and artificial intelligence, and fast matrix computation can greatly accelerate many large-scale computational projects. The Basic Linear Algebra Subprograms (BLAS) were proposed to classify common matrix and vector operations into levels and to provide a standardized interface for them. Currently, the most commonly used processors in heterogeneous computing platforms are the central processing unit (CPU) and the graphics processing unit (GPU), and BLAS has been implemented on both. However, because algorithms and hardware have different characteristics, a particular matrix routine should be designed for a particular processor, and it is therefore important to choose the right processor for a given matrix computation. This paper first briefly reviews BLAS, then introduces the architectures and optimization methods of the CPU and the GPU. The performance of different BLAS subroutines on each processor is studied through experiments. Finally, we discuss the reasons for the observed differences and propose a processor-selection scheme for matrix computations.
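
For context, the sketch below runs the same double-precision matrix-matrix multiplication (the Level 3 BLAS routine GEMM) on the CPU through CBLAS and on the GPU through cuBLAS. It is a minimal illustration, not the paper's benchmark: the matrix order n, the use of OpenBLAS (or Intel MKL) on the CPU side, the constant test matrices, and the simple wall-clock timing are all assumptions made here for demonstration.

    // Minimal sketch (assumed setup, not the paper's benchmark):
    // time DGEMM (C = alpha*A*B + beta*C) on CPU via CBLAS and on GPU via cuBLAS.
    // Example build (paths/libraries are assumptions):
    //   nvcc gemm_compare.cu -lcublas -lopenblas -o gemm_compare
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <cblas.h>          // CPU BLAS (OpenBLAS or Intel MKL)
    #include <cuda_runtime.h>
    #include <cublas_v2.h>      // GPU BLAS

    static double wall_seconds(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void) {
        const int n = 2048;                 /* matrix order, illustrative choice */
        const double alpha = 1.0, beta = 0.0;
        size_t bytes = (size_t)n * n * sizeof(double);

        double *A = (double *)malloc(bytes);
        double *B = (double *)malloc(bytes);
        double *C = (double *)malloc(bytes);
        for (size_t i = 0; i < (size_t)n * n; ++i) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

        /* --- CPU: Level 3 BLAS DGEMM --- */
        double t0 = wall_seconds();
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, alpha, A, n, B, n, beta, C, n);
        printf("CPU DGEMM: %.3f s\n", wall_seconds() - t0);

        /* --- GPU: cuBLAS DGEMM (column-major; constant test matrices make layout irrelevant here) --- */
        double *dA, *dB, *dC;
        cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
        cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);
        t0 = wall_seconds();
        cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);
        cudaDeviceSynchronize();            /* cuBLAS launches are asynchronous */
        printf("GPU DGEMM: %.3f s (excluding host<->device transfers)\n", wall_seconds() - t0);

        cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);
        cublasDestroy(handle);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        free(A); free(B); free(C);
        return 0;
    }

Note that the GPU timing above deliberately excludes host-to-device and device-to-host transfers; whether to count them is exactly the kind of workload-dependent consideration that affects which processor is the right choice for a given matrix computation.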
