On optimizing machine learning workloads via kernel fusion

Exploiting parallel architectures has become critical to scalable machine learning (ML). Since a wide range of ML algorithms are built from linear algebraic operators, GPUs paired with BLAS libraries are a natural fit. Two approaches are commonly pursued: (i) developing GPU-accelerated implementations of complete ML algorithms, and (ii) developing GPU kernels for primitive linear algebraic operators, such as matrix-vector multiplication, from which ML algorithms are then composed. This paper extends the latter approach by developing fused kernels for combinations of primitive operators that occur frequently in popular ML algorithms. We identify the generic computation pattern alpha * X^T (v .* (X * y)) + beta * z, where .* denotes element-wise multiplication, and its various instantiations. We develop a fused kernel that optimizes this computation on GPUs, with specialized techniques for both sparse and dense matrices. Fusion not only reduces the cost of data loads through improved temporal locality but also enables further optimizations such as coarsening and hierarchical aggregation of partial results. We also present an analytical model that uses input data characteristics and available GPU resources to estimate near-optimal kernel launch parameters. The proposed approach yields speedups of 2x to 67x for different instances of the generic pattern, compared to launching multiple operator-level kernels from GPU-accelerated libraries. We conclude by demonstrating the effectiveness of the approach in improving end-to-end performance on a complete ML algorithm.
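To make the pattern concrete, the following is a minimal CUDA sketch of the dense case, written from the abstract alone. The kernel name, the one-thread-per-row mapping, and the atomicAdd-based accumulation are illustrative assumptions, not the paper's implementation (the paper applies coarsening and hierarchical aggregation of partial results rather than plain atomics). The output vector is assumed to be pre-initialized to beta * z before launch.

// Hedged sketch of a fused kernel for the dense case of the pattern:
// out += alpha * X^T * (v .* (X * y)), with out pre-initialized to beta * z.
#include <cuda_runtime.h>

__global__ void fused_xtvxy(const float* __restrict__ X,  // m x n, row-major
                            const float* __restrict__ y,  // length n
                            const float* __restrict__ v,  // length m
                            float* out,                   // length n, holds beta*z
                            int m, int n, float alpha) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per row of X
    if (i >= m) return;

    // Pass 1: s_i = alpha * v_i * dot(X[i,:], y). The intermediate vector
    // v .* (X * y) is never materialized in global memory.
    float dot = 0.0f;
    for (int j = 0; j < n; ++j)
        dot += X[(size_t)i * n + j] * y[j];
    float s = alpha * v[i] * dot;

    // Pass 2: out += s_i * X[i,:]. Row i was just streamed through the
    // cache, so these re-reads exhibit the temporal locality that fusion
    // buys. atomicAdd is a stand-in for the paper's hierarchical
    // aggregation of per-block partial results.
    for (int j = 0; j < n; ++j)
        atomicAdd(&out[j], s * X[(size_t)i * n + j]);
}

By contrast, an unfused pipeline would launch three or four operator-level library kernels (GEMV, element-wise multiply, transposed GEMV, AXPY) and write the intermediate m-vector to global memory between them; the fused version avoids that round trip entirely.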
