Exploiting GPUs for Efficient Gradient Boosting Decision Tree Training

In this paper, we present a novel parallel implementation for training Gradient Boosting Decision Trees (GBDTs) on Graphics Processing Units (GPUs). Thanks to their excellent results on classification and regression tasks and to open-source libraries such as XGBoost, GBDTs have become very popular in recent years and have won many machine learning and data mining competitions. Although GPUs have been used successfully to accelerate many machine learning applications, developing an efficient GPU-based GBDT algorithm remains challenging. The key challenges include irregular memory accesses, many sorting operations with small inputs, and varying granularities of data parallelism during tree construction. To tackle these challenges on GPUs, we propose several novel techniques, including (i) Run-Length Encoding compression and dynamic thread/block workload allocation, (ii) data partitioning based on stable sort, together with fast and memory-efficient attribute ID lookup in node splitting, (iii) finding approximate split points using two-stage histogram building, (iv) sparsity-aware histogram building and histogram subtraction to reduce the histogram building workload, (v) reusing intermediate training results for efficient gradient computation, and (vi) exploiting multiple GPUs to handle larger data sets efficiently. Our experimental results show that our algorithm, named ThunderGBM, can be 10 times faster than the state-of-the-art libraries (i.e., XGBoost, LightGBM, and CatBoost) running on a relatively high-end workstation with 20 CPU cores. Compared with the GPU versions of those libraries, ThunderGBM can handle higher-dimensional problems on which they become extremely slow or simply fail. For the data sets the existing GPU libraries can handle, ThunderGBM achieves up to 10 times speedup on the same hardware, which demonstrates the significance of our GPU optimizations.
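Technique (i) above relies on Run-Length Encoding (RLE), which compresses runs of repeated feature values into (value, count) pairs; sorted or low-cardinality feature columns compress especially well. Below is a minimal CPU-side sketch of the idea (ThunderGBM's actual GPU implementation is parallel and differs in detail; the function names here are illustrative):

```python
# Minimal CPU-side sketch of Run-Length Encoding (RLE), the compression
# scheme named in the abstract. This is an illustration of the idea, not
# ThunderGBM's actual (parallel) GPU kernel.
def rle_encode(values):
    """Compress a sequence into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((v, 1))              # start a new run
    return runs

def rle_decode(runs):
    """Expand (value, run_length) pairs back to the original sequence."""
    return [v for v, n in runs for _ in range(n)]

# A feature column with repeated values compresses from 9 items to 3 runs.
feature_column = [7, 7, 7, 3, 3, 9, 9, 9, 9]
encoded = rle_encode(feature_column)
assert encoded == [(7, 3), (3, 2), (9, 4)]
assert rle_decode(encoded) == feature_column
```

On a GPU, the encoding step itself is typically expressed with parallel primitives (e.g., a segmented reduction over equal-value runs) rather than this sequential loop.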
Moreover, the models trained by ThunderGBM are identical to those trained by XGBoost, and have similar quality to those trained by LightGBM and CatBoost.

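Technique (ii), data partitioning based on stable sort, reorders instances by their assigned node ID after each split so that every node's instances become contiguous in memory, which turns the per-node work into regular, coalesced accesses. Stability matters because it preserves the relative order of instances within a node. A sequential sketch of the idea (illustrative names; the GPU version would use a parallel stable sort such as radix sort):

```python
# Sketch of data partitioning via stable sort: after a split assigns each
# instance a new node id, a stable sort by node id groups each node's
# instances contiguously while preserving their relative order.
# Illustrative only; ThunderGBM uses a parallel sort on the GPU.
def partition_by_node(instance_ids, node_ids):
    """Stable-sort instances by node id; returns (sorted_ids, sorted_node_ids)."""
    order = sorted(range(len(instance_ids)), key=lambda i: node_ids[i])
    return ([instance_ids[i] for i in order], [node_ids[i] for i in order])

inst = [10, 11, 12, 13, 14, 15]
node = [2, 1, 2, 1, 2, 1]  # node each instance belongs to after the split
sorted_inst, sorted_node = partition_by_node(inst, node)
assert sorted_inst == [11, 13, 15, 10, 12, 14]  # order within each node preserved
assert sorted_node == [1, 1, 1, 2, 2, 2]        # each node now contiguous
```

Python's `sorted` is guaranteed stable, which is what makes the within-node order assertion hold.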
[1]  Inderjit S. Dhillon,et al.  Gradient Boosted Decision Trees for High Dimensional Sparse Output , 2017, ICML.

[2]  Håkan Grahn,et al.  CudaRF: A CUDA-based implementation of Random Forests , 2011, 2011 9th IEEE/ACS International Conference on Computer Systems and Applications (AICCSA).

[3]  Miriam Leeser,et al.  Accelerating K-Means clustering with parallel implementations and GPU computing , 2015, 2015 IEEE High Performance Extreme Computing Conference (HPEC).

[4]  Andy Liaw,et al.  Classification and Regression by randomForest , 2007 .

[5]  Solomon W. Golomb,et al.  Run-length encodings (Corresp.) , 1966, IEEE Trans. Inf. Theory.

[6]  Maya Gokhale,et al.  Accelerating a Random Forest Classifier: Multi-Core, GP-GPU, or FPGA? , 2012, 2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines.

[7]  Tianqi Chen,et al.  XGBoost: A Scalable Tree Boosting System , 2016, KDD.

[8]  Jianlong Zhong,et al.  Medusa: Simplified Graph Processing on GPUs , 2014, IEEE Transactions on Parallel and Distributed Systems.

[9]  Shaohua Kevin Zhou,et al.  Fast boosting trees for classification, pose detection, and boundary detection on a GPU , 2011, CVPR 2011 WORKSHOPS.

[10]  Aziz Nasridinov,et al.  Decision tree construction on GPU: ubiquitous parallel computing approach , 2013, Computing.

[11]  Shyan-Ming Yuan,et al.  CUDT: A CUDA Based Decision Tree Algorithm , 2014, TheScientificWorldJournal.

[12]  Shreyasee Amin,et al.  Assessing fracture risk using gradient boosting machine (GBM) models , 2012, Journal of bone and mineral research : the official journal of the American Society for Bone and Mineral Research.

[13]  Qiang Wu,et al.  McRank: Learning to Rank Using Multiple Classification and Gradient Boosting , 2007, NIPS.

[14]  Sebastian Nowozin,et al.  Decision Tree Fields: An Efficient Non-parametric Random Field Model for Image Labeling , 2013 .

[15]  Eibe Frank,et al.  Accelerating the XGBoost algorithm using GPU computing , 2017, PeerJ Comput. Sci..

[16]  Anna Veronika Dorogush,et al.  CatBoost: unbiased boosting with categorical features , 2017, NeurIPS.

[17]  Stephen Tyree,et al.  Parallel boosted regression trees for web search ranking , 2011, WWW.

[18]  Damjan Strnad,et al.  Parallel construction of classification trees on a GPU , 2016, Concurr. Comput. Pract. Exp..

[19]  Andreas Holzinger,et al.  Data Mining with Decision Trees: Theory and Applications , 2015, Online Inf. Rev..

[20]  HeBingsheng,et al.  Revisiting co-processing for hash joins on the coupled CPU-GPU architecture , 2013, VLDB 2013.

[21]  Bingsheng He,et al.  ThunderSVM: A Fast SVM Library on GPUs and CPUs , 2018, J. Mach. Learn. Res..

[22]  J. Friedman Greedy function approximation: A gradient boosting machine. , 2001 .

[23]  Roberto J. Bayardo,et al.  PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce , 2009, Proc. VLDB Endow..

[24]  Sanjeev Khanna,et al.  Space-efficient online computation of quantile summaries , 2001, SIGMOD '01.

[25]  Bingsheng He,et al.  Efficient Gradient Boosted Decision Tree Training on GPUs , 2018, 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

[26]  C. Apte,et al.  Data mining with decision trees and decision rules , 1997, Future Gener. Comput. Syst..

[27]  Pranita D. Tamma,et al.  A Clinical Decision Tree to Predict Whether a Bacteremic Patient Is Infected With an Extended-Spectrum β-Lactamase-Producing Organism. , 2016, Clinical infectious diseases : an official publication of the Infectious Diseases Society of America.

[28]  Bingsheng He,et al.  Relational joins on graphics processors , 2008, SIGMOD Conference.

[29]  Nathan Bell,et al.  Thrust: A Productivity-Oriented Library for CUDA , 2012 .

[30]  Shie Mannor,et al.  A Tutorial on the Cross-Entropy Method , 2005, Ann. Oper. Res..

[31]  Eibe Frank,et al.  Accelerating the XGBoost algorithm using GPU computing , 2017, PeerJ Comput. Sci..

[32]  Toby Sharp,et al.  Implementing Decision Trees and Forests on a GPU , 2008, ECCV.

[33]  Bingsheng He,et al.  Revisiting Co-Processing for Hash Joins on the Coupled CPU-GPU Architecture , 2013, Proc. VLDB Endow..

[34]  Tie-Yan Liu,et al.  LightGBM: A Highly Efficient Gradient Boosting Decision Tree , 2017, NIPS.

[35]  Henrik Boström,et al.  gpuRF and gpuERT: Efficient and Scalable GPU Algorithms for Decision Tree Ensembles , 2014, 2014 IEEE International Parallel & Distributed Processing Symposium Workshops.