Efficient Gradient Boosted Decision Tree Training on GPUs

In this paper, we present a novel parallel implementation for training Gradient Boosting Decision Trees (GBDTs) on Graphics Processing Units (GPUs). Thanks to the wide use of the open-source XGBoost library, GBDTs have become very popular in recent years and have won many awards in machine learning and data mining competitions. Although GPUs have successfully accelerated many machine learning applications, developing a GPU-based GBDT algorithm poses several key challenges, including irregular memory accesses, many small sorting operations, and varying data-parallel granularities in tree construction. To tackle these challenges on GPUs, we propose several novel techniques, including Run-length Encoding compression, dynamic allocation of thread/block workloads, and reuse of intermediate training results for efficient gradient computation. Our experimental results show that our algorithm, named GPU-GBDT, is often 10 to 20 times faster than the sequential version of XGBoost, and achieves a 1.5 to 2 times speedup over a 40-threaded XGBoost running on a relatively high-end workstation with 20 CPU cores. Moreover, GPU-GBDT outperforms its CPU counterpart by 2 to 3 times in terms of performance-price ratio.
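
Run-length Encoding, one of the techniques named above, can be illustrated with a minimal CPU-side sketch. This is not the authors' GPU implementation. It assumes the attribute values of one feature are already sorted, so equal values sit next to each other, and the function names `rle_encode` and `rle_decode` are hypothetical.

```python
from typing import List, Tuple


def rle_encode(values: List[float]) -> List[Tuple[float, int]]:
    """Collapse consecutive equal values into (value, run_length) pairs."""
    runs: List[Tuple[float, int]] = []
    for v in values:
        if runs and runs[-1][0] == v:
            # Same value as the current run: extend the run by one.
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            # New value: start a new run of length one.
            runs.append((v, 1))
    return runs


def rle_decode(runs: List[Tuple[float, int]]) -> List[float]:
    """Expand (value, run_length) pairs back into the original sequence."""
    out: List[float] = []
    for v, n in runs:
        out.extend([v] * n)
    return out


# Illustrative data (an assumption): sorted values of one feature with repeats.
sorted_feature = [0.5, 0.5, 0.5, 1.2, 1.2, 3.0]
encoded = rle_encode(sorted_feature)
decoded = rle_decode(encoded)
```

In this sketch six values compress to three runs. The benefit depends on how many repeated values each feature contains, so a feature with mostly distinct values would gain little.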