swGBDT: Efficient Gradient Boosted Decision Tree on Sunway Many-Core Processor

Gradient Boosted Decision Trees (GBDT) is a practical machine learning method that is widely used in application fields such as recommendation systems. Optimizing the performance of GBDT on heterogeneous many-core processors poses several challenges, such as designing an efficient parallelization scheme and mitigating the latency of irregular memory accesses. In this paper, we propose swGBDT, an efficient GBDT implementation on the Sunway processor. In swGBDT, we divide the 64 CPEs in a core group into multiple roles, such as loader, saver and worker, in order to hide the latency of irregular global memory accesses. In addition, we partition the data at two granularities, block and tile, to better utilize the LDM on each CPE for data caching. Moreover, we utilize register communication for collaboration among CPEs. Our evaluation with representative datasets shows that swGBDT achieves average speedups of 4.6\(\times\) over the serial implementation on the MPE and 2\(\times\) over parallel XGBoost on the CPEs.
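The two-granularity partitioning mentioned above can be illustrated with a minimal sketch. It is not the swGBDT implementation: the function name `partition`, the contiguous block split, and the fixed `tile_size` are assumptions made for illustration only. The idea shown is that each block is a coarse chunk of the data, and each tile is a smaller piece of a block, sized so that it could fit in a small on-chip buffer such as the LDM.

```python
from typing import List, Tuple

def partition(n_items: int, n_blocks: int, tile_size: int) -> List[List[Tuple[int, int]]]:
    """Hypothetical two-level split of the index range [0, n_items).

    Returns one list per block; each list holds half-open (start, end)
    tile ranges of at most tile_size items.
    """
    if n_blocks <= 0 or tile_size <= 0:
        raise ValueError("n_blocks and tile_size must be positive")
    base, extra = divmod(n_items, n_blocks)
    blocks = []
    start = 0
    for b in range(n_blocks):
        # The first `extra` blocks take one additional item so sizes differ by at most 1.
        end = start + base + (1 if b < extra else 0)
        # Cut the block into tiles; the last tile of a block may be shorter.
        tiles = [(t, min(t + tile_size, end)) for t in range(start, end, tile_size)]
        blocks.append(tiles)
        start = end
    return blocks

if __name__ == "__main__":
    for block_id, tiles in enumerate(partition(10, 3, 2)):
        print(block_id, tiles)
```

In a real many-core setting, the tile size would be derived from the buffer capacity and the element size rather than passed as a constant; that choice is outside what the abstract specifies.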
