Booster: An Accelerator for Gradient Boosting Decision Trees Training and Inference

Recent breakthroughs in machine learning (ML) have sparked hardware innovation for efficient execution of emerging ML workloads. For instance, owing to recent refinements and high-performance implementations, well-established gradient boosting decision tree (GBT) models (e.g., XGBoost) have demonstrated their dominance in commercially important contexts, such as tabular datasets (e.g., relational databases and spreadsheets). Unfortunately, GBT training and inference are time-consuming (e.g., several hours of training for large datasets). Despite their importance, GBTs have not been targeted for hardware acceleration as much as neural networks. We propose Booster, a novel accelerator for GBTs based on their unique characteristics. We observe that the dominant steps of GBT training and inference (accounting for 90-98% of the time) involve simple, fine-grained, independent operations on small-footprint data structures (e.g., histograms and shallow trees); that is, GBT is bound by on-chip memory bandwidth. Unfortunately, existing multicores and GPUs do not support massively parallel data-structure accesses that are irregular and data-dependent. By employing a scalable sea-of-small-SRAMs approach and an SRAM bandwidth-preserving mapping of data record fields to the SRAMs, called group-by-field mapping, Booster achieves significantly more parallelism (e.g., 3200-way parallelism) than multicores and GPUs. In addition, Booster employs a redundant data representation that significantly lowers the memory bandwidth demand. Our simulations reveal that Booster achieves 11.4x and 6.4x speedups for training, and 45x and 22x (21x and 11x) speedups for offline (online) inference, over an ideal 32-core multicore and an ideal GPU, respectively. Based on ASIC synthesis of FPGA-validated RTL in a 45 nm technology, we estimate that a Booster chip occupies 60 mm² of area and dissipates 23 W when operating at a 1 GHz clock.
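
To make the bandwidth-bound kernel concrete, the sketch below is a minimal, software-only illustration (plain NumPy; not Booster's hardware, and all function and variable names are illustrative) of the histogram-building and split-scanning steps used by standard histogram-based GBT training. Each record update reads a data-dependent bin index and performs a tiny accumulation into a small per-feature histogram, which is the fine-grained, irregular access pattern described above.

```python
# Illustrative sketch of histogram-based GBT split finding (not Booster's RTL).
import numpy as np

def build_histograms(binned_features, gradients, hessians, row_indices, n_bins=256):
    """Accumulate per-feature gradient/hessian histograms for one tree node.

    binned_features: (n_rows, n_features) uint8 array of pre-binned feature values
    gradients, hessians: (n_rows,) per-record first/second-order statistics
    row_indices: indices of the records that reached this node
    """
    n_features = binned_features.shape[1]
    grad_hist = np.zeros((n_features, n_bins))
    hess_hist = np.zeros((n_features, n_bins))
    for r in row_indices:                      # fine-grained, independent per-record work
        for f in range(n_features):
            b = binned_features[r, f]          # data-dependent bin index
            grad_hist[f, b] += gradients[r]    # small-footprint, irregular update
            hess_hist[f, b] += hessians[r]
    return grad_hist, hess_hist

def best_split(grad_hist, hess_hist, lam=1.0):
    """Scan each feature's histogram for the split with the largest second-order
    gain (XGBoost-style gain, up to constant factors)."""
    G = grad_hist.sum(axis=1, keepdims=True)
    H = hess_hist.sum(axis=1, keepdims=True)
    gl, hl = np.cumsum(grad_hist, axis=1), np.cumsum(hess_hist, axis=1)
    gr, hr = G - gl, H - hl
    gain = gl**2 / (hl + lam) + gr**2 / (hr + lam) - G**2 / (H + lam)
    f, b = np.unravel_index(np.argmax(gain), gain.shape)
    return f, b, gain[f, b]
```

In software, the inner accumulation loop is limited by how fast the per-feature histograms can be read and updated; Booster's group-by-field mapping spreads these updates across many small SRAMs so they proceed in parallel.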
