Gradient Boosted Decision Trees for High Dimensional Sparse Output

In this paper, we study gradient boosted decision trees (GBDT) when the output space is high dimensional and sparse. For example, in multilabel classification the output is an L-dimensional 0/1 vector, where L is the number of labels, which can grow to millions and beyond in many modern applications. We show that vanilla GBDT can easily run out of memory or incur prohibitively long running times in this regime, and we propose a new GBDT variant, GBDT-SPARSE, that resolves this problem by employing L0 regularization. We then discuss in detail how to exploit this sparsity throughout GBDT training, including splitting the nodes, computing the sparse residual, and predicting in sub-linear time. Finally, we apply our algorithm to extreme multilabel classification problems and show that GBDT-SPARSE achieves an order of magnitude improvement in model size and prediction time over existing methods while yielding comparable accuracy.
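
To make the role of the L0 regularization concrete, below is a minimal sketch (not the authors' implementation) of how a sparse leaf value can be computed under squared loss: the unconstrained optimum is a per-coordinate average of the residuals routed to the leaf, and an L0 budget of at most k nonzeros is met by keeping only the k largest-magnitude coordinates. The function name `sparse_leaf_value` and the parameters `k` and `lam` (an L2 shrinkage term, common in GBDT objectives but not stated in the abstract) are illustrative assumptions.

```python
# Minimal sketch of an L0-constrained leaf value for a multi-output
# regression tree under squared loss. This is NOT the paper's code;
# `sparse_leaf_value`, `k`, and `lam` are illustrative names.
import numpy as np
from scipy import sparse


def sparse_leaf_value(residuals, k, lam=1.0):
    """Compute an L-dimensional leaf prediction with at most k nonzeros.

    residuals : (n_leaf x L) sparse matrix of current residuals for the
                instances routed to this leaf.
    k         : L0 budget, i.e. maximum number of nonzero outputs.
    lam       : L2 shrinkage on the leaf value (an assumption, not from
                the abstract).
    """
    n_leaf = residuals.shape[0]
    # Per-coordinate ridge optimum under squared loss: the objective is
    # separable, so each coordinate is sum(residuals) / (n_leaf + lam).
    value = np.asarray(residuals.sum(axis=0)).ravel() / (n_leaf + lam)
    # Exact L0 projection: zero out everything except the k coordinates
    # with the largest magnitude.
    if k < value.size:
        drop = np.argpartition(np.abs(value), -k)[:-k]
        value[drop] = 0.0
    return sparse.csr_matrix(value)


# Toy usage: 4 instances in one leaf, L = 6 labels, keep at most k = 2.
R = sparse.random(4, 6, density=0.4, format="csr", random_state=0)
print(sparse_leaf_value(R, k=2).toarray())
```

Because the squared-loss objective is separable across output coordinates, this hard-thresholding step solves the L0-constrained subproblem at a leaf exactly, which is what makes the sparsity constraint cheap to enforce during training.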
