Parallel Training GBRT Based on KMeans Histogram Approximation for Big Data

Gradient Boosting Regression Tree (GBRT), one of the state-of-the-art ranking algorithms widely used in industry, faces challenges in the big data era. As dataset sizes grow rapidly, the iterative training process of GBRT becomes very time-consuming on large-scale data. In this paper, we aim to speed up the training of each tree in the GBRT framework. First, we propose a novel KMeans histogram building algorithm that has lower time complexity and is more efficient than the cutting-edge histogram building method. We then put forward an approximation algorithm that combines kernel density estimation with the histogram technique to improve accuracy. We conduct a variety of experiments on both public Learning To Rank (LTR) benchmark datasets and large-scale real-world datasets from the Baidu search engine. The experimental results show that our proposed parallel training algorithm outperforms the state-of-the-art parallel GBRT algorithm, achieving nearly 2x speedup with better accuracy. Our algorithm also achieves near-linear scalability.
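The core idea of histogram-based tree building is to summarize each feature's values into a small number of bins, so that candidate split points are evaluated per bin rather than per sample. As a rough illustration of the KMeans flavor of this idea, the sketch below bins a single feature with 1-D Lloyd's iterations and derives candidate split thresholds from the bin centers. All names and parameters here are hypothetical illustrations, not the paper's actual algorithm or implementation:

```python
import numpy as np

def kmeans_histogram(values, n_bins=16, n_iters=10, seed=0):
    """Approximate a feature's distribution with n_bins 1-D k-means bins.

    Returns (centers, counts): sorted bin centers and the number of
    samples assigned to each bin.
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Initialize centers from a random sample of the feature values.
    centers = np.sort(rng.choice(values, size=n_bins, replace=False))
    for _ in range(n_iters):
        # Assign each value to its nearest center (1-D Lloyd's step).
        assign = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        for k in range(n_bins):
            members = values[assign == k]
            if members.size:
                centers[k] = members.mean()
        centers.sort()
    # Final assignment against the converged, sorted centers.
    assign = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
    counts = np.bincount(assign, minlength=n_bins)
    return centers, counts

def candidate_splits(centers):
    # Midpoints between adjacent bin centers serve as split thresholds,
    # reducing split search from O(samples) to O(bins) per feature.
    return (centers[:-1] + centers[1:]) / 2.0
```

In a parallel setting, each worker would build such histograms over its data shard, and the histograms (not the raw samples) would be merged at the master, which is what keeps communication cost low.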
