Distributed tuning of machine learning algorithms using MapReduce Clusters

Obtaining the best accuracy in machine learning usually requires carefully tuning learning algorithm parameters for each problem. Parameter optimization is computationally challenging for learning methods with many hyperparameters. In this paper we show that MapReduce clusters are particularly well suited for parallel parameter optimization. We use MapReduce to optimize regularization parameters for boosted trees and random forests on several text problems: three retrieval ranking problems and a Wikipedia vandalism detection problem. We show how model accuracy improves as a function of the percentage of the parameter space explored, that accuracy can be hurt by exploring the parameter space too aggressively, and that there can be significant interactions between parameters that appear to be independent. Our results suggest that MapReduce is a two-edged sword: it makes parameter optimization feasible on a massive scale that would have been unimaginable just a few years ago, but it also creates a new opportunity for overfitting that can reduce accuracy and lead to inferior learning parameters.
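The parallelism the abstract describes amounts to a grid search in which each hyperparameter combination is an independent map task and the reduce step keeps the best validation score. Below is a minimal local sketch of that map-then-reduce structure, assuming Python's multiprocessing pool as a stand-in for a Hadoop cluster and scikit-learn's RandomForestClassifier in place of the paper's boosted-tree and random-forest implementations; the synthetic data set, grid values, and parameter names are illustrative assumptions, not taken from the paper.

```python
# Sketch: MapReduce-style parallel hyperparameter grid search.
# Assumptions (not from the paper): a local process pool stands in for a
# cluster, scikit-learn's RandomForestClassifier stands in for the paper's
# models, and the data/grid are synthetic and illustrative.
from itertools import product
from multiprocessing import Pool

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# One synthetic binary problem; the paper uses ranking and vandalism data.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.5, random_state=0)

# The hyperparameter grid: each combination becomes one "map" task.
GRID = list(product([50, 100, 200],   # n_estimators
                    [None, 5, 10],    # max_depth
                    [2, 10]))         # min_samples_leaf

def evaluate(params):
    """Map step: train one model, emit (validation accuracy, params)."""
    n_estimators, max_depth, min_samples_leaf = params
    model = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_leaf=min_samples_leaf,
        random_state=0,
    ).fit(X_train, y_train)
    return model.score(X_valid, y_valid), params

if __name__ == "__main__":
    # "Reduce" step: keep the best-scoring parameter combination.
    with Pool() as pool:
        results = pool.map(evaluate, GRID)
    best_score, best_params = max(results, key=lambda r: r[0])
    print(f"best validation accuracy {best_score:.3f} with {best_params}")
```

Note that the more combinations the pool evaluates, the more the winning validation score tends to overstate true accuracy, which is precisely the overfitting risk the abstract warns about; holding out a final test set that plays no part in the selection guards against it.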
