MLBench: Benchmarking Machine Learning Services Against Human Experts

Modern machine learning services and systems are complicated data systems: designing them is an art of balancing functionality, performance, and quality. Providing different levels of system support for functionalities such as automatic feature engineering, model selection and ensembling, and hyperparameter tuning can improve quality, but it also introduces additional cost and system complexity. In this paper, we aim to facilitate answering questions of the following type: How much do users lose if support for functionality x is removed from a machine learning service? Answering such questions with existing datasets, such as the UCI datasets, is challenging. The main contribution of this work is a novel dataset, MLBench, harvested from Kaggle competitions. Unlike existing datasets, MLBench contains not only the raw features for each machine learning task but also the features used by the winning teams of the corresponding Kaggle competitions. These winning features serve as a baseline of best human effort, enabling quality measures for machine learning services that existing datasets cannot support, such as relative ranking on Kaggle and relative accuracy compared with best-effort systems. We then conduct an empirical study using MLBench of example machine learning services from Amazon and Microsoft Azure, and showcase how MLBench enables a comparative study that reveals the strengths and weaknesses of these services quantitatively and systematically. The full version of this paper can be found at arxiv.org/abs/1707.09562.
