Sources of Variability in Large-scale Machine Learning Systems

We investigate sources of variability in a state-of-the-art distributed machine learning system for learning click and conversion prediction models for display advertising. We focus on three main sources of variability: asynchronous updates in the learning algorithm, downsampling of the data, and the non-deterministic order of examples received by each learning instance. We observe that some sources of variability can lead to significant differences between the models obtained and cause issues for, e.g., regression testing, debugging, and offline evaluation. We present effective solutions to stabilize the system and remove these sources of variability, which fully resolves the issues related to regression testing and debugging. Moreover, we discuss potential limitations of this stabilization when drawing conclusions from experiments; in such cases, the variability produced by the machine learning system may need to be taken into account, for example through confidence intervals.
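To make the example-ordering source of variability concrete, the sketch below is a minimal illustration (an assumption for exposition, not the production system described here): it trains a plain SGD logistic regression on the same synthetic data under two different example orderings and compares the resulting weights. The orderings alone produce different models, while re-running with a fixed order reproduces the model bit-for-bit, which is the essence of the stabilization discussed above.

```python
# Illustrative sketch only (assumed setup, not the paper's system):
# how the order in which examples are consumed changes the learned model,
# and how fixing that order removes this source of variability.
import numpy as np

def sgd_logistic(X, y, order, lr=0.1, epochs=5):
    """Plain SGD on the logistic loss, visiting examples in the given order each epoch."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in order:
            p = 1.0 / (1.0 + np.exp(-X[i] @ w))   # predicted probability for example i
            w -= lr * (p - y[i]) * X[i]           # single-example gradient step
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = (X @ rng.normal(size=20) + rng.normal(scale=0.5, size=1000) > 0).astype(float)

order_a = rng.permutation(len(y))
order_b = rng.permutation(len(y))

w_a = sgd_logistic(X, y, order_a)
w_b = sgd_logistic(X, y, order_b)
w_c = sgd_logistic(X, y, order_a)   # same order as w_a: identical model

print("different orders, weight gap:", np.linalg.norm(w_a - w_b))   # nonzero
print("same order, weight gap:     ", np.linalg.norm(w_a - w_c))    # exactly 0.0
```

Asynchronous (lock-free) updates and data downsampling introduce variability through the same basic mechanism: the sequence of effective gradient steps differs from run to run unless it is made deterministic.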
