Unexpected challenges in large scale machine learning

In machine learning, scale adds complexity. The most obvious consequence of scale is that data takes longer to process. Beyond a certain point, however, scale makes even trivial operations costly, forcing us to re-evaluate algorithms in light of the complexity of those operations. Here, we discuss one important way in which a general large-scale machine learning setting may differ from the standard supervised classification setting, and we present the results of preliminary experiments highlighting this difference. The results suggest that there is room for significant improvement beyond the obvious solutions.
