Scalability of Stochastic Gradient Descent based on "Smart" Sampling Techniques

Abstract: Various appealing ideas have recently been proposed in the statistical literature to scale up machine learning techniques and solve predictive/inferential problems on “Big Datasets”. Beyond the massively parallelized and distributed approaches exploiting hardware architectures and programming frameworks, which have received increasing interest in recent years, several variants of the Stochastic Gradient Descent (SGD) method based on “smart” sampling procedures have been designed to accelerate the model-fitting stage. Such techniques exploit either the form of the objective functional or some supposedly available auxiliary information, and have been thoroughly investigated from a theoretical viewpoint. Though attractive, these statistical methods must also be analyzed from a computational perspective, bearing in mind the options offered by recent technological advances. It is thus of vital importance to investigate how to implement these inferential principles efficiently so as to achieve the best trade-off between computational time and accuracy. In this paper, we explore the scalability of the SGD techniques introduced in [9,11] from an experimental perspective. Issues related to their implementation on distributed computing platforms such as Apache Spark are also discussed, and experimental results based on large-scale real datasets are presented to illustrate the relevance of the promoted approaches.
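To fix ideas, here is a minimal sketch of the principle shared by such “smart” sampling schemes: each SGD iteration draws a training example from a non-uniform distribution and rescales its gradient so that the update remains an unbiased estimate of the full-batch gradient (a Horvitz-Thompson-type reweighting). This is an illustrative NumPy sketch, not the implementation evaluated in the paper; the function names, the least-squares example and the heuristic choice of sampling probabilities (proportional to the norm of each example) are assumptions made here for the sake of the example.

import numpy as np

def sgd_smart_sampling(X, y, grad_fn, probs, n_iters=5000, lr=0.05, seed=0):
    # At each step, draw index i with probability probs[i] and rescale the
    # gradient by 1 / (n * probs[i]); the update is then an unbiased estimate
    # of the full-batch gradient (Horvitz-Thompson-type reweighting).
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(n_iters):
        i = rng.choice(n, p=probs)                      # non-uniform draw
        g = grad_fn(w, X[i], y[i]) / (n * probs[i])     # reweighted gradient
        w -= (lr / np.sqrt(t + 1.0)) * g                # decaying step size
    return w

def squared_loss_grad(w, x_i, y_i):
    # Gradient of 0.5 * (x_i . w - y_i)^2 with respect to w.
    return (x_i @ w - y_i) * x_i

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(2000, 5))
    y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=2000)
    probs = np.linalg.norm(X, axis=1)   # heuristic: sample proportionally to ||x_i||
    probs /= probs.sum()
    print(sgd_smart_sampling(X, y, squared_loss_grad, probs))

The probabilities used above are only one possible heuristic; the sampling designs studied in the paper are instead derived from the form of the objective functional or from auxiliary information, and implementing this draw-and-reweight step efficiently on a distributed platform such as Apache Spark is the kind of issue the paper examines.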

[1] Léon Bottou et al. Stochastic Gradient Descent Tricks. Neural Networks: Tricks of the Trade, 2012.

[2] Li Fei-Fei et al. ImageNet: A large-scale hierarchical image database. CVPR, 2009.

[3] Tong Zhang et al. Accelerating Minibatch Stochastic Gradient Descent using Stratified Sampling. ArXiv, 2014.

[4] Alan J. Lee et al. U-Statistics: Theory and Practice. 1990.

[5] Guillaume Papa et al. Optimal survey schemes for stochastic gradient descent with applications to M-estimation. ESAIM: Probability and Statistics, 2015.

[6] Ron Bekkerman et al. Scaling up Machine Learning. 2011.

[7] Stéphan Clémençon et al. Scaling up M-estimation via sampling designs: The Horvitz-Thompson stochastic gradient descent. 2014 IEEE International Conference on Big Data (Big Data), 2014.

[8] Amaury Habrard et al. Robustness and generalization for metric learning. Neurocomputing, 2012.

[9] Rong Jin et al. Regularized Distance Metric Learning: Theory and Algorithm. NIPS, 2009.

[10] Carlos Guestrin et al. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. 2012.

[11] Gonzalo Mateos et al. Distributed Sparse Linear Regression. IEEE Transactions on Signal Processing, 2010.

[12] Stéphan Clémençon et al. Maximal Deviations of Incomplete U-statistics with Applications to Empirical Risk Sampling. SDM, 2013.

[13] Michael I. Jordan. On statistics, computation and scalability. ArXiv, 2013.

[14] Stéphan Clémençon et al. Scaling-up Empirical Risk Minimization: Optimization of Incomplete U-statistics. J. Mach. Learn. Res., 2015.

[15] Michael J. Franklin et al. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI, 2012.

[16] H. Kushner et al. Stochastic Approximation and Recursive Algorithms and Applications. 2003.

[17] Marc Sebban et al. A Survey on Metric Learning for Feature Vectors and Structured Data. ArXiv, 2013.

[18] Nathan Halko et al. An Algorithm for the Principal Component Analysis of Large Data Sets. SIAM J. Sci. Comput., 2010.

[19] Pascal Bianchi et al. On-line learning gossip algorithm in multi-agent systems with local decision rules. 2013 IEEE International Conference on Big Data, 2013.

[20] Ohad Shamir et al. Optimal Distributed Online Prediction. ICML, 2011.

[21] David Mort. The Statistics. Sources of Non-Official UK Statistics, 2020.

[22] Alexander J. Smola et al. Parallelized Stochastic Gradient Descent. NIPS, 2010.

[23] Qiong Cao et al. Generalization bounds for metric and similarity learning. Machine Learning, 2012.

[24] Sanjay Ghemawat et al. MapReduce: Simplified Data Processing on Large Clusters. OSDI, 2004.

[25] Tong Zhang et al. Stochastic Optimization with Importance Sampling for Regularized Loss Minimization. ICML, 2014.

[26] Martin J. Wainwright et al. Dual Averaging for Distributed Optimization: Convergence Analysis and Network Scaling. IEEE Transactions on Automatic Control, 2010.

[27] Andrea Montanari et al. Message-passing algorithms for compressed sensing. Proceedings of the National Academy of Sciences, 2009.