Optimal survey schemes for stochastic gradient descent with applications to M-estimation

Iterative stochastic approximation methods are widely used to solve M-estimation problems, in particular in the context of predictive learning. In situations that will undoubtedly become more and more common in the Big Data era, the available datasets are so massive that computing statistics over the full sample is hardly feasible, if not infeasible. A natural and popular approach to gradient descent in this context consists in replacing the “full data” statistics with their counterparts based on randomly drawn subsamples of manageable size. The main purpose of this paper is to investigate the impact of survey sampling with unequal inclusion probabilities on stochastic gradient descent-based M-estimation methods. Precisely, we prove that, in the presence of some a priori information, one may significantly increase statistical accuracy, in terms of limit variance, by choosing appropriate first-order inclusion probabilities. These results are stated as asymptotic theorems and are supported by illustrative numerical experiments.
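To make the idea concrete, the following is a minimal sketch, not the paper's algorithm, of stochastic gradient descent in which each full-sample gradient is replaced by a Horvitz-Thompson estimate computed on a Poisson survey sample. It assumes a one-dimensional least-squares loss (so the M-estimator is the sample mean), a Poisson design (units included independently), and a step size of 1/t. The helper names `inclusion_probs`, `ht_gradient`, and `sgd_ht`, as well as the auxiliary `scores`, are hypothetical and chosen for illustration only.

```python
import random


def inclusion_probs(scores, expected_size):
    """Hypothetical helper: first-order inclusion probabilities proportional
    to positive auxiliary scores, scaled to a target expected sample size
    and capped at 1 (after capping, the expected size may be smaller)."""
    scale = expected_size / sum(scores)
    return [min(1.0, scale * s) for s in scores]


def ht_gradient(theta, data, probs, rng):
    """Horvitz-Thompson estimate of the full-sample gradient of the loss
    0.5 * (theta - x)^2 under Poisson sampling (an assumed design)."""
    total = 0.0
    for x, p in zip(data, probs):
        if rng.random() < p:           # unit included independently with prob. p
            total += (theta - x) / p   # weight each sampled gradient by 1 / p
    return total / len(data)


def sgd_ht(data, probs, steps=1000, seed=0):
    """SGD with step size 1/t driven by Horvitz-Thompson gradient estimates."""
    rng = random.Random(seed)
    theta = 0.0
    for t in range(1, steps + 1):
        theta -= ht_gradient(theta, data, probs, rng) / t
    return theta


if __name__ == "__main__":
    gen = random.Random(42)
    data = [gen.gauss(2.0, 1.0) for _ in range(500)]
    target = sum(data) / len(data)     # full-sample M-estimator (the mean)

    equal = [0.2] * len(data)          # equal inclusion probabilities
    # Assumed a priori information: a score correlated with gradient size.
    scores = [abs(x - 2.0) + 0.1 for x in data]
    unequal = inclusion_probs(scores, expected_size=100)

    print("full-sample mean:   ", target)
    print("SGD, equal probs:   ", sgd_ht(data, equal))
    print("SGD, unequal probs: ", sgd_ht(data, unequal))
```

Dividing each sampled gradient by its inclusion probability keeps the estimate unbiased for the full-sample gradient whatever the (strictly positive) probabilities are. The choice of probabilities therefore affects only the variance of the iterates, which is the quantity the abstract says can be reduced.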
