On the Optimality of Averaging in Distributed Statistical Learning

A common approach to statistical learning with big data is to randomly split the data among $m$ machines and learn the parameter of interest by averaging the $m$ individual estimates. In this paper, focusing on empirical risk minimization, or equivalently M-estimation, we study the statistical error incurred by this strategy. We consider two large-sample settings: first, a classical setting where the number of parameters $p$ is fixed and the number of samples per machine $n\to\infty$; second, a high-dimensional regime where both $p,n\to\infty$ with $p/n \to \kappa \in (0,1)$. For both regimes, under suitable assumptions, we present asymptotically exact expressions for this estimation error. In the fixed-$p$ setting we prove that, to leading order, averaging is as accurate as the centralized solution. We also derive the second-order error terms and show that these can be non-negligible, notably for nonlinear models. The high-dimensional setting, in contrast, exhibits qualitatively different behavior: data splitting incurs a first-order accuracy loss, which to leading order increases linearly with the number of machines. The dependence of our error approximations on the number of machines traces an interesting accuracy-complexity tradeoff, allowing the practitioner to make an informed choice on the number of machines to deploy. Finally, we confirm our theoretical analysis with several simulations.
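To make the split-and-average strategy concrete, the following is a minimal simulation sketch, not the paper's experimental setup: it uses ordinary least squares on a Gaussian linear model as the M-estimator (ERM under squared loss), and the problem sizes (N, m, p) and data-generating process are illustrative assumptions.

```python
# Split-and-average M-estimation vs. the centralized estimate,
# illustrated with OLS (ERM under squared loss). Sizes and the
# Gaussian linear model are illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(0)

N, m, p = 20000, 10, 20           # total samples, machines, parameters
n = N // m                        # samples per machine
theta_star = rng.normal(size=p)   # true parameter

X = rng.normal(size=(N, p))
y = X @ theta_star + rng.normal(size=N)

# Centralized M-estimate: fit on all N samples at once.
theta_central = np.linalg.lstsq(X, y, rcond=None)[0]

# Split-and-average: fit independently on each machine's n samples,
# then average the m local estimates.
local = [np.linalg.lstsq(X[i*n:(i+1)*n], y[i*n:(i+1)*n], rcond=None)[0]
         for i in range(m)]
theta_avg = np.mean(local, axis=0)

print("centralized error:", np.linalg.norm(theta_central - theta_star))
print("averaged error:   ", np.linalg.norm(theta_avg - theta_star))
```

In this fixed-$p$, large-$n$ regime the two errors should be close, consistent with the first-order equivalence described above; rerunning with $p/n$ non-negligible (e.g. $p = n/2$ per machine) should instead show the averaged estimator degrading roughly linearly in $m$.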
