Learning with Non-Convex Truncated Losses by SGD

Learning with a {\it convex loss} function has been the dominant paradigm for many years. Whether and how non-convex loss functions can improve the generalization of learning while remaining broadly applicable is still an open question. In this paper, we study a family of objective functions formed by truncating traditional loss functions, applicable to both shallow and deep learning. Truncated losses are potentially less vulnerable and more robust to large, possibly adversarial, noise in the observations. More importantly, truncation is a generic technique that requires no knowledge of the noise distribution. To justify non-convex learning with truncated losses, we establish excess risk bounds for empirical risk minimization with truncated losses under heavy-tailed outputs, as well as a statistical error bound for an approximate stationary point found by stochastic gradient descent (SGD). Experiments on shallow and deep regression with outliers, corrupted data, and heavy-tailed noise further support the proposed method.
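For concreteness, the sketch below (Python/NumPy) illustrates one way a truncated loss can be minimized with SGD. The specific truncation map $\phi(t) = \log(1 + t + t^2/2)$, the scale parameter alpha, the step size, and the data-generating model are illustrative assumptions, not the paper's exact construction.

```python
# Minimal sketch (assumed form, not the paper's exact construction): truncate
# the squared loss with a Catoni-style nondecreasing map phi(t) = log(1 + t + t^2/2),
# so the per-example objective alpha * phi(ell / alpha) grows only logarithmically
# for large residuals, damping the influence of outliers and heavy-tailed noise.
import numpy as np

def truncated_sq_loss_grad(w, x, y, alpha=1.0):
    """Gradient of alpha * phi(ell(w)/alpha) with ell = 0.5 * (x @ w - y)^2."""
    r = x @ w - y                                  # residual
    ell = 0.5 * r ** 2                             # plain squared loss
    t = ell / alpha
    dphi = (1.0 + t) / (1.0 + t + 0.5 * t ** 2)    # phi'(t)
    return dphi * r * x                            # chain rule: phi'(ell/alpha) * grad(ell)

def sgd(X, y, alpha=1.0, lr=0.05, epochs=50, seed=0):
    """Plain SGD on the (non-convex) truncated objective."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):               # one stochastic gradient step per example
            w -= lr * truncated_sq_loss_grad(w, X[i], y[i], alpha)
    return w

# Usage: linear model with heavy-tailed noise; the truncated loss keeps the
# estimate close to the true weights despite the extreme residuals.
rng = np.random.default_rng(1)
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(500, 2))
y = X @ w_true + rng.standard_t(df=1.5, size=500)  # heavy-tailed noise
print(sgd(X, y))
```

Because $\phi'(t) \to 0$ as the loss grows, each gradient step automatically downweights grossly corrupted examples, which is the intuition behind the robustness of truncated losses; the choice of alpha trades off robustness against bias on clean data.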
