Robust Regression via Model Based Methods

The mean squared error loss is widely used in applications such as autoencoders, multi-target regression, and matrix factorization. Although its differentiability makes it computationally convenient, it is not robust to outliers. In contrast, ℓp norms are known to be robust, but they are non-differentiable and so cannot be optimized directly via, e.g., stochastic gradient descent. We propose an algorithm inspired by so-called model-based optimization (MBO) [35, 36], which replaces a non-convex objective with a convex model function and alternates between optimizing the model function and updating the solution. We apply this to robust regression, proposing SADM, a stochastic variant of the Online Alternating Direction Method of Multipliers (OADM) [48], to solve the inner optimization in MBO. We show that SADM converges at a rate of O(log T/T). Finally, we demonstrate experimentally (a) the robustness of ℓp norms to outliers and (b) the efficiency of our proposed model-based algorithms in comparison with gradient methods on autoencoders and multi-target regression.
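The robustness claim can be illustrated with a toy location-estimation problem. The sketch below is not the paper's SADM or MBO algorithm; it only assumes the standard facts that the sample mean minimizes the sum of squared errors and that a sample median minimizes the sum of absolute (ℓ1) errors. The data values and function names are hypothetical, chosen for illustration.

```python
import numpy as np


def squared_loss_minimizer(y):
    # argmin over c of sum_i (y_i - c)^2 is the sample mean.
    return float(np.mean(y))


def l1_loss_minimizer(y):
    # argmin over c of sum_i |y_i - c| is a sample median.
    return float(np.median(y))


# Hypothetical toy data: five clean observations.
clean = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Corrupt a single observation with a gross outlier.
corrupted = clean.copy()
corrupted[-1] = 500.0

# How far does each estimate move when the outlier is introduced?
mean_shift = abs(squared_loss_minimizer(corrupted) - squared_loss_minimizer(clean))
median_shift = abs(l1_loss_minimizer(corrupted) - l1_loss_minimizer(clean))

print("shift of squared-loss estimate:", mean_shift)
print("shift of l1-loss estimate:", median_shift)
```

In this example a single corrupted point moves the squared-loss estimate from 3 to 102, while the ℓ1 estimate stays at 3. The closed-form minimizers used here exist only in this one-dimensional setting, which is why general ℓp objectives call for a dedicated optimization scheme.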

[1]  Feiping Nie,et al.  Efficient and Robust Feature Selection via Joint ℓ2, 1-Norms Minimization , 2010, NIPS.

[2]  Anders P. Eriksson,et al.  Efficient computation of robust low-rank matrix approximations in the presence of missing data using the L1 norm , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[3]  Nojun Kwak,et al.  Principal Component Analysis Based on L1-Norm Maximization , 2008, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[4]  Korris Fu-Lai Chung,et al.  The l2, 1-Norm Stacked Robust Autoencoders for Domain Adaptation , 2016, AAAI.

[5]  Arindam Banerjee,et al.  Online Alternating Direction Method (longer version) , 2013, ArXiv.

[6]  Xiaoming Yuan,et al.  Recovering Low-Rank and Sparse Components of Matrices from Incomplete and Noisy Observations , 2011, SIAM J. Optim..

[7]  Ana de Almeida,et al.  Nonnegative Matrix Factorization , 2018 .

[8]  Eyke Hüllermeier,et al.  Multi-target prediction: a unifying view on problems and methods , 2018, Data Mining and Knowledge Discovery.

[9]  Mikael Johansson,et al.  Convergence of a Stochastic Gradient Method with Momentum for Nonsmooth Nonconvex Optimization , 2020, ICML.

[10]  Angshul Majumdar,et al.  Stacked Robust Autoencoder for Classification , 2016, ICONIP.

[11]  Arindam Banerjee,et al.  Online Alternating Direction Method , 2012, ICML.

[12]  Jieping Ye,et al.  Efficient L1/Lq Norm Regularization , 2010, ArXiv.

[13]  Grigorios Tsoumakas,et al.  Multi-target regression via input space expansion: treating targets as inputs , 2012, Machine Learning.

[14]  John Wright,et al.  RASL: Robust alignment by sparse and low-rank decomposition for linearly correlated images , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[15]  Peter Filzmoser,et al.  Robust Factorization of a Data Matrix , 1998, COMPSTAT.

[16]  Jérôme Idier,et al.  Algorithms for Nonnegative Matrix Factorization with the β-Divergence , 2010, Neural Computation.

[17]  Mehran Mesbahi,et al.  Online distributed ADMM via dual averaging , 2014, 53rd IEEE Conference on Decision and Control.

[18]  P. Paatero,et al.  Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values , 1994 .

[19]  Alexander Shapiro,et al.  Stochastic Approximation approach to Stochastic Programming , 2013 .

[20]  Chris H. Q. Ding,et al.  Robust nonnegative matrix factorization using L21-norm , 2011, CIKM '11.

[21]  Dmitriy Drusvyatskiy,et al.  Efficiency of minimizing compositions of convex functions and smooth maps , 2016, Math. Program..

[22]  Taiji Suzuki Dual Averaging and Proximal Gradient Descent for Online Alternating Direction Multiplier Method , 2013 .

[23]  C. Michelot A finite algorithm for finding the projection of a point onto the canonical simplex of ℝ^n , 1986 .

[24]  Stephen P. Boyd,et al.  Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers , 2011, Found. Trends Mach. Learn..

[25]  Grigorios Tsoumakas,et al.  Multi-target regression via input space expansion: treating targets as inputs , 2012, Machine Learning.

[26]  Lei Shi,et al.  Robust Multiple Kernel K-means Using L21-Norm , 2015, IJCAI.

[27]  Pascal Vincent,et al.  Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion , 2010, J. Mach. Learn. Res..

[28]  Dmitriy Drusvyatskiy,et al.  Stochastic model-based minimization of weakly convex functions , 2018, SIAM J. Optim..

[29]  Mike E. Davies,et al.  Iterative Hard Thresholding for Compressed Sensing , 2008, ArXiv.

[30]  Hédy Attouch,et al.  Proximal Alternating Minimization and Projection Methods for Nonconvex Problems: An Approach Based on the Kurdyka-Lojasiewicz Inequality , 2008, Math. Oper. Res..

[31]  Stratis Ioannidis,et al.  Massively Distributed Graph Distances , 2020, IEEE Transactions on Signal and Information Processing over Networks.

[32]  Jean-Philippe Vial,et al.  Strong and Weak Convexity of Sets and Functions , 1983, Math. Oper. Res..

[33]  Mohamed-Jalal Fadili,et al.  Non-smooth Non-convex Bregman Minimization: Unification and New Algorithms , 2017, Journal of Optimization Theory and Applications.

[34]  Feng Ruan,et al.  Stochastic Methods for Composite and Weakly Convex Optimization Problems , 2017, SIAM J. Optim..

[35]  Alexander G. Gray,et al.  Stochastic Alternating Direction Method of Multipliers , 2013, ICML.

[36]  Scott Pesme,et al.  Online Robust Regression via SGD on the l1 loss , 2020, NeurIPS.

[37]  Stephen J. Wright,et al.  A proximal method for composite minimization , 2008, Mathematical Programming.

[38]  ChengXiang Zhai,et al.  Robust Unsupervised Feature Selection , 2013, IJCAI.

[39]  Peter Ochs,et al.  Model Function Based Conditional Gradient Method with Armijo-like Line Search , 2019, ICML.

[40]  Yuanyuan Liu,et al.  Accelerated Variance Reduced Stochastic ADMM , 2017, AAAI.

[41]  Damek Davis,et al.  Proximally Guided Stochastic Subgradient Method for Nonsmooth, Nonconvex Problems , 2017, SIAM J. Optim..

[42]  Nicolas Gillis,et al.  Inertial Block Proximal Methods for Non-Convex Non-Smooth Optimization , 2019, ICML.

[43]  Philippe C. Besse,et al.  A L1-norm PCA and a Heuristic Approach , 1996 .

[44]  Shai Shalev-Shwartz,et al.  Online Learning and Online Convex Optimization , 2012, Found. Trends Mach. Learn..

[45]  James T. Kwok,et al.  Fast-and-Light Stochastic ADMM , 2016, IJCAI.

[46]  Chris H. Q. Ding,et al.  R1-PCA: rotational invariant L1-norm principal component analysis for robust subspace factorization , 2006, ICML.

[47]  Balas K. Natarajan,et al.  Sparse Approximate Solutions to Linear Systems , 1995, SIAM J. Comput..

[48]  Nicolas Gillis Nonnegative Matrix Factorization , 2020 .

[49]  Xuelong Li,et al.  L1-Norm-Based 2DPCA , 2010, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).

[50]  John Wright,et al.  RASL: Robust Alignment by Sparse and Low-Rank Decomposition for Linearly Correlated Images , 2012, IEEE Trans. Pattern Anal. Mach. Intell..

[51]  Dmitriy Drusvyatskiy,et al.  Error Bounds, Quadratic Growth, and Linear Convergence of Proximal Methods , 2016, Math. Oper. Res..