Sample-based online learning for bi-regular hinge loss

The support vector machine (SVM), a state-of-the-art classifier for supervised classification tasks, is known for the strong generalization guarantees that follow from its max-margin property. In this paper, we focus on the maximum-margin classification problem cast by the SVM and study the bi-regular hinge loss model, which not only performs feature selection but also tends to select highly correlated features together. To solve this model, we propose an online learning algorithm that addresses the resulting non-smooth minimization problem through an alternating iterative mechanism: at each iteration, it alternates between detecting intrusion samples (those that violate the margin) and an optimization step that admits a closed-form solution. In theory, we prove that the proposed algorithm achieves an $$O(1/\sqrt{T})$$ convergence rate under mild conditions, where T is the number of training samples received during online learning. Experimental results on synthetic data and benchmark datasets demonstrate the effectiveness of our approach in comparison with several popular algorithms, such as LIBSVM, SGD, PEGASOS, and SVRG.
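For concreteness, a plausible form of the bi-regular hinge loss model (assuming, as in the doubly regularized SVM, that "bi-regular" denotes an elastic-net penalty with weights $$\lambda_1, \lambda_2 > 0$$) is

$$\min_{w,\,b} \; \frac{1}{T} \sum_{t=1}^{T} \max\bigl(0,\; 1 - y_t (w^{\top} x_t + b)\bigr) \;+\; \lambda_1 \lVert w \rVert_1 \;+\; \frac{\lambda_2}{2} \lVert w \rVert_2^2,$$

where the $$\ell_1$$ term drives feature selection and the $$\ell_2$$ term encourages highly correlated features to be selected together.

The following is a minimal sketch, not the paper's exact algorithm, of an online update consistent with the description above: a hinge-loss subgradient step on each margin-violating sample, followed by the closed-form proximal step for the elastic-net penalty. All names, step sizes, and defaults are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, tau):
    """Soft-thresholding: the closed-form proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def online_biregular_hinge(stream, dim, lam1=1e-3, lam2=1e-3):
    """Online proximal subgradient sketch for the elastic-net hinge loss.

    `stream` yields (x_t, y_t) pairs with y_t in {-1, +1} (hypothetical interface).
    """
    w, b = np.zeros(dim), 0.0
    for t, (x, y) in enumerate(stream, start=1):
        eta = 1.0 / np.sqrt(t)       # decaying step size, matching the O(1/sqrt(T)) rate
        if y * (w @ x + b) < 1.0:    # intrusion sample: it violates the margin
            w += eta * y * x         # subgradient step on the hinge loss
            b += eta * y
        # Closed-form proximal step for eta * (lam1*||w||_1 + (lam2/2)*||w||_2^2).
        w = soft_threshold(w, eta * lam1) / (1.0 + eta * lam2)
    return w, b
```

A quick usage example on synthetic data, where the labels depend on only the first two features, so the learned weight vector should be sparse and concentrated on them:

```python
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = np.where(X[:, 0] + X[:, 1] > 0.0, 1.0, -1.0)
w, b = online_biregular_hinge(zip(X, y), dim=20)
```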
