Superfast-Trainable Multi-Class Probabilistic Classifier by Least-Squares Posterior Fitting

Kernel logistic regression (KLR) is a powerful and flexible classification algorithm that can also report the confidence of its class predictions. However, its training, typically carried out by (quasi-)Newton methods, is rather time-consuming. In this paper, we propose an alternative probabilistic classification algorithm called the Least-Squares Probabilistic Classifier (LSPC). KLR models the class-posterior probability by a log-linear combination of kernel functions, and its parameters are learned by (regularized) maximum likelihood. In contrast, LSPC employs a linear combination of kernel functions, and its parameters are learned by regularized least-squares fitting of the true class-posterior probability. Thanks to this linear regularized least-squares formulation, the solution of LSPC can be computed analytically just by solving a regularized system of linear equations in a class-wise manner. Thus LSPC is computationally very efficient and numerically stable. Through experiments, we show that LSPC is computationally faster than KLR by orders of magnitude, with comparable classification accuracy.
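To make the class-wise linear-system idea concrete, the following is a minimal sketch in Python with NumPy, not the authors' implementation. Several details are assumptions that the abstract does not state: the Gaussian kernel centered at every training point, the names `gaussian_kernel`, `lspc_fit`, and `lspc_predict_proba`, the parameters `sigma` and `rho`, and the post-processing step that clips negative outputs to zero and renormalizes across classes.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    # Pairwise Gaussian kernel values between rows of A and rows of B.
    # (Assumption: the abstract only says "kernel functions".)
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * sigma**2))

def lspc_fit(X, y, n_classes, sigma=1.0, rho=0.1):
    # Linear-in-parameters model: posterior(y|x) ~ sum_l alpha[l, y] * K(x, x_l).
    n = X.shape[0]
    Phi = gaussian_kernel(X, X, sigma)   # (n, n) basis matrix
    H = Phi.T @ Phi / n                  # shared across all classes
    Y = np.eye(n_classes)[y]             # one-hot labels, (n, n_classes)
    h = Phi.T @ Y / n                    # one right-hand side per class
    # One regularized linear system per class (column), solved together.
    return np.linalg.solve(H + rho * np.eye(n), h)

def lspc_predict_proba(X_train, alpha, X_new, sigma=1.0):
    scores = gaussian_kernel(X_new, X_train, sigma) @ alpha
    # Assumed post-processing: clip negatives, then normalize over classes.
    scores = np.maximum(scores, 0.0)
    total = scores.sum(1, keepdims=True)
    safe_total = np.where(total > 0, total, 1.0)
    # Fall back to a uniform distribution if every class score is zero.
    return np.where(total > 0, scores / safe_total, 1.0 / alpha.shape[1])
```

In this sketch the matrix `H + rho * I` is the same for every class, so only the right-hand side changes from class to class. No iterative optimizer is involved, which is where a speed advantage over Newton-type KLR training would come from.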
