A Random-Feature Based Newton Method for Empirical Risk Minimization in Reproducing Kernel Hilbert Space

In supervised learning with kernel methods, one encounters a large-scale finite-sum minimization over a reproducing kernel Hilbert space (RKHS). Large-scale finite-sum problems can often be solved efficiently with variants of Newton's method in which the Hessian is approximated via sub-samples of the data. In an RKHS, however, the dependence of the penalty function on the kernel makes standard sub-sampling approaches inapplicable, since the Gram matrix is not readily available in low-rank form. In this paper, we observe that for this class of problems one can naturally use kernel approximation to speed up Newton's method. Focusing on randomized features for kernel approximation, we provide a novel second-order algorithm that enjoys local superlinear convergence and global convergence, both with high probability. The key to our analysis is showing that the Hessian approximated via random features preserves the spectrum of the original Hessian. We present numerical experiments verifying the efficiency of our approach compared to variants of sub-sampled Newton methods.
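
To make the idea concrete, below is a minimal sketch of a Newton step computed in a random-feature approximation of the RKHS, assuming an RBF kernel approximated by random Fourier features (Rahimi and Recht) and an L2-regularized logistic loss; the function names, parameters, and defaults (e.g. `make_rff_map`, `n_features`, `gamma`, `lam`) are illustrative choices, not the paper's exact algorithm. The point of the sketch is that after mapping the data to D random features, the Hessian is a D x D matrix, so an exact Newton step costs O(nD^2 + D^3) and never touches the full n x n Gram matrix.

```python
import numpy as np

def make_rff_map(d, n_features, gamma, rng):
    """Random Fourier feature map approximating the RBF kernel
    k(x, y) = exp(-gamma * ||x - y||^2)."""
    # Frequencies drawn from the Fourier transform of the RBF kernel.
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)

    def phi(X):
        # Each row of the output is the D-dimensional feature vector of one sample.
        return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

    return phi

def rff_newton_logistic(X, y, n_features=200, gamma=1.0, lam=1e-3,
                        n_iter=20, seed=0):
    """Newton's method for L2-regularized logistic regression (labels in {0, 1})
    in the random-feature space: every Hessian is D x D rather than n x n."""
    rng = np.random.default_rng(seed)
    phi = make_rff_map(X.shape[1], n_features, gamma, rng)
    Z = phi(X)                                    # n x D feature matrix
    n, D = Z.shape
    w = np.zeros(D)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Z @ w))          # predicted probabilities
        grad = Z.T @ (p - y) / n + lam * w        # gradient of the ERM objective
        S = p * (1.0 - p)                         # per-sample logistic curvature
        H = (Z * S[:, None]).T @ Z / n + lam * np.eye(D)  # approximated Hessian
        w -= np.linalg.solve(H, grad)             # exact Newton step in feature space
    return w, phi
```

For instance, on data `X` with binary labels `y` one might call `w, phi = rff_newton_logistic(X, y, n_features=300)` and predict with `phi(X_test) @ w > 0`. The ridge term `lam * np.eye(D)` keeps the approximated Hessian well conditioned even when the number of random features exceeds the effective rank of the data.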
