A kernel-expanded stochastic neural network

Deep neural networks suffer from several fundamental issues in machine learning; for example, training often gets trapped in a local minimum, and prediction uncertainty is hard to assess. To address these issues, we propose the kernel-expanded stochastic neural network (K-StoNet) model, which incorporates support vector regression (SVR) as the first hidden layer and reformulates the neural network as a latent variable model. The former maps the input vector into an infinite-dimensional feature space via a radial basis function (RBF) kernel, ensuring the absence of local minima on its training loss surface. The latter breaks the high-dimensional nonconvex neural network training problem into a series of low-dimensional convex optimization problems and makes the prediction uncertainty easy to assess. K-StoNet can be trained straightforwardly using the imputation-regularized optimization (IRO) algorithm. Compared to traditional deep neural networks, K-StoNet comes with a theoretical guarantee of asymptotic convergence to the global optimum and allows its prediction uncertainty to be assessed easily. The performance of the new model in training, prediction, and uncertainty quantification is illustrated with simulated and real data examples.
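To make the architecture and the alternating training scheme concrete, the snippet below gives a minimal, illustrative sketch, not the authors' implementation. It assumes scikit-learn's SVR and Ridge as stand-ins for the kernel-expanded first layer and the convex output regression, and it replaces the paper's sampling-based imputation of the latent variables with a crude Gaussian perturbation; all hyperparameters and the tanh link are hypothetical choices made only for illustration.

```python
# Minimal K-StoNet-style sketch: one SVR per hidden unit as the first layer,
# a ridge regression as the output layer, and an alternating loop that mimics
# the IRO algorithm's imputation (I) and regularized-optimization (RO) steps.
import numpy as np
from sklearn.svm import SVR
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Toy regression data (hypothetical, for illustration only).
n, p, H = 200, 5, 3                      # samples, inputs, hidden units
X = rng.normal(size=(n, p))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + 0.1 * rng.normal(size=n)

# Kernel-expanded first layer: H support vector regressions with RBF kernels.
svrs = [SVR(kernel="rbf", C=1.0, epsilon=0.01) for _ in range(H)]
out = Ridge(alpha=1.0)                   # output layer: low-dimensional convex fit

# Initialize the latent hidden values, then alternate I- and RO-steps.
Z = rng.normal(size=(n, H))
for it in range(10):
    # RO-step: each sub-problem is a separate convex optimization.
    for h in range(H):
        svrs[h].fit(X, Z[:, h])          # input -> latent value of hidden unit h
    out.fit(np.tanh(Z), y)               # hidden layer -> response

    # I-step (crude stand-in for the paper's latent-variable sampling):
    # perturb the current hidden outputs with Gaussian noise.
    hidden = np.column_stack([s.predict(X) for s in svrs])
    Z = hidden + 0.1 * rng.normal(size=hidden.shape)

# Prediction: a deterministic feed-forward pass through the fitted layers.
hidden = np.column_stack([s.predict(X) for s in svrs])
pred = out.predict(np.tanh(hidden))
print("training RMSE:", np.sqrt(np.mean((pred - y) ** 2)))
```

In the full method the I-step draws the latent variables from their conditional distribution given both the input and the response, which is what yields the uncertainty quantification described above; the noise injection here only marks where that step would go.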
