Multiple Kernel Learning from $U$-Statistics of Empirical Measures in the Feature Space

We propose a novel data-driven method to learn multiple kernels in kernel methods of statistical machine learning from training samples. The proposed kernel learning algorithm is based on a $U$-statistic of the empirical marginal distributions of features in the feature space given their class labels. We prove the consistency of the $U$-statistic estimate computed from the empirical distributions for kernel learning. In particular, we show that the empirical estimate of the $U$-statistic converges to its population value for all admissible distributions as the number of training samples increases. We also prove the sample optimality of the estimate by establishing a minimax lower bound via Fano's method. In addition, we establish generalization bounds for the proposed kernel learning approach by computing novel upper bounds on the Rademacher and Gaussian complexities using concentration of measure for quadratic matrix forms. We apply the proposed kernel learning approach to the classification of real-world data-sets using the kernel SVM and compare the results with $5$-fold cross-validation for the kernel model selection problem. We also apply the proposed kernel learning approach to devise novel architectures for the semantic segmentation of biomedical images. The proposed segmentation networks are suited for training on small data-sets and employ new mechanisms to generate representations from input images.
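To make the idea concrete, the following is a minimal sketch, not the authors' exact algorithm: it scores each candidate kernel by an unbiased $U$-statistic estimate of the squared maximum mean discrepancy between the class-conditional feature distributions, normalizes the scores into mixture weights for a combined kernel, and trains an SVM on the resulting Gram matrix. The choice of candidate kernels, the weighting rule, and all function and parameter names below are illustrative assumptions.

```python
# Illustrative sketch of kernel learning from a U-statistic of class-conditional
# empirical feature distributions (assumed weighting scheme, not the paper's method).
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel
from sklearn.svm import SVC

def mmd2_u_statistic(K, idx_pos, idx_neg):
    """Unbiased U-statistic estimate of MMD^2 between two groups of samples,
    computed from a precomputed Gram matrix K."""
    Kpp = K[np.ix_(idx_pos, idx_pos)]
    Knn = K[np.ix_(idx_neg, idx_neg)]
    Kpn = K[np.ix_(idx_pos, idx_neg)]
    m, n = len(idx_pos), len(idx_neg)
    # Averaging only off-diagonal entries gives the unbiased (U-statistic) form.
    term_pp = (Kpp.sum() - np.trace(Kpp)) / (m * (m - 1))
    term_nn = (Knn.sum() - np.trace(Knn)) / (n * (n - 1))
    return term_pp + term_nn - 2.0 * Kpn.mean()

def learn_kernel_weights(X, y, kernel_fns):
    """Weight each candidate kernel by its class-separation U-statistic."""
    idx_pos = np.where(y == 1)[0]
    idx_neg = np.where(y == -1)[0]
    scores = np.array([max(mmd2_u_statistic(k(X, X), idx_pos, idx_neg), 0.0)
                       for k in kernel_fns])
    total = scores.sum()
    return scores / total if total > 0 else np.full(len(scores), 1.0 / len(scores))

# Usage with synthetic data and two candidate kernels (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 5)), rng.normal(1.5, 1.0, (50, 5))])
y = np.concatenate([-np.ones(50), np.ones(50)])
kernels = [lambda A, B: rbf_kernel(A, B, gamma=0.5),
           lambda A, B: polynomial_kernel(A, B, degree=2)]
weights = learn_kernel_weights(X, y, kernels)
K_combined = sum(w * k(X, X) for w, k in zip(weights, kernels))
clf = SVC(kernel="precomputed").fit(K_combined, y)
print("kernel weights:", weights, "train accuracy:", clf.score(K_combined, y))
```

In contrast to $5$-fold cross-validation, which refits the SVM for every candidate kernel, a weighting of this kind is computed in a single pass over the Gram matrices before any classifier is trained.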
