Spectral properties of the kernel matrix and their relation to kernel methods in machine learning

This chapter serves as a brief introduction to the supervised learning setting and kernel methods. Moreover, several results from linear algebra, probability theory, and functional analysis are reviewed which will be used throughout the thesis.

2.1 Some notational conventions

We begin by introducing some basic notational conventions. The sets N, Z, R, C denote the natural, integer, real, and complex numbers, respectively. Vectors will be denoted by lowercase letters, matrices by bold uppercase letters, and random variables by uppercase letters. The individual entries of vectors and matrices are denoted by square brackets: x ∈ R^n is a vector with coefficients [x]_i, and the matrix A has entries [A]_{ij}. Vector and matrix transpose is denoted by x^T. Sometimes, the set of square n × n matrices is denoted by M_n, and the set of general n × m matrices by M_{n,m}. The set of eigenvalues of a square matrix A is denoted by λ(A). For a symmetric n × n matrix A, we will always assume that the eigenvalues and eigenvectors are sorted in non-increasing order, with eigenvalues repeated according to their multiplicity. The eigenvalues of A are thus λ_1(A) ≥ ... ≥ λ_n(A).

We use the following standard norms on finite-dimensional vector spaces. Let x ∈ R^n and A ∈ M_n. Then,

    \|x\| = \sqrt{\sum_{i=1}^{n} [x]_i^2}, \qquad \|A\| = \max_{x : \|x\| \neq 0} \frac{\|Ax\|}{\|x\|}.    (2.1)

A useful upper bound on ‖A‖ is given by

    \|A\| \leq n \max_{1 \leq i,j \leq n} |[A]_{ij}|.    (2.2)

Another matrix norm we will encounter is the Frobenius norm

    \|A\|_F = \sqrt{\sum_{i,j=1}^{n} [A]_{ij}^2}.
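To make these conventions concrete, the following short sketch checks the definitions in (2.1) and (2.2), the Frobenius norm, and the non-increasing eigenvalue ordering on a small symmetric matrix. NumPy and the random test matrix are illustrative assumptions for this sketch only; nothing in the chapter depends on this implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# A random symmetric n x n matrix A and a random vector x (illustrative data).
B = rng.standard_normal((n, n))
A = (B + B.T) / 2
x = rng.standard_normal(n)

# Euclidean norm of x and operator norm of A, as defined in (2.1).
x_norm = np.sqrt(np.sum(x**2))        # same value as np.linalg.norm(x)
A_norm = np.linalg.norm(A, ord=2)     # max of ||Ax|| / ||x|| over x != 0

# The entrywise bound (2.2): ||A|| <= n * max_ij |[A]_ij|.
assert A_norm <= n * np.max(np.abs(A)) + 1e-12

# Frobenius norm: square root of the sum of squared entries.
A_fro = np.sqrt(np.sum(A**2))         # same value as np.linalg.norm(A, 'fro')

# Eigenvalues of the symmetric A, sorted non-increasingly:
# lambda_1(A) >= ... >= lambda_n(A), matching the convention above.
eigvals, eigvecs = np.linalg.eigh(A)  # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

print(x_norm, A_norm, A_fro, eigvals)
```

Note that np.linalg.eigh returns eigenvalues in ascending order, so the explicit reordering step is what enforces the non-increasing convention used throughout the thesis.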
