Kernel Fisher Discriminants

This thesis considers statistical learning problems and machines. A statistical learning machine tries to infer rules from a given set of examples so that it can make correct predictions on unseen examples. These predictions can be, for example, classifications or regressions. We consider the class of kernel-based learning techniques. The main contributions of this work can be summarized as follows.

Building upon the theory of reproducing kernels, we propose a number of new learning algorithms based on the maximization of a Rayleigh coefficient in a kernel feature space. We exemplify this for oriented (kernel) PCA and especially for Fisher’s discriminant, yielding kernel Fisher discriminants (KFD). Furthermore, we show that KFD is intimately related to quadratic and linear optimization. Building upon this connection, we propose several ways to deal with the optimization problems arising in kernel-based methods, and in KFD in particular. This mathematical programming formulation is the starting point for deriving several important variants of KFD, namely robust KFD, sparse KFD and linear KFD, and we discuss several algorithms for solving the resulting optimization problems. As a final consequence of the mathematical programming formulation, we are able to relate KFD to other techniques such as support vector machines, relevance vector machines and Arc-GV. Through a structural comparison of the underlying optimization problems, we illustrate that many modern learning techniques, including KFD, are highly similar.

In a separate chapter we present first results on learning guarantees for eigenvalues and eigenvectors estimated from covariance matrices. We show that, under some mild assumptions, empirical eigenvalues estimated from a finite sample are with high probability close to the expected eigenvalues. For eigenvectors we show that, again with high probability, an empirical eigenvector is close to an eigenvector of the underlying distribution.

In a large collection of experiments we demonstrate that KFD and the variants proposed here are capable of producing state-of-the-art results. We compare KFD to techniques such as AdaBoost and support vector machines, carefully discussing its advantages and also its difficulties.
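
To make the idea of maximizing a Rayleigh coefficient in a kernel feature space concrete, the following is a minimal sketch of one common formulation of two-class KFD, not necessarily the exact algorithm developed in the thesis. The expansion coefficients alpha maximize (alpha' M alpha) / (alpha' N alpha), where M is built from the difference of the kernelized class means and N is the kernelized within-class scatter. The function names (rbf_kernel, fit_kfd, project), the Gaussian kernel, the ridge term reg added to N, and the midpoint threshold are all assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Gaussian kernel matrix between the rows of A and the rows of B.
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def fit_kfd(X, y, gamma=1.0, reg=1e-3):
    """Two-class kernel Fisher discriminant; y must hold labels 0 and 1."""
    K = rbf_kernel(X, X, gamma)
    n = len(y)
    N = np.zeros((n, n))
    means = []
    for c in (0, 1):
        Kc = K[:, y == c]                      # kernel columns of class c
        lc = Kc.shape[1]
        means.append(Kc.mean(axis=1))          # kernelized class mean
        # Within-class scatter of class c, expressed through the kernel.
        N += Kc @ (np.eye(lc) - np.full((lc, lc), 1.0 / lc)) @ Kc.T
    N += reg * np.eye(n)                       # assumed ridge regularizer
    # The Rayleigh coefficient is maximized by alpha ~ N^{-1} (m1 - m0).
    alpha = np.linalg.solve(N, means[1] - means[0])
    # Assumed threshold: midpoint between the projected class means.
    b = -0.5 * (alpha @ means[0] + alpha @ means[1])
    return alpha, b

def project(X_train, alpha, b, X_new, gamma=1.0):
    # Projection onto the discriminant; positive values suggest class 1.
    return rbf_kernel(X_new, X_train, gamma) @ alpha + b
```

A typical use would be alpha, b = fit_kfd(X, y) followed by project(X, alpha, b, X_new), with the sign of the result taken as the predicted class. Solving one linear system suffices here because M has rank one in the two-class case, so no general eigenvalue solver is needed.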
