Kernel Methods in Machine Learning

We review machine learning methods employing positive definite kernels. These methods formulate learning and estimation problems in a reproducing kernel Hilbert space (RKHS) of functions defined on the data domain, expanded in terms of a kernel. Working in linear spaces of functions has the benefit of facilitating the construction and analysis of learning algorithms while at the same time allowing large classes of functions. The latter include nonlinear functions as well as functions defined on nonvectorial data. We cover a wide range of methods, from binary classifiers to sophisticated methods for estimation with structured data.

1. Introduction. Over the last ten years, estimation and learning methods utilizing positive definite kernels have become rather popular, particularly in machine learning. Since these methods have a stronger mathematical slant than earlier machine learning methods (e.g., neural networks), there is also significant interest in these methods in the statistics and mathematics communities. The present review aims to summarize the state of the art on a conceptual level. In doing so, we build on various sources, including Burges, but we also add a fair amount of more recent material, which helps unify the exposition. We have not had space to include proofs; they can be found either in the long version of the present paper (see Hofmann et al. [69]), in the references given or in the above books. The main idea of all the described methods can be summarized in one paragraph. Traditionally, theory and algorithms of machine learning and

[1]  J. Mercer Functions of positive and negative type, and their connection with the theory of integral equations , 1909 .

[2]  S. Bochner Monotone Funktionen, Stieltjessche Integrale und harmonische Analyse , 1933 .

[3]  H. Hotelling Relations Between Two Sets of Variates , 1936 .

[4]  I. J. Schoenberg Metric spaces and completely monotone functions , 1938 .

[5]  N. Aronszajn Theory of Reproducing Kernels. , 1950 .

[6]  R. Fortet,et al.  Convergence de la répartition empirique vers la répartition théorique , 1953 .

[7]  A. Rényi On measures of dependence , 1959 .

[8]  A Tikhonov,et al.  Solution of Incorrectly Formulated Problems and the Regularization Method , 1963 .

[9]  V. Vapnik Pattern recognition using generalized portrait method , 1963 .

[10]  M. Aizerman,et al.  Theoretical Foundations of the Potential Function Method in Pattern Recognition Learning , 1964 .

[11]  O. Mangasarian Linear and Nonlinear Separation of Patterns by Linear Programming , 1965 .

[12]  J. Williamson Harmonic Analysis on Semigroups , 1967 .

[13]  Marvin Minsky,et al.  Perceptrons: An Introduction to Computational Geometry , 1969 .

[14]  E. Parzen STATISTICAL INFERENCE ON TIME SERIES BY RKHS METHODS. , 1970 .

[15]  R. Tyrrell Rockafellar,et al.  Convex Analysis , 1970, Princeton Landmarks in Mathematics and Physics.

[16]  G. Wahba,et al.  Some results on Tchebycheffian spline functions , 1971 .

[17]  Vladimir Vapnik,et al.  On the uniform convergence of relative frequencies of events to their probabilities , 1971 .

[18]  J. M. Hammersley,et al.  Markov fields on finite graphs and lattices , 1971 .

[19]  J. Darroch,et al.  Generalized Iterative Scaling for Log-Linear Models , 1972 .

[20]  L. Baum,et al.  An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process , 1972 .

[21]  M. Fiedler Algebraic connectivity of graphs , 1973 .

[22]  E. Polak Introduction to linear and nonlinear programming , 1973 .

[23]  Chong-sun Kim Canonical Analysis of Several Sets of Variables , 1973 .

[24]  John W. Tukey,et al.  A Projection Pursuit Algorithm for Exploratory Data Analysis , 1974, IEEE Transactions on Computers.

[25]  D. Bamber The area above the ordinal dominance graph and the area below the receiver operating characteristic graph , 1975 .

[26]  J. Stewart Positive definite functions and generalizations, an historical survey , 1976 .

[27]  P. McCullagh,et al.  Generalized Linear Models , 1984 .

[28]  W. Steiger,et al.  Least Absolute Deviations: Theory, Applications and Algorithms , 1984 .

[29]  B. Yandell,et al.  Semi-Parametric Generalized Linear Models. , 1985 .

[30]  C. Atkinson METHODS FOR SOLVING INCORRECTLY POSED PROBLEMS , 1985 .

[31]  B. Yandell,et al.  Automatic Smoothing of Regression Functions in Generalized Linear Models , 1986 .

[32]  Robin Sibson,et al.  What is projection pursuit , 1987 .

[33]  J. Friedman Exploratory Projection Pursuit , 1987 .

[34]  R. Fletcher Practical Methods of Optimization , 1988 .

[35]  F. Girosi,et al.  Networks for approximation and learning , 1990, Proc. IEEE.

[36]  Grace Wahba,et al.  Spline Models for Observational Data , 1990 .

[37]  Steffen L. Lauritzen,et al.  Bayesian updating in causal probabilistic networks by local computations , 1990 .

[38]  D. Mason,et al.  Generalized quantile processes , 1992 .

[39]  Bernhard E. Boser,et al.  A training algorithm for optimal margin classifiers , 1992, COLT '92.

[40]  O. Mangasarian,et al.  Robust linear programming discrimination of two linearly inseparable sets , 1992 .

[41]  A. P. Dawid,et al.  Applications of a general propagation algorithm for probabilistic expert systems , 1992 .

[42]  M. Murray,et al.  Differential Geometry and Statistics , 1993 .

[43]  Noga Alon,et al.  Scale-sensitive dimensions, uniform convergence, and learnability , 1993, Proceedings of 1993 IEEE 34th Annual Foundations of Computer Science.

[44]  Kenneth O. Kortanek,et al.  Semi-Infinite Programming: Theory, Methods, and Applications , 1993, SIAM Rev..

[45]  A. Buja,et al.  Projection Pursuit Indexes Based on Orthonormal Function Expansions , 1993 .

[46]  P. Sen,et al.  Restricted canonical correlations , 1994 .

[47]  W. Press,et al.  Numerical Recipes in Fortran: The Art of Scientific Computing; Numerical Recipes in C: The Art of Scientific Computing , 1994 .

[48]  C. Micchelli,et al.  Functions that preserve families of positive semidefinite matrices , 1995 .

[49]  G. Wahba,et al.  Smoothing spline ANOVA for exponential families, with application to the Wisconsin Epidemiological Study of Diabetic Retinopathy : the 1994 Neyman Memorial Lecture , 1995 .

[50]  Alexander J. Smola,et al.  Support Vector Method for Function Approximation, Regression Estimation and Signal Processing , 1996, NIPS.

[51]  Yoav Freund,et al.  Experiments with a New Boosting Algorithm , 1996, ICML.

[52]  R. Tibshirani Regression Shrinkage and Selection via the Lasso , 1996 .

[53]  David M. Magerman,et al.  Learning grammatical structure using statistical decision-trees , 1996, ICGI.

[54]  Bernhard Schölkopf,et al.  Support vector learning , 1997 .

[55]  Shun-ichi Amari,et al.  Adaptive Online Learning Algorithms for Blind Separation: Maximum Entropy and Minimum Mutual Information , 1997, Neural Computation.

[56]  John D. Lafferty,et al.  Inducing Features of Random Fields , 1995, IEEE Trans. Pattern Anal. Mach. Intell..

[57]  Bernhard Schölkopf,et al.  On a Kernel-Based Method for Pattern Recognition, Regression, Approximation, and Operator Inversion , 1998, Algorithmica.

[58]  Vladimir Vapnik,et al.  Statistical learning theory , 1998 .

[59]  Bernhard Schölkopf,et al.  Nonlinear Component Analysis as a Kernel Eigenvalue Problem , 1998, Neural Computation.

[60]  Jean-Francois Cardoso,et al.  Blind signal separation: statistical principles , 1998, Proc. IEEE.

[61]  Christopher J. C. Burges A Tutorial on Support Vector Machines for Pattern Recognition , 1998 .

[62]  Bernhard Schölkopf,et al.  The connection between regularization operators and support vector kernels , 1998, Neural Networks.

[63]  Michael A. Saunders,et al.  Atomic Decomposition by Basis Pursuit , 1998, SIAM J. Sci. Comput..

[64]  J. Dauxois,et al.  Nonlinear canonical analysis and independence tests , 1998 .

[65]  Alexander J. Smola,et al.  Learning with kernels , 1998 .

[66]  J. Weston,et al.  Support vector regression with ANOVA decomposition kernels , 1999 .

[67]  John C. Platt,et al.  Fast training of support vector machines using sequential minimal optimization , 1999, Advances in Kernel Methods.

[68]  David Haussler,et al.  Convolution kernels on discrete structures , 1999 .

[69]  C. Watkins Dynamic Alignment Kernels , 1999 .

[70]  Gunnar Rätsch,et al.  Engineering Support Vector Machine Kernels That Recognize Translation Initiation Sites , 2000, German Conference on Bioinformatics.

[71]  David Haussler,et al.  Probabilistic kernel regression models , 1999, AISTATS.

[72]  Robert P. W. Duin,et al.  Support vector domain description , 1999, Pattern Recognit. Lett..

[73]  John Shawe-Taylor,et al.  A Column Generation Algorithm For Boosting , 2000, ICML.

[74]  Nello Cristianini,et al.  An introduction to Support Vector Machines , 2000 .

[75]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[76]  Bernhard Schölkopf,et al.  New Support Vector Algorithms , 2000, Neural Computation.

[77]  Thore Graepel,et al.  Large Margin Rank Boundaries for Ordinal Regression , 2000 .

[78]  A. E. Hoerl,et al.  Ridge regression: biased estimation for nonorthogonal problems , 2000 .

[79]  Alexander J. Smola,et al.  Advances in Large Margin Classifiers , 2000 .

[80]  Yoram Singer,et al.  Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers , 2000, J. Mach. Learn. Res..

[81]  Peter L. Bartlett,et al.  Rademacher and Gaussian Complexities: Risk Bounds and Structural Results , 2003, J. Mach. Learn. Res..

[82]  Vladimir Koltchinskii,et al.  Rademacher penalties and structural risk minimization , 2001, IEEE Trans. Inf. Theory.

[83]  N. Cristianini,et al.  On Kernel-Target Alignment , 2001, NIPS.

[84]  Jason Weston,et al.  Kernel methods for Multi-labelled classification and Categorical regression problems , 2001, NIPS 2001.

[85]  Bernhard Schölkopf,et al.  Estimating the Support of a High-Dimensional Distribution , 2001, Neural Computation.

[86]  Ingo Steinwart,et al.  On the Influence of the Kernel on the Consistency of Support Vector Machines , 2002, J. Mach. Learn. Res..

[87]  Koby Crammer,et al.  On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines , 2002, J. Mach. Learn. Res..

[88]  Michael Collins,et al.  Convolution Kernels for Natural Language , 2001, NIPS.

[89]  Erkki Oja,et al.  Independent Component Analysis , 2001 .

[90]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[91]  Ralf Herbrich,et al.  Learning Kernel Classifiers: Theory and Algorithms , 2001 .

[92]  Thorsten Joachims,et al.  Learning to classify text using support vector machines - methods, theory and algorithms , 2002, The Kluwer international series in engineering and computer science.

[93]  Bernhard Schölkopf,et al.  Kernel Dependency Estimation , 2002, NIPS.

[94]  Eleazar Eskin,et al.  The Spectrum Kernel: A String Kernel for SVM Protein Classification , 2001, Pacific Symposium on Biocomputing.

[95]  Ingo Steinwart,et al.  Support Vector Machines are Universally Consistent , 2002, J. Complex..

[96]  Alexander J. Smola,et al.  Fast Kernels for String and Tree Matching , 2002, NIPS.

[97]  Shahar Mendelson,et al.  A Few Notes on Statistical Learning Theory , 2002, Machine Learning Summer School.

[98]  Risi Kondor,et al.  Diffusion kernels on graphs and other discrete structures , 2002, ICML 2002.

[99]  Ben Taskar,et al.  Max-Margin Markov Networks , 2003, NIPS.

[100]  Hisashi Kashima,et al.  Marginalized Kernels Between Labeled Graphs , 2003, ICML.

[101]  Thomas Gärtner,et al.  A survey of kernels for structured data , 2003, SKDD.

[102]  Thomas Hofmann,et al.  Hidden Markov Support Vector Machines , 2003, ICML.

[103]  Alexander J. Smola,et al.  Kernels and Regularization on Graphs , 2003, COLT.

[104]  Fernando Pereira,et al.  Shallow Parsing with Conditional Random Fields , 2003, NAACL.

[105]  Bernhard Schölkopf,et al.  Kernel Methods in Computational Biology , 2005 .

[106]  Michael I. Jordan,et al.  Kernel independent component analysis , 2003 .

[107]  R. Kondor,et al.  Bhattacharyya and Expected Likelihood Kernels , 2003 .

[108]  Shai Ben-David,et al.  On the difficulty of approximately maximizing agreements , 2000, J. Comput. Syst. Sci..

[109]  Yoram Singer,et al.  Log-Linear Models for Label Ranking , 2003, NIPS.

[110]  Matthias Hein,et al.  Maximal Margin Classification for Metric Spaces , 2003, COLT.

[111]  Xiaojin Zhu,et al.  Kernel conditional random fields: representation and clique selection , 2004, ICML.

[112]  Thomas Hofmann,et al.  Unifying collaborative and content-based filtering , 2004, ICML.

[113]  Ben Taskar,et al.  Max-Margin Parsing , 2004, EMNLP.

[114]  Holger Wendland,et al.  Scattered Data Approximation: Conditionally positive definite functions , 2004 .

[115]  Zaïd Harchaoui,et al.  A Machine Learning Approach to Conjoint Analysis , 2004, NIPS.

[116]  Corinna Cortes,et al.  Support-Vector Networks , 1995, Machine Learning.

[117]  T. Poggio,et al.  On optimal nonlinear associative recall , 1975, Biological Cybernetics.

[118]  B. Ripley,et al.  Robust Statistics , 2018, Encyclopedia of Mathematical Geosciences.

[119]  Bernhard Schölkopf,et al.  A kernel view of the dimensionality reduction of manifolds , 2004, ICML.

[120]  Thomas Hofmann,et al.  Gaussian process classification for segmenting and annotating sequences , 2004, ICML.

[121]  Bernhard Schölkopf,et al.  Training Invariant Support Vector Machines , 2002, Machine Learning.

[122]  Nello Cristianini,et al.  Kernel Methods for Pattern Analysis , 2004 .

[123]  Thomas Hofmann,et al.  Exponential Families for Conditional Random Fields , 2004, UAI.

[124]  O. Bousquet THEORY OF CLASSIFICATION: A SURVEY OF RECENT ADVANCES , 2004 .

[125]  Bernhard Schölkopf,et al.  Measuring Statistical Dependence with Hilbert-Schmidt Norms , 2005, ALT.

[126]  Bernhard Schölkopf,et al.  Iterative kernel principal component analysis for image modeling , 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[127]  Jason Weston,et al.  A general regression technique for learning transductions , 2005, ICML '05.

[128]  Andrew McCallum,et al.  A Conditional Random Field for Discriminatively-trained Finite-state String Edit Distance , 2005, UAI.

[129]  Luke S. Zettlemoyer,et al.  Learning to Map Sentences to Logical Form: Structured Classification with Probabilistic Categorial Grammars , 2005, UAI.

[130]  P. Bickel,et al.  Consistent independent component analysis and prewhitening , 2005, IEEE Transactions on Signal Processing.

[131]  Michael Collins,et al.  Discriminative Reranking for Natural Language Parsing , 2000, CL.

[132]  Koby Crammer,et al.  Loss Bounds for Online Category Ranking , 2005, COLT.

[133]  Thomas Hofmann,et al.  Large Margin Methods for Structured and Interdependent Output Variables , 2005, J. Mach. Learn. Res..

[134]  Bernhard Schölkopf,et al.  Kernel Constrained Covariance for Dependence Measurement , 2005, AISTATS.

[135]  Thorsten Joachims,et al.  A support vector method for multivariate performance measures , 2005, ICML.

[136]  Michael I. Jordan,et al.  Convexity, Classification, and Risk Bounds , 2006 .

[137]  V. Vapnik Estimation of Dependences Based on Empirical Data , 2006 .

[138]  Thomas Hofmann,et al.  A Review of Kernel Methods in Machine Learning , 2006 .

[139]  Hans-Peter Kriegel,et al.  Integrating structured biological data by Kernel Maximum Mean Discrepancy , 2006, ISMB.

[140]  Alexander J. Smola,et al.  Binet-Cauchy Kernels on Dynamical Systems and its Application to the Analysis of Dynamic Scenes , 2007, International Journal of Computer Vision.

[141]  Gökhan Bakır,et al.  Predicting Structured Data , 2008 .

[142]  Gunnar Rätsch,et al.  Improving the Caenorhabditis elegans Genome Annotation Using Machine Learning , 2006, PLoS Comput. Biol..

[143]  Matthew K Doherty,et al.  Gene prediction with conditional random fields , 2007 .

[144]  Michael I. Jordan,et al.  Graphical Models, Exponential Families, and Variational Inference , 2008, Found. Trends Mach. Learn..

[145]  Carl E. Rasmussen,et al.  Gaussian processes for machine learning , 2005, Adaptive computation and machine learning.

[146]  D. Hilbert Grundzüge einer allgemeinen Theorie der linearen Integralgleichungen , 2009 .

[147]  K. Schittkowski,et al.  NONLINEAR PROGRAMMING , 2022 .