Kernel methods in machine learning

We review machine learning methods employing positive definite kernels. These methods formulate learning and estimation problems in a reproducing kernel Hilbert space (RKHS) of functions defined on the data domain, expanded in terms of a kernel. Working in linear spaces of functions facilitates the construction and analysis of learning algorithms while still allowing large classes of functions, including nonlinear functions and functions defined on nonvectorial data. We cover a wide range of methods, from binary classifiers to sophisticated methods for estimation with structured data.
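As a rough illustration of a function "expanded in terms of a kernel", the sketch below trains a kernel perceptron, a simple binary classifier whose decision function has the form f(x) = sum_i alpha_i y_i k(x_i, x). The choice of algorithm, the Gaussian kernel, the XOR toy data, and all names (`gaussian_kernel`, `train_kernel_perceptron`, `predict`, `gamma`) are assumptions made for this example and are not taken from the paper.

```python
import math

def gaussian_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel, a standard positive definite kernel."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

def decision_value(x, X, y, alpha, kernel):
    """Evaluate the kernel expansion f(x) = sum_i alpha_i * y_i * k(x_i, x)."""
    return sum(a * yi * kernel(xi, x) for a, xi, yi in zip(alpha, X, y))

def train_kernel_perceptron(X, y, kernel, max_epochs=100):
    """Learn expansion coefficients alpha with the perceptron update rule."""
    alpha = [0.0] * len(X)
    for _ in range(max_epochs):
        mistakes = 0
        for i, (xi, yi) in enumerate(zip(X, y)):
            # A non-positive margin counts as a mistake: strengthen point i.
            if yi * decision_value(xi, X, y, alpha, kernel) <= 0:
                alpha[i] += 1.0
                mistakes += 1
        if mistakes == 0:  # every training point is classified correctly
            break
    return alpha

def predict(x, X, y, alpha, kernel):
    """Binary prediction in {-1, +1} from the sign of the expansion."""
    return 1 if decision_value(x, X, y, alpha, kernel) > 0 else -1

# XOR labels: not separable by any linear function of the raw inputs.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [-1, 1, 1, -1]

alpha = train_kernel_perceptron(X, y, gaussian_kernel)
predictions = [predict(x, X, y, alpha, gaussian_kernel) for x in X]
print(predictions)
```

The classifier is linear in the coefficients `alpha` yet nonlinear in the inputs, which is the sense in which working in a linear space of functions still allows a large function class. Replacing `gaussian_kernel` with a kernel defined on strings or graphs would leave the training loop unchanged.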
