A Tutorial Review of RKHS Methods in Machine Learning

Over the last ten years, estimation and learning methods utilizing positive definite kernels have become rather popular, particularly in machine learning. Since these methods have a stronger mathematical slant than earlier machine learning methods (e.g., neural networks), there is also significant interest in them in the statistics and mathematics communities. The present review aims to summarize the state of the art on a conceptual level. In doing so, we build on various sources (including the books [Vapnik, 1998; Burges, 1998; Cristianini and Shawe-Taylor, 2000; Herbrich, 2002] and in particular [Schölkopf and Smola, 2002]), but we also add a fair amount of recent material which helps unify the exposition.

The main idea of all the described methods can be summarized in one paragraph. Traditionally, the theory and algorithms of machine learning and statistics have been very well developed for the linear case. Real-world data analysis problems, on the other hand, often require nonlinear methods to detect the kind of dependences that allow successful prediction of properties of interest. By using a positive definite kernel, one can sometimes have the best of both worlds. The kernel corresponds to a dot product in a (usually high-dimensional) feature space. In this space, our estimation methods are linear, but as long as we can formulate everything in terms of kernel evaluations, we never explicitly have to work in the high-dimensional feature space.
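As a small illustration of the claim that a kernel corresponds to a dot product in a feature space, the sketch below compares two ways of computing the same number. It assumes the homogeneous polynomial kernel of degree 2, k(x, y) = (x · y)^2, which is a standard textbook example and is not singled out in the text above. The function names `poly2_kernel` and `poly2_features` are hypothetical and chosen only for this example.

```python
from itertools import product
from math import isclose


def dot(u, v):
    """Ordinary dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def poly2_kernel(x, y):
    """Kernel evaluation in input space: k(x, y) = (x . y)^2."""
    return dot(x, y) ** 2


def poly2_features(x):
    """Explicit feature map for the same kernel (illustrative assumption).

    Maps x in R^d to all d^2 ordered products x_i * x_j, so that
    dot(poly2_features(x), poly2_features(y)) equals (x . y)^2.
    """
    return [a * b for a, b in product(x, repeat=2)]


x = [1.0, 2.0, 3.0]
y = [0.5, -1.0, 2.0]

# One dot product in 3 dimensions, then a square.
k_direct = poly2_kernel(x, y)

# Same value via an explicit 9-dimensional feature space.
k_feature = dot(poly2_features(x), poly2_features(y))

print(k_direct, k_feature)
```

Both routes give the same value, but the kernel evaluation never builds the 9-dimensional feature vectors. For higher degrees or for kernels with infinite-dimensional feature spaces, only the kernel route remains practical, which is the point made in the paragraph above.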

[1]  Peter L. Bartlett,et al.  Rademacher and Gaussian Complexities: Risk Bounds and Structural Results , 2003, J. Mach. Learn. Res..

[2]  Shahar Mendelson,et al.  Rademacher averages and phase transitions in Glivenko-Cantelli classes , 2002, IEEE Trans. Inf. Theory.

[3]  Quoc V. Le,et al.  Nonparametric Quantile Regression , 2005 .

[4]  I. J. Schoenberg Metric spaces and completely monotone functions , 1938 .

[5]  Bernhard Schölkopf,et al.  On a Kernel-Based Method for Pattern Recognition, Regression, Approximation, and Operator Inversion , 1998, Algorithmica.

[6]  Gunnar Rätsch,et al.  Soft Margins for AdaBoost , 2001, Machine Learning.

[7]  D. Nolan The excess-mass ellipsoid , 1991 .

[8]  Andrew R. Barron,et al.  Universal approximation bounds for superpositions of a sigmoidal function , 1993, IEEE Trans. Inf. Theory.

[9]  William T. Freeman,et al.  Understanding belief propagation and its generalizations , 2003 .

[10]  F. Girosi,et al.  Networks for approximation and learning , 1990, Proc. IEEE.

[11]  Allan Pinkus,et al.  Strictly Positive Definite Functions on a Real Inner Product Space , 2004, Adv. Comput. Math..

[12]  John Shawe-Taylor,et al.  A Column Generation Algorithm For Boosting , 2000, ICML.

[13]  Ben Taskar,et al.  Max-Margin Markov Networks , 2003, NIPS.

[14]  Y. Makovoz Random Approximants and Neural Networks , 1996 .

[15]  Xiaojin Zhu,et al.  Kernel conditional random fields: representation and clique selection , 2004, ICML.

[16]  Martin J. Wainwright,et al.  Semidefinite Relaxations for Approximate Inference on Graphs with Cycles , 2003, NIPS.

[17]  Christopher J. C. Burges,et al.  Simplified Support Vector Decision Rules , 1996, ICML.

[18]  Thomas Gärtner,et al.  Multi-Instance Kernels , 2002, ICML.

[19]  D. Mason,et al.  Generalized quantile processes , 1992 .

[20]  Arthur E. Hoerl,et al.  Ridge Regression: Biased Estimation for Nonorthogonal Problems , 2000, Technometrics.

[21]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[22]  A Tikhonov,et al.  Solution of Incorrectly Formulated Problems and the Regularization Method , 1963 .

[23]  Thorsten Joachims,et al.  A support vector method for multivariate performance measures , 2005, ICML.

[24]  Bernhard Schölkopf,et al.  A Generalized Representer Theorem , 2001, COLT/EuroCOLT.

[25]  Michael I. Jordan,et al.  Loopy Belief Propagation for Approximate Inference: An Empirical Study , 1999, UAI.

[26]  Gunnar Rätsch,et al.  Robust Boosting via Convex Optimization: Theory and Applications , 2007 .

[27]  F. Girosi,et al.  From regularization to radial, tensor and additive splines , 1993, Neural Networks for Signal Processing III - Proceedings of the 1993 IEEE-SP Workshop.

[28]  R. Fletcher Practical Methods of Optimization , 1988 .

[29]  Thomas Hofmann,et al.  Unifying collaborative and content-based filtering , 2004, ICML.

[30]  Michael I. Jordan,et al.  Convexity, Classification, and Risk Bounds , 2006 .

[31]  Ben Taskar,et al.  Max-Margin Parsing , 2004, EMNLP.

[32]  Christian Berg,et al.  Potential Theory on Locally Compact Abelian Groups , 1975 .

[33]  Robin Sibson,et al.  What is projection pursuit , 1987 .

[34]  John Shawe-Taylor,et al.  Structural Risk Minimization Over Data-Dependent Hierarchies , 1998, IEEE Trans. Inf. Theory.

[35]  J. Weston,et al.  Support vector regression with ANOVA decomposition kernels , 1999 .

[36]  Alexander J. Smola,et al.  Binet-Cauchy Kernels , 2004, NIPS.

[37]  Ralf Herbrich,et al.  Learning Kernel Classifiers: Theory and Algorithms , 2001 .

[38]  Thorsten Joachims,et al.  Learning to classify text using support vector machines - methods, theory and algorithms , 2002, The Kluwer international series in engineering and computer science.

[39]  Bernhard E. Boser,et al.  A training algorithm for optimal margin classifiers , 1992, COLT '92.

[40]  J. Kettenring,et al.  Canonical Analysis of Several Sets of Variables , 1971 .

[41]  Christopher K. I. Williams Prediction with Gaussian Processes: From Linear Regression to Linear Prediction and Beyond , 1999, Learning in Graphical Models.

[42]  Eugene Charniak,et al.  Statistical Parsing with a Context-Free Grammar and Word Statistics , 1997, AAAI/IAAI.

[43]  Jason Weston,et al.  Mismatch String Kernels for SVM Protein Classification , 2002, NIPS.

[44]  Eric R. Ziegel,et al.  Generalized Linear Models , 2002, Technometrics.

[45]  Hisashi Kashima,et al.  Marginalized Kernels Between Labeled Graphs , 2003, ICML.

[46]  G. Wahba,et al.  Some results on Tchebycheffian spline functions , 1971 .

[47]  Jason Weston,et al.  A kernel method for multi-labelled classification , 2001, NIPS.

[48]  Vladimir Koltchinskii,et al.  Rademacher penalties and structural risk minimization , 2001, IEEE Trans. Inf. Theory.

[49]  M. Fiedler Algebraic connectivity of graphs , 1973 .

[50]  Vapnik,et al.  SVMs for Histogram Based Image Classification , 1999 .

[51]  Bernhard Schölkopf,et al.  Support vector learning , 1997 .

[52]  Bernhard Schölkopf,et al.  Kernel Dependency Estimation , 2002, NIPS.

[53]  N. Cristianini,et al.  On Kernel-Target Alignment , 2001, NIPS.

[54]  J. Darroch,et al.  Generalized Iterative Scaling for Log-Linear Models , 1972 .

[55]  Vladimir Vapnik,et al.  Statistical learning theory , 1998 .

[56]  Bernhard Schölkopf,et al.  Measuring Statistical Dependence with Hilbert-Schmidt Norms , 2005, ALT.

[57]  Bernhard Schölkopf,et al.  Nonlinear Component Analysis as a Kernel Eigenvalue Problem , 1998, Neural Computation.

[58]  J. Stewart Positive definite functions and generalizations, an historical survey , 1976 .

[59]  Thorsten Joachims,et al.  Text Categorization with Support Vector Machines: Learning with Many Relevant Features , 1998, ECML.

[60]  Gunnar Rätsch,et al.  Adapting Codes and Embeddings for Polychotomies , 2002, NIPS.

[61]  Zaïd Harchaoui,et al.  A Machine Learning Approach to Conjoint Analysis , 2004, NIPS.

[62]  P. Rujan A Fast Method for Calculating the Perceptron with Maximal Stability , 1993 .

[63]  S. Klinke,et al.  Exploratory Projection Pursuit , 1995 .

[64]  Alexander J. Smola,et al.  Support Vector Method for Function Approximation, Regression Estimation and Signal Processing , 1996, NIPS.

[65]  W. Krauth,et al.  Learning algorithms with optimal stability in neural networks , 1987 .

[66]  H. Hotelling Relations Between Two Sets of Variates , 1936 .

[67]  Koby Crammer,et al.  On the Learnability and Design of Output Codes for Multiclass Problems , 2002, Machine Learning.

[68]  V. Vapnik Estimation of Dependences Based on Empirical Data , 2006 .

[69]  B. Yandell,et al.  Automatic Smoothing of Regression Functions in Generalized Linear Models , 1986 .

[70]  O. Mangasarian,et al.  Robust linear programming discrimination of two linearly inseparable sets , 1992 .

[71]  Yoav Freund,et al.  Boosting the margin: A new explanation for the effectiveness of voting methods , 1997, ICML.

[72]  John C. Platt,et al.  Fast training of support vector machines using sequential minimal optimization , 1999, Advances in Kernel Methods.

[73]  C. Micchelli,et al.  Functions that preserve families of positive semidefinite matrices , 1995 .

[74]  Stephen P. Boyd,et al.  Convex Optimization , 2004, Algorithms and Theory of Computation Handbook.

[75]  Frank Jensen,et al.  Optimal junction Trees , 1994, UAI.

[76]  Corinna Cortes,et al.  Support-Vector Networks , 1995, Machine Learning.

[77]  Bernhard Schölkopf,et al.  Iterative kernel principal component analysis for image modeling , 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[78]  L. Baum,et al.  An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process , 1972 .

[79]  Vladimir Vapnik,et al.  The Nature of Statistical Learning , 1995 .

[80]  G. Wahba Spline models for observational data , 1990 .

[81]  Thomas Gärtner,et al.  A survey of kernels for structured data , 2003, SKDD.

[82]  Sean R. Eddy,et al.  Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids , 1998 .

[83]  Bernhard Schölkopf,et al.  Comparison of View-Based Object Recognition Algorithms Using Realistic 3D Models , 1996, ICANN.

[84]  Michael Collins,et al.  Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms , 2002, EMNLP.

[85]  D. Cox,et al.  Asymptotic Analysis of Penalized Likelihood and Related Estimators , 1990 .

[86]  Judea Pearl,et al.  Probabilistic reasoning in intelligent systems , 1988 .

[87]  Eleazar Eskin,et al.  The Spectrum Kernel: A String Kernel for SVM Protein Classification , 2001, Pacific Symposium on Biocomputing.

[88]  A. Tsybakov On nonparametric estimation of density level sets , 1997 .

[89]  David Haussler,et al.  Convolution kernels on discrete structures , 1999 .

[90]  Thomas Hofmann,et al.  Hidden Markov Support Vector Machines , 2003, ICML.

[91]  Alexander J. Smola,et al.  Kernels and Regularization on Graphs , 2003, COLT.

[92]  T. Sager An Iterative Method for Estimating a Multivariate Mode and Isopleth , 1979 .

[93]  G. Wahba,et al.  Smoothing spline ANOVA for exponential families, with application to the Wisconsin Epidemiological Study of Diabetic Retinopathy : the 1994 Neyman Memorial Lecture , 1995 .

[94]  Christopher J. C. Burges A Tutorial on Support Vector Machines for Pattern Recognition , 1998 .

[95]  Yoav Freund,et al.  Experiments with a New Boosting Algorithm , 1996, ICML.

[96]  T. Poggio,et al.  On optimal nonlinear associative recall , 1975, Biological Cybernetics.

[97]  Colin McDiarmid On the method of bounded differences , 1989, Surveys in Combinatorics.

[98]  Koby Crammer Online Learning for Complex Categorial Problems , 2005 .

[99]  Ingo Steinwart,et al.  Support Vector Machines are Universally Consistent , 2002, J. Complex..

[100]  Alexander J. Smola,et al.  Regression estimation with support vector learning machines , 1996 .

[101]  Andrew McCallum,et al.  A Conditional Random Field for Discriminatively-trained Finite-state String Edit Distance , 2005, UAI.

[102]  J. Friedman Additive logistic regression: A statistical view of boosting , 2000 .

[103]  Alexander J. Smola,et al.  Fast Kernels for String and Tree Matching , 2002, NIPS.

[104]  Michael Gribskov,et al.  Use of Receiver Operating Characteristic (ROC) Analysis to Evaluate Sequence Matching , 1996, Comput. Chem..

[105]  Luke S. Zettlemoyer,et al.  Learning to Map Sentences to Logical Form: Structured Classification with Probabilistic Categorial Grammars , 2005, UAI.

[106]  Tong Zhang Statistical behavior and consistency of classification methods based on convex risk minimization , 2003 .

[107]  Eugene Charniak,et al.  A Maximum-Entropy-Inspired Parser , 2000, ANLP.

[108]  Gunnar Rätsch,et al.  Kernel PCA and De-Noising in Feature Spaces , 1998, NIPS.

[109]  W. Rudin,et al.  Fourier Analysis on Groups. , 1965 .

[110]  B. Schölkopf,et al.  Efficient face detection by a cascaded support–vector machine expansion , 2004, Proceedings of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences.

[111]  Bart De Moor,et al.  Subspace angles between ARMA models , 2002, Syst. Control. Lett..

[112]  Dan Gusfield,et al.  Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology , 1997 .

[113]  V. Vapnik Pattern recognition using generalized portrait method , 1963 .

[114]  C. Watkins Dynamic Alignment Kernels , 1999 .

[115]  Ralf Herbrich,et al.  Algorithmic Luckiness , 2001, J. Mach. Learn. Res..

[116]  Susan A. Murphy,et al.  Monographs on statistics and applied probability , 1990 .

[117]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[118]  Lior Wolf,et al.  Learning over Sets using Kernel Principal Angles , 2003, J. Mach. Learn. Res..

[119]  Michael Collins,et al.  Discriminative Reranking for Natural Language Parsing , 2000, CL.

[120]  E. Parzen STATISTICAL INFERENCE ON TIME SERIES BY RKHS METHODS. , 1970 .

[121]  Fernando Pereira,et al.  Shallow Parsing with Conditional Random Fields , 2003, NAACL.

[122]  Bernhard Schölkopf,et al.  The connection between regularization operators and support vector kernels , 1998, Neural Networks.

[123]  K. Karhunen Zur Spektraltheorie stochastischer prozesse , 1946 .

[124]  J. Besag Spatial Interaction and the Statistical Analysis of Lattice Systems , 1974 .

[125]  B. Ripley,et al.  Robust Statistics , 2018, Encyclopedia of Mathematical Geosciences.

[126]  Michael I. Jordan,et al.  Kernel independent component analysis , 2003, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03).

[127]  Bernhard Schölkopf,et al.  New Support Vector Algorithms , 2000, Neural Computation.

[128]  Herbert Meschkowski,et al.  Hilbertsche Räume mit Kernfunktion , 1962 .

[129]  F. A. Seiler,et al.  Numerical Recipes in C: The Art of Scientific Computing , 1989 .

[130]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[131]  Shahar Mendelson,et al.  A Few Notes on Statistical Learning Theory , 2002, Machine Learning Summer School.

[132]  Thomas Hofmann,et al.  Support vector machine learning for interdependent and structured output spaces , 2004, ICML.

[133]  P. Sen,et al.  Restricted canonical correlations , 1994 .

[134]  Thore Graepel,et al.  Large Margin Rank Boundaries for Ordinal Regression , 2000 .

[135]  J. Kahane Some Random Series of Functions , 1985 .

[136]  Michael A. Saunders,et al.  Atomic Decomposition by Basis Pursuit , 1998, SIAM J. Sci. Comput..

[137]  Bernhard Schölkopf,et al.  Estimating the Support of a High-Dimensional Distribution , 2001, Neural Computation.

[138]  J. Dauxois,et al.  Nonlinear canonical analysis and independence tests , 1998 .

[139]  H. Hotelling Analysis of a complex of statistical variables into principal components. , 1933 .

[140]  Ingo Steinwart,et al.  On the Influence of the Kernel on the Consistency of Support Vector Machines , 2002, J. Mach. Learn. Res..

[141]  Nello Cristianini,et al.  Kernel Methods for Pattern Analysis , 2003, ICTAI.

[142]  Thomas Hofmann,et al.  Gaussian process classification for segmenting and annotating sequences , 2004, ICML.

[143]  Bernhard Schölkopf,et al.  Sparse Greedy Matrix Approximation for Machine Learning , 2000, International Conference on Machine Learning.

[144]  Vladimir Vapnik, A. Chervonenkis On the uniform convergence of relative frequencies of events to their probabilities , 1971 .

[145]  B. Yandell,et al.  Semi-Parametric Generalized Linear Models. , 1985 .

[146]  Andrew McCallum,et al.  Gene Prediction with Conditional Random Fields , 2005 .

[147]  Christopher K. I. Williams,et al.  Pascal Visual Object Classes Challenge Results , 2005 .

[148]  Koby Crammer,et al.  Loss Bounds for Online Category Ranking , 2005, COLT.

[149]  A. Tsybakov,et al.  Optimal aggregation of classifiers in statistical learning , 2003 .

[150]  D Haussler,et al.  Knowledge-based analysis of microarray gene expression data by using support vector machines. , 2000, Proceedings of the National Academy of Sciences of the United States of America.

[151]  Thomas Hofmann,et al.  Large margin methods for label sequence learning , 2003, INTERSPEECH.

[152]  Leo Breiman,et al.  Prediction Games and Arcing Algorithms , 1999, Neural Computation.

[153]  Risi Kondor,et al.  Diffusion kernels on graphs and other discrete structures , 2002, ICML 2002.

[154]  Bernhard Schölkopf,et al.  Support Vector Novelty Detection Applied to Jet Engine Vibration Spectra , 2000, NIPS.

[155]  Steven A. Orszag,et al.  CBMS-NSF REGIONAL CONFERENCE SERIES IN APPLIED MATHEMATICS , 1978 .

[156]  J. Hartigan Estimation of a Convex Density Contour in Two Dimensions , 1987 .

[157]  W. Steiger,et al.  Least Absolute Deviations: Theory, Applications and Algorithms , 1984 .

[158]  J. Mercer Functions of Positive and Negative Type, and their Connection with the Theory of Integral Equations , 1909 .

[159]  Yoram Singer,et al.  Logistic Regression, AdaBoost and Bregman Distances , 2000, Machine Learning.

[160]  V. A. Morozov,et al.  Methods for Solving Incorrectly Posed Problems , 1984 .

[161]  Bernhard Schölkopf,et al.  Kernel Methods in Computational Biology , 2005 .

[162]  Christopher K. I. Williams,et al.  Using the Nyström Method to Speed Up Kernel Machines , 2000, NIPS.

[163]  Richard F. Gunst,et al.  Applied Regression Analysis , 1999, Technometrics.

[164]  John D. Lafferty,et al.  Inducing Features of Random Fields , 1995, IEEE Trans. Pattern Anal. Mach. Intell..

[165]  A. P. Dawid,et al.  Applications of a general propagation algorithm for probabilistic expert systems , 1992 .

[166]  C. Berg,et al.  Harmonic Analysis on Semigroups , 1984 .

[167]  Richard Cole,et al.  Faster suffix tree construction with missing suffix links , 2000, STOC '00.

[168]  Bernhard Schölkopf,et al.  Training Invariant Support Vector Machines , 2002, Machine Learning.

[169]  Gunnar Rätsch,et al.  Engineering Support Vector Machine Kernels That Recognize Translation Initiation Sites , 2000, German Conference on Bioinformatics.

[170]  W. Polonik Minimum volume sets and generalized quantile processes , 1997 .

[171]  D K Smith,et al.  Numerical Optimization , 2001, J. Oper. Res. Soc..

[172]  Thomas Hofmann,et al.  Exponential Families for Conditional Random Fields , 2004, UAI.

[173]  Mathieu Raffinot,et al.  Fast Regular Expression Search , 1999, WAE.

[174]  David Haussler,et al.  Probabilistic kernel regression models , 1999, AISTATS.

[175]  Kenneth O. Kortanek,et al.  Semi-Infinite Programming: Theory, Methods, and Applications , 1993, SIAM Rev..

[176]  M. Aizerman,et al.  Theoretical Foundations of the Potential Function Method in Pattern Recognition Learning , 1964 .

[177]  David J. Spiegelhalter,et al.  Probabilistic Networks and Expert Systems , 1999, Information Science and Statistics.

[178]  Alexander J. Smola,et al.  Advances in Large Margin Classifiers , 2000 .

[179]  A. Buja,et al.  Projection Pursuit Indexes Based on Orthonormal Function Expansions , 1993 .

[180]  Yoram Singer,et al.  Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers , 2000, J. Mach. Learn. Res..

[181]  O. Mangasarian Linear and Nonlinear Separation of Patterns by Linear Programming , 1965 .

[182]  Koby Crammer,et al.  On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines , 2002, J. Mach. Learn. Res..

[183]  K. Schittkowski,et al.  NONLINEAR PROGRAMMING , 2022 .

[184]  Alexander J. Smola,et al.  Learning with kernels , 1998 .

[185]  Thomas Hofmann,et al.  Large Margin Methods for Structured and Interdependent Output Variables , 2005, J. Mach. Learn. Res..

[186]  D. Bamber The area above the ordinal dominance graph and the area below the receiver operating characteristic graph , 1975 .

[187]  Yukio Shibata,et al.  On the tree representation of chordal graphs , 1988, J. Graph Theory.

[188]  S. Sinha A Duality Theorem for Nonlinear Programming , 1966 .

[189]  David M. Magerman,et al.  Learning grammatical structure using statistical decision-trees , 1996, ICGI.

[190]  Bernhard Schölkopf,et al.  Kernel Constrained Covariance for Dependence Measurement , 2005, AISTATS.

[191]  Bernhard Schölkopf,et al.  Learning from labeled and unlabeled data on a directed graph , 2005, ICML.

[192]  Noga Alon,et al.  Scale-sensitive dimensions, uniform convergence, and learnability , 1997, JACM.

[193]  John W. Tukey,et al.  A Projection Pursuit Algorithm for Exploratory Data Analysis , 1974, IEEE Transactions on Computers.

[194]  Michael I. Jordan Graphical Models , 2003 .

[195]  David Haussler,et al.  Exploiting Generative Models in Discriminative Classifiers , 1998, NIPS.

[196]  Alexander J. Smola,et al.  Support Vector Regression Machines , 1996, NIPS.

[197]  Michael Collins,et al.  Convolution Kernels for Natural Language , 2001, NIPS.

[198]  Dustin Boswell,et al.  Introduction to Support Vector Machines , 2002 .

[199]  H. Kashima,et al.  Kernels for graphs , 2004 .

[200]  Richard J. Martin A metric for ARMA processes , 2000, IEEE Trans. Signal Process..

[201]  J. M. Hammersley,et al.  Markov fields on finite graphs and lattices , 1971 .

[202]  O. Bousquet THEORY OF CLASSIFICATION: A SURVEY OF RECENT ADVANCES , 2004 .

[203]  Nello Cristianini,et al.  Classification using String Kernels , 2000 .

[204]  Gunnar Rätsch,et al.  Predicting Time Series with Support Vector Machines , 1997, ICANN.

[205]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[206]  R. Kondor,et al.  Bhattacharyya and Expected Likelihood Kernels , 2003 .

[207]  Walter W Garvin,et al.  Introduction to Linear Programming , 2018, Linear Programming and Resource Allocation Modeling.

[208]  S. Bochner Monotone Funktionen, Stieltjessche Integrale und harmonische Analyse , 1933 .

[209]  Marvin Minsky,et al.  Perceptrons: An Introduction to Computational Geometry , 1969 .

[210]  Shai Ben-David,et al.  On the difficulty of approximately maximizing agreements , 2000, J. Comput. Syst. Sci..

[211]  Philip M. Long,et al.  Fat-shattering and the learnability of real-valued functions , 1994, COLT '94.

[212]  N. Aronszajn Theory of Reproducing Kernels. , 1950 .

[213]  Yoram Singer,et al.  Log-Linear Models for Label Ranking , 2003, NIPS.

[214]  Pietro Perona,et al.  Combining generative models and Fisher kernels for object recognition , 2005, Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1.

[215]  Matthias Hein,et al.  Maximal Margin Classification for Metric Spaces , 2003, COLT.