Optimising Kernel Parameters and Regularisation Coefficients for Non-linear Discriminant Analysis

In this paper we consider a novel Bayesian interpretation of Fisher's discriminant analysis. We relate Rayleigh's coefficient to a noise model that minimises a cost based on the most probable class centres and that abandons the 'regression to the labels' assumption used by other algorithms. Optimisation of the noise model yields a direction of discrimination equivalent to Fisher's discriminant, and with the incorporation of a prior we can apply Bayes' rule to infer the posterior distribution of the direction of discrimination. Nonetheless, we argue that an additional constraining distribution has to be included if sensible results are to be obtained. Going further, with the use of a Gaussian process prior we show the equivalence of our model to a regularised kernel Fisher's discriminant. A key advantage of our approach is the facility to determine kernel parameters and the regularisation coefficient through the optimisation of the marginal log-likelihood of the data. An added bonus of the new formulation is that it enables us to link the regularisation coefficient with the generalisation error.
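As a point of reference for the quantities named above, the following is a minimal sketch of the classical regularised Fisher discriminant for two classes of 2-D points. It is not the Bayesian model proposed in the paper. The function names (`fisher_direction`, `rayleigh`), the restriction to two dimensions, and the ridge term `reg` added to the within-class scatter are all assumptions made for illustration. The sketch computes the direction w proportional to (S_w + reg I)^{-1} (m1 - m0), which maximises the Rayleigh coefficient J(w) = (w^T (m1 - m0))^2 / (w^T (S_w + reg I) w).

```python
import random


def mean(points):
    """Component-wise mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(2)]


def scatter(points, centre):
    """2x2 scatter matrix: sum of outer products of the centred points."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - centre[0], p[1] - centre[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def rayleigh(w, m0, m1, a):
    """Rayleigh coefficient: squared projected mean gap over w^T a w."""
    gap = w[0] * (m1[0] - m0[0]) + w[1] * (m1[1] - m0[1])
    within = sum(w[i] * a[i][j] * w[j] for i in range(2) for j in range(2))
    return gap * gap / within


def fisher_direction(class0, class1, reg=1e-3):
    """Return (w, m0, m1, a) with a = S_w + reg*I and w = a^{-1} (m1 - m0).

    `reg` is a hypothetical ridge coefficient standing in for the
    regularisation coefficient; it must be positive or S_w invertible.
    """
    m0, m1 = mean(class0), mean(class1)
    s0, s1 = scatter(class0, m0), scatter(class1, m1)
    a = [[s0[i][j] + s1[i][j] + (reg if i == j else 0.0) for j in range(2)]
         for i in range(2)]
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    # Closed-form inverse of a 2x2 matrix applied to the mean difference.
    w = [(a[1][1] * diff[0] - a[0][1] * diff[1]) / det,
         (a[0][0] * diff[1] - a[1][0] * diff[0]) / det]
    return w, m0, m1, a


if __name__ == "__main__":
    random.seed(0)
    # Synthetic data, for illustration only.
    c0 = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(100)]
    c1 = [(random.gauss(2, 1), random.gauss(1, 1)) for _ in range(100)]
    w, m0, m1, a = fisher_direction(c0, c1, reg=0.1)
    print("direction:", w)
    print("Rayleigh coefficient:", rayleigh(w, m0, m1, a))
```

In this sketch `reg` is fixed by hand. The abstract's claim is that, in the Bayesian formulation, the regularisation coefficient and the kernel parameters can instead be chosen by optimising the marginal log-likelihood of the data.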
