Bayes Optimal Hyperplanes → Maximal Margin Hyperplanes

Maximal margin classifiers are a core technology in modern machine learning. They have strong theoretical justifications and have shown empirical success. We provide an alternative justification for maximal margin hyperplane classifiers by relating them to Bayes optimal classifiers that use Parzen window estimation with Gaussian kernels. For any value of the smoothing parameter (the width of the Gaussian kernels), the Bayes optimal classifier defines a density over the space of instances. We define the Bayes optimal hyperplane to be the hyperplane decision boundary that gives the lowest probability of classification error relative to this density. We show that, for linearly separable data, as we reduce the smoothing parameter to zero, a hyperplane is the Bayes optimal hyperplane if and only if it is the maximal margin hyperplane. We also analyze the behavior of the Bayes optimal hyperplane for non-linearly-separable data, showing that it has a very natural form. We explore the idea of using the hyperplane that is optimal relative to a density with some small non-zero kernel width, and present some promising preliminary results.
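The convergence claimed in the abstract can be illustrated numerically. The sketch below (not from the paper; the data points and grid are hypothetical) builds a Gaussian Parzen-window density for each of two 1-D classes and locates the point where the class densities cross, i.e. the Bayes optimal decision boundary for that kernel width. As the smoothing parameter shrinks, the crossing point approaches the midpoint between the closest opposite-class samples, which in one dimension is exactly the maximal-margin boundary.

```python
import numpy as np

def parzen_log_density(x, samples, sigma):
    """Log of the Gaussian Parzen-window density estimate at points x:
    p(x) = (1/n) * sum_i N(x; x_i, sigma^2)."""
    d2 = (x[:, None] - samples[None, :]) ** 2
    logs = -d2 / (2.0 * sigma**2)
    # log-sum-exp for numerical stability at small sigma
    m = logs.max(axis=1, keepdims=True)
    return (m[:, 0] + np.log(np.exp(logs - m).sum(axis=1))
            - np.log(len(samples)) - 0.5 * np.log(2.0 * np.pi * sigma**2))

def bayes_threshold(pos, neg, sigma, grid):
    """Decision boundary: grid point where the two class densities cross
    (equal priors assumed)."""
    diff = parzen_log_density(grid, pos, sigma) - parzen_log_density(grid, neg, sigma)
    return grid[np.argmin(np.abs(diff))]

# Hypothetical linearly separable 1-D data
pos = np.array([2.0, 3.0, 5.0])    # positive class
neg = np.array([-4.0, -1.5, 0.0])  # negative class
grid = np.linspace(0.0, 2.0, 2001)

for sigma in [2.0, 0.5, 0.1, 0.02]:
    print(f"sigma={sigma:5.2f}  boundary={bayes_threshold(pos, neg, sigma, grid):.3f}")
# As sigma -> 0 the boundary approaches 1.0, the midpoint between the
# closest opposite-class points (0.0 and 2.0): the maximal-margin boundary.
# At large sigma the boundary instead sits near the midpoint of the class means.
```

For large sigma the density crossing is governed by the class means (a mean-separating boundary), while for small sigma only the nearest samples of each class matter, recovering the margin-maximizing boundary of the theorem.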
