Restricted Bayes Optimal Classifiers

We introduce the notion of restricted Bayes optimal classifiers. These classifiers attempt to combine the flexibility of the generative approach to classification with the high accuracy associated with discriminative learning. They first construct a model of the joint distribution over class labels and features. Rather than adopting the decision boundary induced directly by the model, they restrict the allowable family of decision boundaries and learn the member that minimizes the probability of misclassification relative to the estimated joint distribution. In this paper, we investigate two particular instantiations of this approach. The first uses a non-parametric density estimator, Parzen windows with Gaussian kernels, together with hyperplane decision boundaries. We show that the resulting classifier is asymptotically equivalent to a maximal margin hyperplane classifier, a highly successful discriminative classifier; this provides an alternative justification for maximal margin hyperplane classifiers. The second instantiation uses a mixture of Gaussians as the estimated density; in experiments on real-world data, we show that this approach allows data with missing values to be handled in a principled manner, leading to improved performance over standard discriminative approaches.
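The first instantiation can be sketched in a few lines. The following is an illustrative toy version, not the paper's implementation: it fits a Parzen-window (Gaussian-kernel) model of the joint distribution on synthetic two-class data, then searches the restricted family of hyperplanes for the one minimizing the misclassification probability under the *estimated* joint, approximated by Monte Carlo sampling from the model. The data, the kernel width `h`, and the random-search optimizer are all assumptions made for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-class training data (illustrative only, not from the paper).
X0 = rng.normal([-1.0, -1.0], 0.5, size=(30, 2))  # class 0 points
X1 = rng.normal([1.0, 1.0], 0.5, size=(30, 2))    # class 1 points
h = 0.5  # Parzen kernel width (an assumed tuning parameter)

def sample_from_model(n):
    """Draw n points from the estimated joint distribution:
    pick a class by its (uniform) prior, pick one of its training
    points, then add Gaussian kernel noise of width h."""
    ys = rng.integers(0, 2, size=n)
    pts = np.empty((n, 2))
    for i, y in enumerate(ys):
        Xc = X0 if y == 0 else X1
        center = Xc[rng.integers(len(Xc))]
        pts[i] = center + rng.normal(0.0, h, size=2)
    return pts, ys

def model_error(w, b, pts, ys):
    """Monte Carlo estimate of the misclassification probability of
    the hyperplane w.x + b = 0 under the estimated joint."""
    pred = (pts @ w + b > 0).astype(int)
    return np.mean(pred != ys)

# Restricted Bayes optimal step: over the hyperplane family, minimize
# the error relative to the *model*, not the raw training sample.
# A crude random search stands in for a proper optimizer here.
pts, ys = sample_from_model(5000)
w_star, b_star = min(
    ((w / np.linalg.norm(w), b)
     for w, b in zip(rng.normal(size=(2000, 2)),
                     rng.uniform(-2.0, 2.0, 2000))),
    key=lambda wb: model_error(wb[0], wb[1], pts, ys),
)
print("estimated model error:", model_error(w_star, b_star, pts, ys))
```

As the kernel width shrinks with growing data, the paper's result says the minimizer of this model error within the hyperplane family approaches the maximal margin hyperplane; the sketch above only illustrates the objective being minimized.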
