Bounds for the Loss in Probability of Correct Classification Under Model-Based Approximation

In many pattern recognition and classification problems, the true class-conditional model and class probabilities are approximated to reduce complexity and/or to ease statistical estimation. The approximate classifier is expected to perform worse than the optimal classifier, where performance is measured here by the probability of correct classification. We present an analysis that is valid in general, together with easily computable formulas for estimating the degradation in probability of correct classification relative to the optimal classifier. One example of such an approximation is the Naive Bayes classifier. We show that the performance of Naive Bayes depends on the degree of functional dependence between the features and the labels, and we also provide a sufficient condition for zero loss of performance.
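To make the comparison concrete, here is a minimal numerical sketch (a hypothetical toy construction, not an example from the paper): for a two-class problem with two binary features whose class-conditional joint distributions are assumed known, it computes the probability of correct classification of the optimal Bayes classifier and of a Naive Bayes classifier built from the product of the class-conditional marginals.

```python
# Toy comparison of the optimal Bayes classifier with a Naive Bayes
# approximation on a small, fully specified discrete distribution.
# All distributions below are hypothetical, chosen so that the features
# are strongly dependent within each class.
import itertools
import numpy as np

# Class-conditional joints P(x1, x2 | c) for two binary features.
p_x_given_c = {
    0: np.array([[0.40, 0.10],
                 [0.10, 0.40]]),   # P(x1, x2 | c = 0)
    1: np.array([[0.10, 0.40],
                 [0.40, 0.10]]),   # P(x1, x2 | c = 1)
}
p_c = {0: 0.6, 1: 0.4}             # class priors

# Naive Bayes approximation: replace each class-conditional joint by the
# product of its marginals, P(x1 | c) * P(x2 | c).
nb_x_given_c = {}
for c, joint in p_x_given_c.items():
    m1 = joint.sum(axis=1)         # marginal P(x1 | c)
    m2 = joint.sum(axis=0)         # marginal P(x2 | c)
    nb_x_given_c[c] = np.outer(m1, m2)

def prob_correct(scores):
    """P(correct) = sum over x of P(x, c_hat(x)), where c_hat(x) maximizes
    the (possibly approximate) score P~(x | c) * P(c); the probability of
    correctness is always evaluated under the TRUE distribution."""
    total = 0.0
    for x1, x2 in itertools.product([0, 1], repeat=2):
        c_hat = max(p_c, key=lambda c: scores[c][x1, x2] * p_c[c])
        total += p_x_given_c[c_hat][x1, x2] * p_c[c_hat]
    return total

print("optimal Bayes:", prob_correct(p_x_given_c))   # 0.80 on this toy model
print("Naive Bayes:  ", prob_correct(nb_x_given_c))  # 0.60 on this toy model
```

In this construction the two features are strongly dependent within each class, so the product-of-marginals approximation discards most of the class information and the probability of correct classification drops from 0.80 to 0.60; conversely, if the class-conditional joints already factorized, the Naive Bayes classifier would coincide with the optimal one and incur zero loss.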
