A finite sample analysis of the Naive Bayes classifier

We revisit, from a statistical learning perspective, the classical decision-theoretic problem of weighted expert voting. In particular, we examine the consistency (both asymptotic and finitary) of the optimal Naive Bayes weighted majority and related rules. In the case of known expert competence levels, we give sharp error estimates for the optimal rule. We derive optimality results for our estimates and also establish some structural characterizations. When the competence levels are unknown, they must be empirically estimated. We provide frequentist and Bayesian analyses for this situation. Some of our proof techniques are non-standard and may be of independent interest. Several challenging open problems are posed, and experimental results are provided to illustrate the theory.
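For the known-competence setting described above, the optimal rule is classically a weighted majority vote with log-odds weights (the Nitzan–Paroush / Naive Bayes rule): each expert with competence p_i receives weight log(p_i / (1 − p_i)). A minimal sketch, assuming binary ±1 votes and competences strictly between 1/2 and 1; the function name and interface are illustrative, not from the paper:

```python
import math

def optimal_weighted_vote(votes, competences):
    """Naive Bayes weighted majority vote with log-odds weights.

    votes: list of +1/-1 expert votes
    competences: list of p_i in (1/2, 1), the probability that
        expert i votes correctly (assumed known and independent)
    Returns +1 or -1, the sign of the weighted vote total.
    """
    score = sum(math.log(p / (1.0 - p)) * v
                for v, p in zip(votes, competences))
    return 1 if score >= 0 else -1
```

Note that a single highly competent expert can outvote a majority of weaker ones: with competences (0.9, 0.6, 0.6), expert 1 carries weight log 9 ≈ 2.20 against a combined log 1.5 + log 1.5 ≈ 0.81 for the other two, so the rule follows expert 1 even when the other two disagree.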
