An empirical study of the naive Bayes classifier

The naive Bayes classifier greatly simplifies learning by assuming that features are independent given the class. Although independence is generally a poor assumption, in practice naive Bayes often competes well with more sophisticated classifiers. Our broad goal is to understand the data characteristics that affect the performance of naive Bayes. Our approach uses Monte Carlo simulations that allow a systematic study of classification accuracy for several classes of randomly generated problems. We analyze the impact of the distribution entropy on the classification error, showing that low-entropy feature distributions yield good performance of naive Bayes. We also demonstrate that naive Bayes works well for certain nearly functional feature dependencies; it thus reaches its best performance in two opposite cases: completely independent features (as expected) and functionally dependent features (which is surprising). Another surprising result is that the accuracy of naive Bayes is not directly correlated with the degree of feature dependence, measured as the class-conditional mutual information between the features. Instead, a better predictor of naive Bayes accuracy is the amount of information about the class that is lost because of the independence assumption.
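For reference, the independence assumption at the heart of naive Bayes, and the class-conditional mutual information mentioned above as a dependence measure, can be written in standard notation (the notation here is ours, not quoted from the paper):

$$P(x_1, \dots, x_n \mid c) = \prod_{i=1}^{n} P(x_i \mid c), \qquad h(\mathbf{x}) = \arg\max_{c} P(c) \prod_{i=1}^{n} P(x_i \mid c),$$

$$I(X_i; X_j \mid C) = \sum_{c} P(c) \sum_{x_i, x_j} P(x_i, x_j \mid c) \log \frac{P(x_i, x_j \mid c)}{P(x_i \mid c)\, P(x_j \mid c)}.$$

The "functionally dependent features" case can be illustrated with a minimal Monte Carlo sketch. This is not the paper's actual simulation code; scikit-learn's CategoricalNB and all problem parameters below are illustrative assumptions. Intuitively, when the second feature is a deterministic (here, bijective) function of the first, naive Bayes merely double-counts the same evidence; with equal class priors this rescales the log-odds without changing their sign, so accuracy is unharmed:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB

rng = np.random.default_rng(0)

def random_problem(n_samples=5000, n_values=4, functional=False):
    """Two-class problem with two discrete features.

    functional=False: both features drawn independently given the class.
    functional=True:  second feature is a deterministic function of the
                      first (an extreme, fully dependent case).
    """
    y = rng.integers(0, 2, size=n_samples)  # roughly equal class priors
    # One random class-conditional distribution over feature values per class.
    tables = rng.dirichlet(np.ones(n_values), size=2)
    x1 = np.array([rng.choice(n_values, p=tables[c]) for c in y])
    if functional:
        x2 = (x1 + 1) % n_values  # bijective function of x1
    else:
        x2 = np.array([rng.choice(n_values, p=tables[c]) for c in y])
    return np.column_stack([x1, x2]), y

for functional in (False, True):
    X, y = random_problem(functional=functional)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    acc = CategoricalNB().fit(X_tr, y_tr).score(X_te, y_te)
    print(f"functionally dependent={functional}: test accuracy={acc:.3f}")
```

Under these assumptions, both settings should yield essentially the same test accuracy, consistent with the abstract's claim that naive Bayes performs well at both extremes of feature dependence.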