Paper Information - On Word Frequency Information and Negative Evidence in Naive Bayes Text Classification

On Word Frequency Information and Negative Evidence in Naive Bayes Text Classification

The Naive Bayes classifier exists in different versions. One version, called the multi-variate Bernoulli or binary independence model, uses binary word occurrence vectors, while the multinomial model uses word frequency counts. Many publications cite this difference as the main reason for the superior performance of the multinomial Naive Bayes classifier. We argue that this is not true. We show that when all word frequency information is eliminated from the document vectors, the multinomial Naive Bayes model performs even better. Moreover, we argue that the main reason for the difference in performance is the way that negative evidence, i.e., evidence from words that do not occur in a document, is incorporated in the model. This paper therefore aims to clarify the difference between the two probabilistic models of Naive Bayes.
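The distinction drawn in the abstract can be illustrated with a minimal sketch of the two class-conditional likelihoods. This is not the author's code: the function names, the `binarize` flag, the toy vocabulary, and the probability values are all assumptions made for illustration, and the multinomial score omits terms that do not depend on the class (the multinomial coefficient and any document-length prior).

```python
import math


def bernoulli_log_likelihood(doc_words, occurrence_probs):
    """log P(d | c) under the multi-variate Bernoulli model.

    occurrence_probs maps every vocabulary word w to
    P(w occurs at least once in a document | c).
    Words absent from the document contribute log(1 - p):
    this is the negative evidence.
    """
    present = set(doc_words)
    total = 0.0
    for word, p in occurrence_probs.items():
        total += math.log(p) if word in present else math.log(1.0 - p)
    return total


def multinomial_log_likelihood(doc_words, word_probs, binarize=False):
    """log P(d | c) under the multinomial model, up to class-independent terms.

    word_probs maps every vocabulary word w to P(w | c), summing to 1.
    Only words that occur in the document contribute. With binarize=True
    every count is clipped to 1, which removes word frequency information
    (an assumed reading of the transformation described in the abstract).
    """
    words = set(doc_words) if binarize else doc_words
    return sum(math.log(word_probs[w]) for w in words)


# Hypothetical toy parameters for a single class.
occurrence_probs = {"ball": 0.8, "game": 0.6, "vote": 0.1}
word_probs = {"ball": 0.5, "game": 0.4, "vote": 0.1}
doc = ["ball", "ball", "game"]

bernoulli_score = bernoulli_log_likelihood(doc, occurrence_probs)
multinomial_score = multinomial_log_likelihood(doc, word_probs)
binarized_score = multinomial_log_likelihood(doc, word_probs, binarize=True)
```

In this toy document the word "vote" is absent. The Bernoulli score includes a `log(1 - 0.1)` term for it, while neither multinomial score has any term for absent words; the binarized variant differs from the plain multinomial score only in counting "ball" once instead of twice.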

Karl-Michael Schneider
