A Comparison of Event Models for Naive Bayes Text Classification

Recent work in text classification has used two different first-order probabilistic models for classification, both of which make the naive Bayes assumption. Some use a multi-variate Bernoulli model, that is, a Bayesian network with no dependencies between words and binary word features (e.g. Larkey and Croft 1996; Koller and Sahami 1997). Others use a multinomial model, that is, a unigram language model with integer word counts (e.g. Lewis and Gale 1994; Mitchell 1997). This paper aims to clarify the confusion between the two by describing their differences and details, and by empirically comparing their classification performance on five text corpora. We find that the multi-variate Bernoulli model performs well with small vocabulary sizes, but that the multinomial model usually performs even better at larger vocabulary sizes, providing on average a 27% reduction in error over the multi-variate Bernoulli model at any vocabulary size.
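To make the distinction between the two event models concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes Laplace (add-one) smoothing for both models and drops the document-length and multinomial-coefficient terms from the multinomial score, on the assumption that they do not depend on the class. The function names, the toy documents, and the vocabulary are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def train(docs, labels, vocab):
    """Estimate class priors and per-class word parameters for both event models."""
    classes = sorted(set(labels))
    n_docs = Counter(labels)
    prior = {c: n_docs[c] / len(docs) for c in classes}
    doc_freq = {c: Counter() for c in classes}   # number of documents containing a word
    term_freq = {c: Counter() for c in classes}  # total occurrences of a word
    for words, c in zip(docs, labels):
        in_vocab = [w for w in words if w in vocab]
        term_freq[c].update(in_vocab)
        doc_freq[c].update(set(in_vocab))
    # Laplace smoothing in both cases (an assumption of this sketch).
    bern = {c: {w: (1 + doc_freq[c][w]) / (2 + n_docs[c]) for w in vocab}
            for c in classes}
    mult = {c: {w: (1 + term_freq[c][w]) / (len(vocab) + sum(term_freq[c].values()))
                for w in vocab}
            for c in classes}
    return prior, bern, mult


def bernoulli_log_score(words, c, prior, bern, vocab):
    """Multi-variate Bernoulli: every vocabulary word contributes, present or absent."""
    present = set(words)
    score = math.log(prior[c])
    for w in vocab:
        p = bern[c][w]
        score += math.log(p if w in present else 1.0 - p)
    return score


def multinomial_log_score(words, c, prior, mult, vocab):
    """Multinomial: each word occurrence contributes; absent words contribute nothing."""
    score = math.log(prior[c])
    for w in words:
        if w in vocab:
            score += math.log(mult[c][w])
    return score


# Hypothetical toy corpus, for illustration only.
docs = [["ball", "ball", "goal"], ["goal", "team"],
        ["vote", "law"], ["law", "law", "vote"]]
labels = ["sports", "sports", "politics", "politics"]
vocab = {"ball", "goal", "team", "vote", "law"}
prior, bern, mult = train(docs, labels, vocab)

for c in sorted(prior):
    print(c,
          bernoulli_log_score(["ball", "ball"], c, prior, bern, vocab),
          multinomial_log_score(["ball", "ball"], c, prior, mult, vocab))
```

In this sketch the Bernoulli score is the same for `["ball"]` and `["ball", "ball"]`, because only presence or absence is recorded, whereas the multinomial score changes with each repeated occurrence. That difference in what counts as an "event" is the contrast the abstract draws.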

[1]  Stephen E. Robertson,et al.  Relevance weighting of search terms , 1976, J. Am. Soc. Inf. Sci..

[2]  G. F. Ames Bacterial periplasmic transport systems: structure, mechanism, and evolution. , 1986, Annual review of biochemistry.

[3]  F. Opperdoes Compartmentation of carbohydrate metabolism in trypanosomes. , 1987, Annual review of microbiology.

[4]  I. Pastan,et al.  Molecular manipulations of the multidrug transporter: a new role for transgenic mice , 1991, FASEB journal : official publication of the Federation of American Societies for Experimental Biology.

[5]  D. Dubnau The regulation of genetic competence in Bacillus subtilis , 1991, Molecular microbiology.

[6]  Thomas M. Cover,et al.  Elements of Information Theory , 2005 .

[7]  M. Mergeay,et al.  Towards an understanding of the genetics of bacterial metal resistance. , 1991, Trends in biotechnology.

[8]  P. Langley,et al.  An Analysis of Bayesian Classifiers , 1992, AAAI.

[9]  Pat Langley,et al.  An Analysis of Bayesian Classifiers , 1992, AAAI.

[10]  David D. Lewis,et al.  An evaluation of phrasal and clustered representations on a text categorization task , 1992, SIGIR '92.

[11]  I. Pastan,et al.  Biochemistry of multidrug resistance mediated by the multidrug transporter. , 1993, Annual review of biochemistry.

[12]  Louise Guthrie,et al.  Document Classification By Machine: Theory and Practice , 1994, COLING.

[13]  B. Dreiseikelmann Translocation of DNA across bacterial membranes. , 1994, Microbiological reviews.

[14]  William A. Gale,et al.  A sequential algorithm for training text classifiers , 1994, SIGIR '94.

[15]  M. G. Lorenz,et al.  Bacterial gene transfer by natural genetic transformation in the environment. , 1994, Microbiological reviews.

[16]  R. Palmen,et al.  Bioenergetic aspects of the translocation of macromolecules across bacterial membranes. , 1994, Biochimica et biophysica acta.

[17]  B. Barrell,et al.  Life with 6000 Genes , 1996, Science.

[18]  Mehran Sahami,et al.  Learning Limited Dependence Bayesian Classifiers , 1996, KDD.

[19]  W. Bruce Croft,et al.  Combining classifiers in text categorization , 1996, SIGIR '96.

[20]  W. B. Croft Combining Classifiers in Text Categorization , 1996 .

[21]  R. Prasad,et al.  MULTIDRUG RESISTANCE : AN EMERGING THREAT , 1996 .

[22]  Thomas Kalt,et al.  A New Probabilistic Model of Text Classification and Retrieval , 1998 .

[23]  Thorsten Joachims,et al.  A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization , 1997, ICML.

[24]  Hang Li,et al.  Document Classification Using a Finite Mixture Model , 1997, ACL.

[25]  Daphne Koller,et al.  Hierarchically Classifying Documents Using Very Few Words , 1997, ICML.

[26]  Prasad Tadepalli,et al.  Active Learning with Committees for Text Categorization , 1997, AAAI/IAAI.

[27]  Thorsten Joachims,et al.  Text Categorization with Support Vector Machines: Learning with Many Relevant Features , 1998, ECML.

[28]  David D. Lewis,et al.  Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval , 1998, ECML.

[29]  Leah S. Larkey,et al.  Some Issues in the Automatic Classification of U.S. Patents Working Notes for the AAAI-98 Workshop on Learning for Text Categorization , 1998 .

[30]  Sebastian Thrun,et al.  Learning to Classify Text from Labeled and Unlabeled Documents , 1998, AAAI/IAAI.

[31]  Tom M. Mitchell,et al.  Learning to Extract Symbolic Knowledge from the World Wide Web , 1998, AAAI/IAAI.

[32]  Susan T. Dumais,et al.  A Bayesian Approach to Filtering Junk E-Mail , 1998, AAAI 1998.

[33]  Tom M. Mitchell,et al.  Improving Text Classification by Shrinkage in a Hierarchy of Classes , 1998, ICML.