A new term-weighting scheme for naïve Bayes text categorization

Purpose – Automatic text categorization has applications in several domains, including e‐mail spam detection, sexual content filtering, directory maintenance, and focused crawling. Most information retrieval systems contain several components that use text categorization methods. One of the first text categorization methods was designed using a naive Bayes representation of the text, and a number of variations of naive Bayes have since been discussed. The purpose of this paper is to evaluate naive Bayes approaches to text categorization and to introduce new, competitive extensions to previous approaches.

Design/methodology/approach – The paper focuses on introducing a new Bayesian text categorization method based on an extension of the naive Bayes approach. Some modifications to document representations are introduced, based on the well‐known BM25 text information retrieval method. The performance of the method is compared to several extensions of naive Bayes using benchmark datasets designed fo...
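The abstract does not give the paper's actual weighting formula, so the following is only a minimal sketch of the general idea of combining BM25 with naive Bayes: raw term counts in a multinomial naive Bayes classifier are replaced by the standard BM25 saturated, length-normalized term frequency. The function names (`bm25_tf`, `train`, `predict`), the parameter defaults `k1=1.2` and `b=0.75`, and the smoothing constant `alpha` are illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def bm25_tf(tf, doc_len, avg_len, k1=1.2, b=0.75):
    """Standard BM25 term-frequency component (saturating, length-normalized).

    k1 and b are common default values, assumed here for illustration.
    """
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))


def train(docs, labels, alpha=1.0):
    """Fit a multinomial-style naive Bayes model on BM25-weighted counts.

    docs is a list of token lists; labels is a parallel list of class names.
    alpha is an additive (Laplace-style) smoothing constant.
    """
    avg_len = sum(len(d) for d in docs) / len(docs)
    vocab = {t for d in docs for t in d}
    weight = defaultdict(Counter)  # class -> term -> summed BM25 weight
    prior = Counter(labels)
    for d, y in zip(docs, labels):
        for t, tf in Counter(d).items():
            weight[y][t] += bm25_tf(tf, len(d), avg_len)
    model = {}
    for y in prior:
        total = sum(weight[y].values()) + alpha * len(vocab)
        model[y] = {
            "log_prior": math.log(prior[y] / len(docs)),
            "log_like": {
                t: math.log((weight[y][t] + alpha) / total) for t in vocab
            },
        }
    return model, avg_len


def predict(model, avg_len, doc):
    """Return the class with the highest BM25-weighted log-posterior score."""
    counts = Counter(doc)
    best, best_score = None, -math.inf
    for y, m in model.items():
        score = m["log_prior"]
        for t, tf in counts.items():
            if t in m["log_like"]:  # terms unseen in training are ignored
                score += bm25_tf(tf, len(doc), avg_len) * m["log_like"][t]
        if score > best_score:
            best, best_score = y, score
    return best
```

With this weighting, a term's contribution grows sublinearly with its frequency and is bounded above by `k1 + 1`, and long documents are discounted relative to the average length. Whether the paper applies the weighting at training time, at prediction time, or both is not stated in the abstract; the sketch applies it in both places as an assumption.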
