Automatically computed document dependent weighting factor facility for Naïve Bayes classification

The Naive Bayes classification approach is widely used in real-world applications because of its simplicity and the low cost of its training and classification algorithms. As a trade-off for this simplicity, however, Naive Bayes has been reported to be among the poorest-performing classification methods. We investigated the Naive Bayes approach and found that one cause of its low classification accuracy is the misclassification of documents into a few "popular" categories when the training dataset is poorly organized, i.e., when the distribution of training documents among categories is highly skewed. In this work, we propose a solution to this problem: the addition of an Automatically Computed Document Dependent (ACDD) weighting factor facility to the Naive Bayes classifier. The ACDD weighting factors are computed to improve classification performance by adjusting probability values based on the density of classified documents in each available category, thereby minimizing the misclassification rate.
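The abstract does not give the ACDD formula itself, but the idea of damping the pull of densely populated categories can be sketched on top of a standard multinomial Naive Bayes classifier. The sketch below is illustrative only: `acdd_weights` is a hypothetical weighting (inverse class density) standing in for the paper's actual computation, and all function names are this sketch's own.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label) pairs. Returns class counts,
    per-class word counts, and the vocabulary."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def acdd_weights(class_counts):
    # Hypothetical stand-in for the ACDD computation: weight each
    # class by the inverse of its document density, so a "popular"
    # class cannot dominate purely through its large prior.
    total = sum(class_counts.values())
    k = len(class_counts)
    return {c: total / (k * n) for c, n in class_counts.items()}

def classify(tokens, class_counts, word_counts, vocab, weights):
    """Multinomial NB with Laplace smoothing; the log-prior of each
    class is scaled by its weighting factor before scoring."""
    total = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for c, n_c in class_counts.items():
        score = math.log(weights[c] * n_c / total)  # weighted log-prior
        denom = sum(word_counts[c].values()) + len(vocab)
        for t in tokens:
            score += math.log((word_counts[c][t] + 1) / denom)
        if score > best_score:
            best, best_score = c, score
    return best
```

With a skewed toy corpus (four documents in class "a", one in class "b"), the inverse-density weights let the minority class win on its own keyword instead of being absorbed into the majority class.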
