An Improved Document Classification Approach with Maximum Entropy and Entropy Feature Selection

Document classification is an important task in the field of document management. Existing approaches each have a weakness: Bayesian models rely on the feature independence assumption, artificial neural networks suffer from overfitting, and support vector machines (SVMs) do not handle real-valued features well. This paper proposes combining entropy-based measures with machine learning techniques for document classification. First, cross entropy and average mutual information are used to extract effective features. Second, a support vector machine and a maximum entropy (ME) model are each applied to perform classification in the resulting feature space. Furthermore, an improved feature representation is presented that replaces binary features with real-valued ones, since prior knowledge about each word is helpful for document classification. Finally, we compare our method with traditional methods; in our experiments it improved the F-measure by 2.78% over the basic ME model and by 0.95% over a naive Bayes model smoothed with the Good-Turing algorithm.
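The feature selection step can be illustrated with a minimal sketch. It assumes that "average mutual information" means the standard mutual information between a term's presence in a document and the class label; the abstract does not give the paper's exact formulation, so this may differ from it. The function names, the set-of-tokens document representation, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def average_mutual_information(docs, labels, term):
    """Mutual information (in nats) between presence of `term` and the class.

    `docs` is a list of token sets, `labels` the matching class labels.
    Assumed formulation: I(T; C) = sum_{t, c} P(t, c) * log(P(t, c) / (P(t) P(c))).
    """
    n = len(docs)
    joint = Counter((term in doc, label) for doc, label in zip(docs, labels))
    term_counts = Counter(term in doc for doc in docs)
    class_counts = Counter(labels)

    score = 0.0
    # Only observed (presence, class) pairs contribute; unseen pairs have P = 0.
    for (present, label), count in joint.items():
        p_joint = count / n
        p_term = term_counts[present] / n
        p_class = class_counts[label] / n
        score += p_joint * math.log(p_joint / (p_term * p_class))
    return score


def select_features(docs, labels, k):
    """Return the k terms with the highest score (ties broken alphabetically)."""
    vocab = set().union(*docs)
    scores = {t: average_mutual_information(docs, labels, t) for t in vocab}
    return sorted(vocab, key=lambda t: (-scores[t], t))[:k]


# Hypothetical toy corpus: each document is a set of tokens.
docs = [{"goal", "match"}, {"goal", "team"}, {"vote", "party"}, {"vote", "match"}]
labels = ["sport", "sport", "politics", "politics"]
top_terms = select_features(docs, labels, 2)
```

In this toy corpus, "goal" and "vote" each occur in exactly one class and so receive the highest score, while "match" occurs equally in both classes and scores zero. The selected terms would then define the feature space passed to the classifier.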
