An Effective Dimension Reduction Approach to Chinese Document Classification Using Genetic Algorithm

Many methods have been proposed for Chinese document classification, but the high dimensionality of the feature vector remains one of their most significant limitations. This paper points out an important difference between Chinese and English document classification, and then proposes an efficient approach that uses a Genetic Algorithm to reduce the dimension of the feature vector in Chinese document classification. By selecting only the subset of more "important" features, the proposed method significantly reduces the number of Chinese feature words. Experiments that combine the method with several related studies show that it achieves substantial dimension reduction with little loss in the rate of correctly classified documents.
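The abstract does not give the chromosome encoding, fitness function, or genetic operators, so the following is only a minimal sketch of the general idea of genetic-algorithm feature selection, not the authors' method. It assumes a binary mask over feature words as the chromosome, a nearest-centroid classifier as a stand-in for the real classifier, and a fitness that trades training accuracy against the number of retained features. All names (`centroid_accuracy`, `fitness`, `select_features`), the toy data, and the parameter values are hypothetical.

```python
import random

def centroid_accuracy(docs, labels, mask):
    """Training accuracy of a nearest-centroid classifier restricted to masked features."""
    idx = [i for i, bit in enumerate(mask) if bit]
    if not idx:
        return 0.0  # no features selected: treat as useless
    sums, counts = {}, {}
    for vec, lab in zip(docs, labels):
        acc = sums.setdefault(lab, [0.0] * len(idx))
        for k, i in enumerate(idx):
            acc[k] += vec[i]
        counts[lab] = counts.get(lab, 0) + 1
    cents = {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}
    correct = 0
    for vec, lab in zip(docs, labels):
        # Predict the class whose centroid is closest in squared Euclidean distance.
        pred = min(cents, key=lambda c: sum((vec[i] - cents[c][k]) ** 2
                                            for k, i in enumerate(idx)))
        correct += pred == lab
    return correct / len(docs)

def fitness(docs, labels, mask, penalty=0.01):
    """Assumed fitness: accuracy minus a small cost per retained feature."""
    return centroid_accuracy(docs, labels, mask) - penalty * sum(mask)

def select_features(docs, labels, pop_size=20, generations=30,
                    mutation_rate=0.05, seed=0):
    rng = random.Random(seed)
    n = len(docs[0])
    score = lambda m: fitness(docs, labels, m)
    # Each chromosome is a 0/1 mask: 1 keeps the feature word, 0 drops it.
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=score, reverse=True)
        parents = pop[: pop_size // 2]  # truncation selection; the best masks survive
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)   # one-point crossover
            child = a[:cut] + b[cut:]
            # Bit-flip mutation with probability mutation_rate per gene.
            child = [bit ^ (rng.random() < mutation_rate) for bit in child]
            children.append(child)
        pop = parents + children
    return max(pop, key=score)

# Toy term-frequency vectors (hypothetical): only features 0 and 1 separate the classes.
docs = [[3, 0, 1, 1, 0, 2], [2, 0, 1, 0, 1, 2],
        [0, 3, 1, 1, 0, 2], [0, 2, 1, 0, 1, 2]]
labels = ["sports", "sports", "finance", "finance"]

mask = select_features(docs, labels)
print("selected feature mask:", mask)
print("accuracy with selected features:", centroid_accuracy(docs, labels, mask))
```

In a realistic setting the fitness would be measured on held-out documents rather than on the training set, and the feature vector would come from segmented Chinese text; both are outside the scope of this sketch.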
