Learning effective features for Chinese text categorization

Text categorization suffers from high-dimensional feature spaces, which reduce either the efficiency or the accuracy of the learning system. A number of feature selection methods have therefore been adopted or proposed for dimensionality reduction, such as document frequency (DF), information gain (IG), and the chi-square statistic. Unlike these traditional methods, this paper presents a feature selection method based on the idea of "discriminative learning", in which learned "effective" features, rather than traditional "important" features, are used to construct the feature space. To learn the effective features, a variant of the AdaBoost algorithm and a pairwise multiclass learning scheme are adopted. Simulation results show that the presented method works well.
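The abstract does not spell out the AdaBoost variant, so the following is only a minimal sketch of one plausible reading: each weak learner is a stump on a single binary term-presence feature, boosting is run separately for every pair of classes, and the union of the terms picked by the stumps forms the "effective" feature set. The function names `boost_select` and `pairwise_effective_features`, the binary term representation, and the rule that a term is chosen at most once per class pair are all assumptions made for illustration, not details taken from the paper.

```python
import math
from itertools import combinations


def boost_select(X, y, rounds):
    """Run discrete AdaBoost with single-feature stumps (assumed variant).

    X: list of binary feature vectors (1 = term present in the document).
    y: list of labels in {+1, -1}.
    Returns the indices of the features chosen by the stumps.
    """
    n, d = len(X), len(X[0])
    w = [1.0 / n] * n  # uniform initial document weights
    chosen = []
    for _ in range(rounds):
        best = None  # (weighted error, feature index, polarity)
        for j in range(d):
            if j in chosen:  # assumption: each term is picked at most once
                continue
            # Weighted error of the stump "predict +1 if term j is present".
            err = sum(wi for wi, xi, yi in zip(w, X, y)
                      if (1 if xi[j] else -1) != yi)
            # The flipped stump has error 1 - err because weights sum to 1.
            for polarity, e in ((1, err), (-1, 1.0 - err)):
                if best is None or e < best[0]:
                    best = (e, j, polarity)
        if best is None or best[0] >= 0.5:
            break  # no remaining stump is better than chance
        err, j, polarity = best
        chosen.append(j)
        err = max(err, 1e-10)  # avoid division by zero for a perfect stump
        alpha = 0.5 * math.log((1.0 - err) / err)
        # Increase the weight of misclassified documents, then renormalize.
        w = [wi * math.exp(-alpha * yi * polarity * (1 if xi[j] else -1))
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return chosen


def pairwise_effective_features(X, labels, rounds_per_pair):
    """Union of the features selected for every pair of classes."""
    selected = set()
    for a, b in combinations(sorted(set(labels)), 2):
        idx = [i for i, lab in enumerate(labels) if lab in (a, b)]
        X_pair = [X[i] for i in idx]
        y_pair = [1 if labels[i] == a else -1 for i in idx]
        selected.update(boost_select(X_pair, y_pair, rounds_per_pair))
    return sorted(selected)


# Toy corpus (hypothetical): terms 0-2 each mark one class, term 3 is in every document.
X = [[1, 0, 0, 1], [1, 0, 0, 1],
     [0, 1, 0, 1], [0, 1, 0, 1],
     [0, 0, 1, 1], [0, 0, 1, 1]]
labels = ["sports", "sports", "finance", "finance", "health", "health"]
effective = pairwise_effective_features(X, labels, rounds_per_pair=2)
print(effective)
```

In this sketch a term is kept because some stump needed it to separate two classes, not because it scores highly on a global statistic such as DF or IG, which is one way to read the distinction between "effective" and "important" features.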
