N-gram-based Feature Selection and Text Representation for Chinese Text Classification

In this paper, text representation and feature selection strategies for Chinese text classification based on character n-grams are discussed. A two-step feature selection strategy is proposed that combines preprocessing within classes with feature selection among classes. Four feature selection methods and three text-representation weights are compared through exhaustive experiments. Both a C-SVC classifier and a Naive Bayes classifier are used to assess the results. All experiments are performed on the Chinese corpus TanCorpV1.0, which contains more than 14,000 texts divided into 12 classes. Our experiments concern: (1) a performance comparison among the four feature selection strategies: absolute text frequency, relative text frequency, absolute n-gram frequency, and relative n-gram frequency; (2) a comparison of the sparseness and feature correlation of the “text by feature” matrices produced by the four feature selection methods; (3) a performance comparison among the three term weights: 0/1 logical value, n-gr...
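To make the frequency-based selection strategies concrete, the following is a minimal sketch (not the authors' implementation) of ranking character n-grams per class by text frequency, i.e. the number of documents in a class that contain the n-gram. The `relative` flag switches between the absolute count and the count normalized by class size; the toy corpus and function names are illustrative assumptions.

```python
from collections import Counter

def char_ngrams(text, n=2):
    """Extract overlapping character n-grams from a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def select_features(docs_by_class, n=2, top_k=3, relative=True):
    """For each class, rank n-grams by text frequency (number of
    documents in the class containing the n-gram).  With relative=True
    the count is divided by the class size (relative text frequency);
    otherwise the absolute count is used."""
    selected = {}
    for label, docs in docs_by_class.items():
        df = Counter()
        for doc in docs:
            # set() so each document counts an n-gram at most once
            df.update(set(char_ngrams(doc, n)))
        if relative:
            scores = {g: c / len(docs) for g, c in df.items()}
        else:
            scores = dict(df)
        ranked = sorted(scores.items(), key=lambda kv: -kv[1])
        selected[label] = [g for g, _ in ranked[:top_k]]
    return selected

# Hypothetical two-class toy corpus for illustration.
docs = {
    "sport": ["the team won the game", "the team lost the game"],
    "tech":  ["the new chip is fast", "the chip runs hot"],
}
feats = select_features(docs, n=2, top_k=3, relative=False)
```

Swapping `set(char_ngrams(...))` for the raw list would instead count n-gram occurrences, giving the absolute/relative *n-gram frequency* variants the abstract contrasts with text frequency.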
