Complete Gini-Index Text (GIT) feature-selection algorithm for text classification

The recently introduced Gini-Index Text (GIT) feature-selection algorithm for text classification incorporates an improved Gini Index to achieve better feature-selection performance, but it has some drawbacks. Specifically, under real-world experimental conditions, the algorithm concentrates feature values at a single point and is inadequate for selecting representative features. As a result, good representative features cannot be estimated, and good performance cannot be achieved in unbalanced text classification. We therefore propose a new, complete GIT feature-selection algorithm for text classification. According to experimental results, the new algorithm obtained unbiased feature values and eliminated many irrelevant and redundant features from feature subsets while retaining many representative features. Furthermore, compared with the original version, the new algorithm demonstrated notably improved overall classification performance.
