Using Gini-Index for Feature Selection in Text Categorization

With the rapid development of the World Wide Web, text categorization has come to play an important role in organizing and processing large amounts of text data. The first and major problem in text categorization is how to select the best feature subset from the original high-dimensional feature space, in order to reduce dimensionality and improve classification performance. We use an improved Gini-Index for text feature selection, constructing a feature selection measure function based on it. We compare this measure with four other feature selection measures, using two kinds of classifiers on two different document corpora. The experimental results show that its performance is comparable to that of the other text feature selection approaches, while it compares favorably in the time complexity of the algorithm.

Index Terms - text categorization, feature selection, Gini-Index, feature selection function.
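The abstract does not state the improved measure function itself, so the following is only an illustrative sketch of how a Gini-Index-style score could rank terms. It assumes one formulation that appears in the Gini-Index feature selection literature, Gini(t) = sum over classes c of P(t|c)^2 * P(c|t)^2, estimated from document frequencies. The function names `gini_scores` and `select_top_k` and the toy corpus are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict

def gini_scores(docs):
    """Score each term with an assumed Gini-Index-style measure.

    docs: list of (tokens, label) pairs.
    Returns {term: sum_c P(term|c)^2 * P(c|term)^2}, using document frequencies.
    """
    class_sizes = Counter(label for _, label in docs)
    # doc_freq[term][label] = number of documents of that label containing the term
    doc_freq = defaultdict(Counter)
    for tokens, label in docs:
        for term in set(tokens):
            doc_freq[term][label] += 1

    scores = {}
    for term, per_class in doc_freq.items():
        total = sum(per_class.values())  # documents containing the term
        scores[term] = sum(
            (n / class_sizes[c]) ** 2 * (n / total) ** 2
            for c, n in per_class.items()
        )
    return scores

def select_top_k(docs, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    scores = gini_scores(docs)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

# Toy corpus (hypothetical data, for illustration only).
docs = [
    (["goal", "match", "the"], "sport"),
    (["goal", "team", "the"], "sport"),
    (["vote", "election", "the"], "politics"),
    (["vote", "party", "the"], "politics"),
]
print(select_top_k(docs, 2))  # ['goal', 'vote']
```

Under this assumed formulation, a term that occurs in every document of exactly one class scores 1.0, while a term spread evenly over both classes (such as "the" here) scores lower. The score needs only one pass over the corpus to collect document frequencies.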
