A shortcoming of information gain is that it also takes into account the case in which a term does not appear. In this paper, by analyzing the distribution information of terms, we find that as a term's Distribution Information inside a Class grows, the distribution of the term tends toward imbalance, and that when a term is highly imbalanced, its Distribution Information among Classes tends toward a smaller value. We therefore introduce the Distribution Information inside a Class and the Distribution Information among Classes to reduce the effect of term absence and to improve traditional information gain. Experiments show that the improved algorithm (GDI) outperforms traditional feature selection algorithms, such as Information Gain, in some fields.
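For orientation, the sketch below computes the traditional information gain of a single term from per-class document counts and returns the term-present and term-absent contributions separately, so the influence of term absence is visible. It follows the standard textbook formulation of information gain for feature selection, not the GDI method, whose formulas are not given here. The function names and the input layout (`present`, `absent`) are assumptions made for illustration.

```python
import math


def _plogp_sum(counts):
    """Return sum(p * log2(p)) over the distribution given by counts (0 if empty)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum((n / total) * math.log2(n / total) for n in counts if n > 0)


def information_gain(present, absent):
    """Traditional information gain of one term (illustrative helper).

    present[i]: number of documents in class i that contain the term.
    absent[i]:  number of documents in class i that do not contain it.
    Returns (ig, present_part, absent_part), where
    ig = H(C) + present_part + absent_part.
    """
    n_present = sum(present)
    n_absent = sum(absent)
    n = n_present + n_absent

    # Class entropy H(C) = -sum_c P(c) log2 P(c)
    class_totals = [p + a for p, a in zip(present, absent)]
    entropy = -_plogp_sum(class_totals)

    # P(t) * sum_c P(c|t) log2 P(c|t)
    present_part = (n_present / n) * _plogp_sum(present)
    # P(not t) * sum_c P(c|not t) log2 P(c|not t): the term-absent contribution
    absent_part = (n_absent / n) * _plogp_sum(absent)

    return entropy + present_part + absent_part, present_part, absent_part


if __name__ == "__main__":
    # Two classes of 10 documents each; the term occurs only in class 0.
    ig, present_part, absent_part = information_gain([10, 0], [0, 10])
    print(ig, present_part, absent_part)
```

Splitting the score into the two contributions makes it easy to check, for a given corpus, how much of a term's score comes from documents where the term is missing.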