Term-frequency Based Feature Selection Methods for Text Categorization

A major difficulty of text categorization is the high dimensionality of the feature space. Feature selection is an important step in text categorization because it reduces this dimensionality. Automatic feature selection methods such as document frequency thresholding (DF), information gain (IG), and mutual information (MI) are commonly applied in text categorization, but they do not use term frequency information. In this paper, we propose improved DF, IG and MI methods that use term frequency information. Experiments show that the improved methods perform notably better than the original DF, IG and MI methods.
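
To make the distinction between document frequency and term frequency concrete, the following minimal sketch contrasts the two counts on a toy corpus. It is an illustration only and is not the improved DF method proposed here, whose definition is not given in this abstract; the function names, the thresholds, and the corpus are all hypothetical.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # each document contributes at most 1 per term
    return df


def collection_term_frequency(docs):
    """Count, for each term, its total number of occurrences in the corpus."""
    tf = Counter()
    for tokens in docs:
        tf.update(tokens)  # repeated occurrences within a document all count
    return tf


def select_by_threshold(scores, min_score):
    """Keep the terms whose score reaches the threshold (hypothetical helper)."""
    return {term for term, score in scores.items() if score >= min_score}


# Toy corpus (assumption): three already-tokenized documents.
docs = [
    ["price", "stock", "stock", "stock", "market"],
    ["stock", "market", "game"],
    ["game", "team", "game", "game", "score"],
]

df = document_frequency(docs)
tf = collection_term_frequency(docs)

# "stock", "market" and "game" each occur in two documents, so plain DF
# cannot separate them; the occurrence counts can.
print(select_by_threshold(df, 2))
print(select_by_threshold(tf, 3))
```

In this toy corpus the three most widespread terms tie under document frequency, while the occurrence counts single out the terms that are used repeatedly. That repetition is the kind of information a document-level count discards.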