An Algorithm of Feature Selection and Feature Weighting Adjustment Based on Chinese FrameNet

Feature selection that combines term frequency (TF) and document frequency (DF), together with TF-IDF feature weighting, is frequently used in text categorization. For a small training set, however, the combination of TF and DF filters out many low-frequency words that have strong discriminative power, which directly affects the feature weights. This paper presents an algorithm of feature selection and feature weighting adjustment based on Chinese FrameNet (CFN) that aims to solve this problem. The experimental results indicate that its precision exceeds that of the traditional algorithm, reaching 67.3%, and that it suits small training sets well.

Feature selection and feature weighting adjustment are research focuses of statistics-based text categorization. Mature feature selection methods include term frequency (TF), document frequency (DF), feature entropy (TE), mutual information (MI), information gain (IG), etc. (1); common functions for feature weighting adjustment include the TF-IDF function, the Boolean function, the square root function, the logarithmic function, etc. (2). These feature selection methods usually require a large-scale training set and end up building a high-dimensional feature vector, which seriously slows the subsequent categorization process. At the same time, building a large-scale training set requires a lot of manual labor. Therefore, devising a simple yet more effective algorithm of feature selection and feature weighting adjustment for small-scale training sets has practical significance. The combination of TF and DF for feature selection, with TF-IDF for feature weighting, is frequently used in text categorization systems. Practice has shown that, with a larger training set, this approach achieves good results.
However, our research found that when the training set is small, the results are less satisfactory. The reason is that, in a smaller training set, some words with strong discriminative power are filtered out because of their low frequency, or receive such low weights that they have little effect on categorization. Accordingly, this paper presents an improved algorithm of feature selection and feature weighting adjustment. Based on CFN, the algorithm improves on the TF and DF feature selection and the TF-IDF feature weighting described above. Finally, we evaluate the effectiveness of the algorithm.
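
The baseline and its weakness on small training sets can be illustrated with a minimal sketch. It is not the CFN-based algorithm of this paper, and several details are assumptions made only for illustration: the names `select_features` and `tfidf_weights`, the thresholds `min_tf` and `min_df`, the joint thresholding rule used to combine TF and DF, the `tf * log(N / df)` form of TF-IDF, and the toy corpus.

```python
import math
from collections import Counter


def select_features(docs, min_tf=2, min_df=2):
    """Keep terms whose corpus TF and DF both reach the thresholds.

    The joint thresholding rule and the default values are assumptions.
    """
    tf = Counter()  # total occurrences of each term in the corpus
    df = Counter()  # number of documents containing each term
    for tokens in docs:
        tf.update(tokens)
        df.update(set(tokens))
    return {t for t in tf if tf[t] >= min_tf and df[t] >= min_df}


def tfidf_weights(docs, vocabulary):
    """Weight each selected term by tf * log(N / df), one common TF-IDF form."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens) & vocabulary)
    vectors = []
    for tokens in docs:
        counts = Counter(t for t in tokens if t in vocabulary)
        vectors.append(
            {t: c * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vectors


# Hypothetical toy corpus: two sports documents, two finance documents.
docs = [
    ["match", "team", "goal", "team"],
    ["team", "match", "score"],
    ["market", "stock", "price", "market"],
    ["market", "price", "dividend"],
]

selected = select_features(docs)
vectors = tfidf_weights(docs, selected)

# "goal" and "dividend" each occur once, so they are filtered out
# even though each appears in only one category.
print(sorted(selected))
print(vectors[0])
```

With only four documents, words such as "goal" and "dividend" fall below both thresholds and never reach the weighting step, which is the kind of loss the improved algorithm is meant to address.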

[1] Yiming Yang, et al. A Comparative Study on Feature Selection in Text Categorization, 1997, ICML.

[2] Dianhui Wang, et al. A data mining approach for fuzzy classification rule generation, 2001, Proceedings Joint 9th IFSA World Congress and 20th NAFIPS International Conference (Cat. No. 01TH8569).

[3] Tao Liu, et al. Chinese FrameNet and OWL Representation, 2007, Sixth International Conference on Advanced Language Processing and Web Information Technology (ALPIT 2007).