Improvement and Application of TF•IDF Method Based on Text Classification

Feature extraction is an important prerequisite for classifying text effectively and automatically. TF•IDF is widely used to express text feature weights, but it has a significant limitation: it does not reflect how terms are distributed across categories, and therefore cannot capture a term's importance to a category or the differences between categories. This paper proposes a new feature weighting method, TF•IDF•Ci, which adds a new weight Ci to the original TF•IDF to express the differences between classes. When combined with a specific classification algorithm, TF•IDF•Ci always achieves a larger macro F1 value than TF•IDF. Meanwhile, the standard deviation of the classification index for TF•IDF•Ci is much smaller than that for TF•IDF. This shows that TF•IDF•Ci not only improves classification precision but also, to some extent, reduces sensitivity to the feature dimension.
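
To make the weighting idea concrete, the following is a minimal sketch rather than the method as defined in this paper. It assumes the common formulation tf × log(N/df) for TF•IDF, and it uses a hypothetical stand-in for Ci (the abstract does not give its definition): the largest share of a term's documents that falls into a single class. The function names `tf_idf`, `class_factor`, and `tf_idf_ci` are illustrative only.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Plain TF-IDF: relative term frequency times log(N / document frequency).

    docs is a list of token lists; returns one {term: weight} dict per document.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append(
            {t: (c / total) * math.log(n / df[t]) for t, c in counts.items()}
        )
    return weights


def class_factor(docs, labels):
    """Hypothetical stand-in for Ci (NOT the paper's definition).

    For each term, returns the largest fraction of its documents that belong
    to one class: 1.0 if the term occurs in a single class only, lower if it
    is spread across classes.
    """
    df = Counter()
    df_class = Counter()
    for doc, label in zip(docs, labels):
        for t in set(doc):
            df[t] += 1
            df_class[(t, label)] += 1
    classes = set(labels)
    return {t: max(df_class[(t, c)] for c in classes) / df[t] for t in df}


def tf_idf_ci(docs, labels):
    """Multiply each TF-IDF weight by the class factor of its term."""
    ci = class_factor(docs, labels)
    return [{t: w * ci[t] for t, w in doc_w.items()} for doc_w in tf_idf(docs)]


docs = [
    ["goal", "match", "team"],
    ["goal", "team", "score"],
    ["stock", "market", "team"],
    ["stock", "price", "market"],
]
labels = ["sport", "sport", "finance", "finance"]

base = tf_idf(docs)
weighted = tf_idf_ci(docs, labels)
```

In this toy corpus, "goal" occurs only in sport documents, so its weight is unchanged, while "team" occurs in both classes and is down-weighted. Any factor that rewards class-concentrated terms would behave similarly; the particular choice here is an assumption made for illustration.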