A Novel Feature Selection Approach Based on Document Frequency of Segmented Term Frequency

Feature selection is an important step in text classification: it eliminates redundant features and retains the terms that best discriminate between classes. In this paper, we propose a feature selection algorithm based on the document frequency of segmented term frequency (STF-DF), and we introduce two new concepts, “segmented term frequency” and “STF-DF.” We then compare STF-DF with six commonly used feature selection algorithms (document frequency, information gain, chi-square, CMFS, NDM, and t-test) on three popular datasets (20 Newsgroups, Classic3, and WebKB). Experimental results show that the proposed algorithm improves the accuracy of text classification.
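As a rough illustration of the quantities named above, the following Python sketch computes plain document frequency (one of the baselines) and a per-segment document frequency. The abstract does not define segmented term frequency, so the segmentation used here is only an assumption: the function `segmented_document_frequency` and its `boundaries` parameter are hypothetical names, and the sketch should not be read as the STF-DF algorithm itself.

```python
from collections import Counter, defaultdict


def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # each document counts a term at most once
    return df


def segmented_document_frequency(docs, boundaries=(1, 2, 4)):
    """HYPOTHETICAL reading of "document frequency of segmented term frequency".

    Assumption (not taken from the paper): term-frequency values are split
    into segments by `boundaries`; with (1, 2, 4) the segments are
    tf == 1, 2 <= tf <= 3, and tf >= 4. For every term we count how many
    documents fall into each segment.
    """
    seg_df = defaultdict(Counter)
    for tokens in docs:
        for term, tf in Counter(tokens).items():
            # Index of the last boundary that tf reaches (tf >= 1 always).
            segment = sum(1 for b in boundaries if tf >= b) - 1
            seg_df[term][segment] += 1
    return seg_df


# Toy corpus of three tokenized documents (made up for illustration).
docs = [
    ["ball", "ball", "game"],
    ["ball", "vote"],
    ["vote", "vote", "vote", "vote", "game"],
]

df = document_frequency(docs)
seg_df = segmented_document_frequency(docs)
```

With plain document frequency, "ball" and "vote" both appear in two documents and look identical. The segmented counts tell them apart: "vote" has one document in the highest segment, while "ball" has none, which is the kind of distinction a term-frequency-aware score could exploit.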
