Combining Multiple Feature Selection Methods for Text Categorization by Using Rank-Score Characteristics

Feature selection is an important method for improving the efficiency and accuracy of text categorization algorithms by removing redundant and irrelevant terms from the corpus. Extensive research has been done to improve the performance of individual feature selection methods, but much less on their combinations. In this paper, we propose a method of combining multiple feature selection methods by using Combinatorial Fusion Analysis (CFA). A rank-score function and its graph, called the rank-score graph, are adopted to measure the diversity of different feature selection methods. We show that a combination of multiple feature selection methods can outperform a single method only if each individual feature selection method has unique scoring behavior and relatively high performance. Moreover, we show that the rank-score function and rank-score graph are useful for selecting which feature selection methods to combine.
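To make the idea of a rank-score function concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes that each feature selection method assigns a numeric score to every term, that scores are min-max normalized to [0, 1], that the rank-score function maps rank i to the normalized score of the term at that rank, and that diversity is the mean absolute difference between two rank-score functions. The function names, the toy terms, the toy scores, and the choice of average-score combination are all illustrative assumptions.

```python
def normalize(scores):
    """Min-max normalize a {term: score} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {t: (s - lo) / span if span else 0.0 for t, s in scores.items()}


def rank_score_function(scores):
    """Return normalized scores in descending order.

    Entry i-1 of the result is the score of the term at rank i,
    which is the value of the rank-score function at i.
    """
    return sorted(normalize(scores).values(), reverse=True)


def diversity(scores_a, scores_b):
    """Mean absolute gap between two rank-score functions (assumed measure)."""
    fa = rank_score_function(scores_a)
    fb = rank_score_function(scores_b)
    return sum(abs(x - y) for x, y in zip(fa, fb)) / len(fa)


def score_combination(*systems):
    """Combine methods by averaging their normalized scores per term."""
    normed = [normalize(s) for s in systems]
    return {t: sum(n[t] for n in normed) / len(normed) for t in normed[0]}


def top_k(scores, k):
    """Select the k highest-scoring terms."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical scores from two feature selection methods over the same terms.
method_a = {"ball": 9.0, "game": 7.0, "team": 4.0, "the": 1.0}
method_b = {"ball": 0.8, "game": 0.2, "team": 0.7, "the": 0.0}

combined = score_combination(method_a, method_b)
selected = top_k(combined, 2)
d = diversity(method_a, method_b)
```

Plotting the list returned by `rank_score_function` against rank gives the rank-score graph for a method; two methods whose graphs lie far apart have more distinct scoring behavior under this assumed measure.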
