Redundant Feature Elimination by Using Approximate Markov Blanket Based on Discriminative Contribution

Text data sets are high-dimensional and therefore hard to analyze: many weakly relevant but redundant features hurt the generalization performance of classifiers. Previous work addresses this problem with pairwise feature similarities, which ignore label information and so do not account for the discriminative contribution of each feature. Here we define an Approximate Markov Blanket (AMB) based on a DIScriminative Contribution (DISC) metric to eliminate redundant features, and propose the AMB-DISC algorithm. Experimental results on the Reuters-21578 data set show that AMB-DISC performs much better than previous state-of-the-art feature selection algorithms that consider feature redundancy, in terms of micro-averaged F1 (MicroavgF1) and macro-averaged F1 (MacroavgF1).
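
To make the idea of redundancy elimination with an approximate Markov blanket concrete, the following is a minimal sketch and not the AMB-DISC algorithm itself. The abstract does not define the DISC metric, so the sketch substitutes symmetrical uncertainty, the score used by FCBF-style filters, and applies the usual approximate-blanket test: a kept feature covers a candidate when it is at least as relevant to the label and its similarity to the candidate is at least the candidate's relevance. The names `entropy`, `symmetrical_uncertainty`, and `eliminate_redundant` are hypothetical, and features are assumed to be discrete.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """Stand-in score (assumption): 2 * I(X;Y) / (H(X) + H(Y)), in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    mutual_info = hx + hy - entropy(list(zip(x, y)))
    return 2.0 * mutual_info / (hx + hy)


def eliminate_redundant(features, labels, score=symmetrical_uncertainty):
    """Keep features that no already-kept feature approximately blankets.

    features: dict mapping feature name -> list of discrete values.
    score:    pluggable metric; a label-aware metric such as DISC would
              replace the stand-in here (its definition is not assumed).
    """
    relevance = {name: score(col, labels) for name, col in features.items()}
    # Visit features from most to least relevant, so every kept feature is
    # at least as relevant as any candidate examined after it.
    ranked = sorted(features, key=lambda name: relevance[name], reverse=True)
    kept = []
    for name in ranked:
        covered = any(
            score(features[k], features[name]) >= relevance[name] for k in kept
        )
        if not covered:
            kept.append(name)
    return kept


if __name__ == "__main__":
    toy = {"f1": [0, 0, 1, 1], "f2": [0, 1, 0, 1], "f3": [0, 0, 1, 1]}
    # f3 duplicates f1, so only f1 and f2 survive.
    print(eliminate_redundant(toy, labels=[0, 1, 2, 3]))
```

Because the relevance and similarity scores are passed in through `score`, a different metric can be swapped in without changing the elimination loop.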
