High-dimensional hybrid feature selection using interaction information-guided search

Abstract: With the rapid growth of high-dimensional data sets in recent years, the need to reduce the dimensionality of data has grown significantly. Although wrapper approaches tend to achieve higher accuracy than filter techniques for the same number of selected features, only a few wrapper algorithms are applicable to high-dimensional data sets because their computational time becomes prohibitive. We therefore propose a new hybrid feature selection algorithm that is computationally efficient and achieves high accuracy on high-dimensional data. The proposed method uses interaction information to guide the search, sequentially adds one feature at a time to the currently selected subset, and adopts early stopping to prevent overfitting and speed up the search. The method is dynamic and selects only relevant, non-redundant features that significantly improve accuracy. Experimental results on eleven high-dimensional data sets demonstrate that our algorithm consistently outperforms prior feature selection techniques while requiring a reasonable amount of search time.
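The search strategy outlined in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the scoring rule, the classifier, or the stopping criterion, so everything below is an assumption made for illustration. In particular, the candidate score (relevance plus mean interaction information with the already selected features), the leave-one-out 1-nearest-neighbour wrapper, the `patience` parameter, and all function names are hypothetical. Interaction information is computed for discrete features as I(X;Y;C) = I(X;Y|C) - I(X;Y), one common sign convention in which positive values indicate synergy.

```python
from collections import Counter
from math import log2


def entropy(*cols):
    """Joint Shannon entropy (bits) of one or more discrete columns."""
    n = len(cols[0])
    counts = Counter(zip(*cols))
    return -sum(c / n * log2(c / n) for c in counts.values())


def mutual_info(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(x, y)


def cond_mutual_info(x, y, z):
    """I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)."""
    return entropy(x, z) + entropy(y, z) - entropy(x, y, z) - entropy(z)


def interaction_info(x, y, c):
    """I(X;Y;C) = I(X;Y|C) - I(X;Y); positive means synergy (assumed convention)."""
    return cond_mutual_info(x, y, c) - mutual_info(x, y)


def loo_accuracy(columns, labels):
    """Leave-one-out accuracy of a 1-NN classifier with Hamming distance.

    Assumed stand-in for the wrapper's classifier; ties go to the lowest index.
    """
    rows = list(zip(*columns))
    hits = 0
    for i, row in enumerate(rows):
        nearest = min(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: sum(a != b for a, b in zip(row, rows[j])),
        )
        hits += labels[nearest] == labels[i]
    return hits / len(rows)


def hybrid_select(features, labels, patience=2):
    """Greedy forward search guided by interaction information (hypothetical).

    features: dict mapping feature name -> list of discrete values.
    Stops after `patience` consecutive candidates fail to improve accuracy.
    """
    remaining = set(features)
    selected, best_acc, misses = [], 0.0, 0
    while remaining and misses < patience:
        def score(f):
            relevance = mutual_info(features[f], labels)
            if not selected:
                return relevance
            inter = sum(
                interaction_info(features[f], features[s], labels)
                for s in selected
            )
            return relevance + inter / len(selected)

        # Filter step: pick the best-scoring candidate (name order breaks ties).
        cand = max(sorted(remaining), key=score)
        remaining.remove(cand)
        # Wrapper step: keep the candidate only if accuracy strictly improves.
        acc = loo_accuracy([features[s] for s in selected + [cand]], labels)
        if acc > best_acc:
            selected.append(cand)
            best_acc = acc
            misses = 0
        else:
            misses += 1
    return selected, best_acc


# Toy XOR data: the label depends on "a" and "b" jointly; "dup" copies "a".
a = [0, 0, 1, 1, 0, 0, 1, 1]
b = [0, 1, 0, 1, 0, 1, 0, 1]
y = [u ^ v for u, v in zip(a, b)]
print(hybrid_select({"a": a, "b": b, "dup": list(a)}, y))
```

On this toy XOR problem, no single feature carries any mutual information with the label, yet the interaction term ranks `b` above the redundant copy `dup` once `a` has been selected. This is the kind of behaviour a purely relevance-based filter would miss, and it is the motivation for guiding the search with interaction information.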
