A filter feature selection method based on the Maximal Information Coefficient and Gram-Schmidt Orthogonalization for biomedical data mining

Filter feature selection techniques are widely used to mine biomedical data. Recently, a risk was identified in the classical filter method minimal-Redundancy-Maximal-Relevance (mRMR): a specific part of the redundancy, called irrelevant redundancy, may be included in the minimal-redundancy term of the criterion. A few attempts have therefore been made to eliminate this irrelevant redundancy by attaching additional procedures to mRMR, such as Kernel Canonical Correlation Analysis based mRMR (KCCAmRMR). In the present study, a novel filter feature selection method based on the Maximal Information Coefficient (MIC) and Gram-Schmidt Orthogonalization (GSO), named Orthogonal MIC Feature Selection (OMICFS), was proposed to address this problem. Unlike other improved approaches under the max-relevance and min-redundancy criterion, the proposed method uses the MIC to quantify the relevance between feature variables and the target variable, applies GSO to compute the orthogonalized variable of a candidate feature with respect to the previously selected features, and indirectly optimizes max-relevance and min-redundancy by maximizing the MIC relevance between the GSO-orthogonalized variable and the target. This orthogonalization strategy allows OMICFS to exclude irrelevant redundancy without any additional procedures. To verify its performance, OMICFS was compared with other filter feature selection methods in terms of both classification accuracy and computational efficiency in classification experiments on two types of biomedical datasets. The results showed that OMICFS outperforms the other methods in most cases. In addition, the differences between these methods were analyzed, and the application of OMICFS to the mining of high-dimensional biomedical data was discussed. The Matlab code for the proposed method is available at https://github.com/lhqxinghun/bioinformatics/tree/master/OMICFS/.
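As a rough illustration of the selection strategy described above, the following Python sketch combines a MIC estimate (via the minepy package) with a plain NumPy Gram-Schmidt step. It is a minimal sketch, not the authors' released MATLAB implementation: the function names (mic, omicfs), the mean-centering of features before orthogonalization, and the minepy parameters (alpha, c) are assumptions made here for illustration.

```python
# Minimal OMICFS-style sketch (illustrative, not the authors' MATLAB code).
# Assumes minepy is installed for the MIC estimate; other details are assumptions.
import numpy as np
from minepy import MINE


def mic(x, y, alpha=0.6, c=15):
    """MIC between two 1-D arrays, estimated with minepy's MINE."""
    m = MINE(alpha=alpha, c=c)
    m.compute_score(x, y)
    return m.mic()


def omicfs(X, y, n_select):
    """Greedy selection: at each step, orthogonalize every remaining feature
    against the already selected ones (Gram-Schmidt) and pick the candidate
    whose orthogonalized component has maximal MIC with the target y."""
    n_features = X.shape[1]
    selected, basis = [], []  # chosen column indices and orthonormal basis vectors

    # First feature: plain max-relevance by MIC.
    j0 = int(np.argmax([mic(X[:, j], y) for j in range(n_features)]))
    selected.append(j0)
    v = X[:, j0] - X[:, j0].mean()          # mean-centering is an assumption here
    basis.append(v / np.linalg.norm(v))

    while len(selected) < n_select:
        best_j, best_r, best_score = None, None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            # Gram-Schmidt step: remove the components lying in the span
            # of the previously selected (orthonormalized) features.
            r = X[:, j] - X[:, j].mean()
            for b in basis:
                r = r - np.dot(r, b) * b
            if np.linalg.norm(r) < 1e-10:   # candidate is fully redundant
                continue
            score = mic(r, y)               # relevance of the orthogonal part only
            if score > best_score:
                best_j, best_r, best_score = j, r, score
        if best_j is None:                  # no non-redundant candidates left
            break
        selected.append(best_j)
        basis.append(best_r / np.linalg.norm(best_r))

    return selected
```

Under these assumptions, omicfs(X, y, 20) would return the indices of 20 features chosen greedily, each maximizing the MIC between its component orthogonal to the previously selected features and the target, which is how the orthogonalization excludes irrelevant redundancy without any extra correction step.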
