RSGSA: a Robust and Stable Gene Selection Algorithm

Gene expression data annotated with phenotypes is growing explosively. Such data enables researchers to efficiently identify the genes responsible for certain medical conditions and to classify them as drug targets. Like other phenotype data in the medical domain, gene expression data with phenotypes forms a highly underdetermined system. In domains with a very large number of features but a very small sample size (e.g., DNA microarray, RNA-seq, and GWAS data), it is often reported that several different spurious feature subsets yield equally optimal results. This phenomenon is known as instability. Considering these facts, we have developed a robust and stable supervised gene selection algorithm that selects the most discriminating, non-spurious set of genes from gene expression datasets with phenotypes. Stability and robustness are ensured by class-level and instance-level perturbations, respectively. We have performed rigorous experimental evaluations on 10 real gene expression microarray datasets with phenotypes. The results reveal that our algorithm outperforms state-of-the-art algorithms with respect to stability and classification accuracy.
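The instability described above can be illustrated with a small sketch. It is not the RSGSA algorithm: it assumes a simple mean-difference filter as the gene selector, synthetic Gaussian data, stratified bootstrap resampling as the perturbation, and mean pairwise Jaccard similarity as the stability score. All function names (`make_data`, `top_k_genes`, `selection_stability`) are hypothetical.

```python
import random
from itertools import combinations


def make_data(n_samples, n_genes, n_informative, shift, seed=0):
    """Synthetic expression matrix (assumption): the first n_informative genes
    are shifted by `shift` in class 1; every other gene is pure noise."""
    rng = random.Random(seed)
    y = [1 if i < n_samples // 2 else 0 for i in range(n_samples)]
    X = [
        [rng.gauss(0.0, 1.0) + (shift if label == 1 and g < n_informative else 0.0)
         for g in range(n_genes)]
        for label in y
    ]
    return X, y


def top_k_genes(X, y, k):
    """Rank genes by the absolute difference of class means; return the top k."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    n_genes = len(X[0])
    scores = [
        abs(sum(r[g] for r in pos) / len(pos) - sum(r[g] for r in neg) / len(neg))
        for g in range(n_genes)
    ]
    ranked = sorted(range(n_genes), key=lambda g: scores[g], reverse=True)
    return frozenset(ranked[:k])


def jaccard(a, b):
    """Jaccard similarity of two non-empty gene sets."""
    return len(a & b) / len(a | b)


def selection_stability(X, y, k, n_rounds=20, seed=0):
    """Mean pairwise Jaccard similarity of the gene sets selected on
    stratified bootstrap resamples of the samples."""
    rng = random.Random(seed)
    idx_pos = [i for i, label in enumerate(y) if label == 1]
    idx_neg = [i for i, label in enumerate(y) if label == 0]
    subsets = []
    for _ in range(n_rounds):
        # Resample within each class so that neither class becomes empty.
        sample = [rng.choice(idx_pos) for _ in idx_pos]
        sample += [rng.choice(idx_neg) for _ in idx_neg]
        subsets.append(top_k_genes([X[i] for i in sample], [y[i] for i in sample], k))
    pairs = list(combinations(subsets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    for shift in (0.0, 1.0, 20.0):
        X, y = make_data(n_samples=20, n_genes=200, n_informative=5, shift=shift, seed=1)
        print(f"shift={shift}: stability={selection_stability(X, y, k=5):.2f}")
```

With 20 samples and 200 genes, a weak or absent class signal tends to give a different top-5 set on each resample, while a strong signal gives the same set every time. A score near 1 indicates a stable selector under this particular perturbation; other perturbation schemes or similarity measures could be substituted.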
