RSGSA: a Robust and Stable Gene Selection Algorithm

Gene expression data annotated with phenotypes is growing rapidly. Such data enables researchers to efficiently identify the genes responsible for particular medical conditions and to classify them as potential drug targets. Like other phenotype data in the medical domain, gene expression data with phenotypes forms a highly underdetermined system. In domains with a very large number of features but a very small sample size (e.g., DNA microarray, RNA-seq, and GWAS data), it is often reported that several different spurious feature subsets yield equally optimal results. This phenomenon is known as instability. With this in mind, we have developed a robust and stable supervised gene selection algorithm that selects the most discriminating, non-spurious set of genes from gene expression datasets with phenotypes. Stability and robustness are ensured by class-level and instance-level perturbations, respectively. We performed rigorous experimental evaluations on 10 real gene expression microarray datasets with phenotypes. The results show that our algorithm outperforms state-of-the-art algorithms with respect to both stability and classification accuracy. We also carried out biological enrichment analysis based on gene ontology biological process (GO-BP) terms, disease ontology (DO) terms, and biological pathways.
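
To make the notion of instability concrete, the following is a minimal sketch of one common way to quantify it: the mean pairwise Jaccard similarity between the gene subsets selected on different perturbed versions of the data. This is an illustrative assumption, not necessarily the stability measure used in the paper; the function names and the toy gene identifiers are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two gene subsets (1.0 means identical)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty selections are treated as identical
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(subsets):
    """Average Jaccard similarity over all pairs of selected subsets.

    Each element of `subsets` is the set of genes chosen on one
    perturbed copy of the data (e.g., one resample). Values near 1
    suggest a stable selector; values near 0 suggest instability.
    """
    pairs = list(combinations(subsets, 2))
    if not pairs:
        raise ValueError("need at least two subsets to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Toy selections from three hypothetical perturbed runs.
stable_runs = [{"g1", "g2", "g3"}, {"g1", "g2", "g3"}, {"g1", "g2", "g4"}]
unstable_runs = [{"g1", "g2", "g3"}, {"g4", "g5", "g6"}, {"g7", "g8", "g1"}]

stable_score = mean_pairwise_jaccard(stable_runs)
unstable_score = mean_pairwise_jaccard(unstable_runs)
```

In this toy example the first selector returns nearly the same genes on every run and scores much higher than the second, whose subsets barely overlap even though each might classify equally well.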
