A hybrid gene selection algorithm based on interaction information for microarray-based cancer classification

We address gene selection and machine learning methods for cancer classification using microarray gene expression data. Due to the high dimensionality of microarray data, traditional gene selection algorithms are filter-based, focusing on intrinsic properties of the data such as distance, dependency, and correlation. These methods are fast but select far too many genes to use for the classification task. In this work, we present a new hybrid filter-wrapper gene subset selection algorithm that is an improved modification of our prior algorithm. Our proposed method employs interaction information to rank candidate genes to add into a gene subset. It then conditionally adds one gene at a time into the current subset and verifies whether the resultant subset improves the classification performance significantly. Only significant genes are selected, and the candidate gene list is updated every time a gene is added to the subset. Thus, our gene selection algorithm is very dynamic. Experimental results on ten public cancer microarray data sets show that our method consistently outperforms prior gene selection algorithms in terms of classification accuracy, while requiring a small number of selected genes.

[1]  Songyot Nakariyakul,et al.  Detecting thermophilic proteins through selecting amino acid and dipeptide composition features , 2011, Amino Acids.

[2]  Songyot Nakariyakul,et al.  High-dimensional hybrid feature selection using interaction information-guided search , 2018, Knowl. Based Syst..

[3]  Songyot Nakariyakul,et al.  Gene selection using interaction information for microarray-based cancer classification , 2016, 2016 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB).

[4]  E. Lander,et al.  MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia , 2002, Nature Genetics.

[5]  William J. McGill Multivariate information transmission , 1954, Trans. IRE Prof. Group Inf. Theory.

[6]  M. Ringnér,et al.  Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks , 2001, Nature Medicine.

[7]  K. Stegmaier,et al.  Synergistic Drug Combinations with a CDK4/6 Inhibitor in T-cell Acute Lymphoblastic Leukemia , 2015, Clinical Cancer Research.

[8]  Jesús S. Aguilar-Ruiz,et al.  Fast feature selection aimed at high-dimensional data via hybrid-sequential-ranked searches , 2012, Expert Syst. Appl..

[9]  W. Thalheimer,et al.  How to calculate effect sizes from published research: A simplified methodology , 2002 .

[10]  Aixia Guo,et al.  Gene Selection for Cancer Classification using Support Vector Machines , 2014 .

[11]  Tzu-Tsung Wong,et al.  A gene selection method for microarray data based on risk genes , 2011, Expert Syst. Appl..

[12]  Zexuan Zhu,et al.  Markov blanket-embedded genetic algorithm for gene selection , 2007, Pattern Recognit..

[13]  Richard Weber,et al.  Simultaneous feature selection and classification using kernel-penalized support vector machines , 2011, Inf. Sci..

[14]  M. Heller DNA microarray technology: devices, systems, and applications. , 2002, Annual review of biomedical engineering.

[15]  Jacob Cohen,et al.  A power primer. , 1992, Psychological bulletin.

[16]  Songyot Nakariyakul Feature subset selection using generalized steepest ascent search algorithm , 2009, 2009 Eighth International Symposium on Natural Language Processing.

[17]  Ash A. Alizadeh,et al.  Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling , 2000, Nature.

[18]  S. Drăghici,et al.  Analysis of microarray experiments of gene expression profiling. , 2006, American journal of obstetrics and gynecology.

[19]  Nir Friedman,et al.  Tissue classification with gene expression profiles , 2000, RECOMB '00.

[20]  Pedro Larrañaga,et al.  Filter versus wrapper gene selection approaches in DNA microarray domains , 2004, Artif. Intell. Medicine.

[21]  T. H. Bø,et al.  New feature subset selection procedures for classification of expression profiles , 2002, Genome Biology.

[22]  U. Alon,et al.  Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. , 1999, Proceedings of the National Academy of Sciences of the United States of America.

[23]  John E. Moody,et al.  Data Visualization and Feature Selection: New Algorithms for Nongaussian Data , 1999, NIPS.

[24]  Songyot Nakariyakul,et al.  A sequence-based computational approach to predicting PDZ domain-peptide interactions. , 2014, Biochimica et biophysica acta.

[25]  Songyot Nakariyakul Suboptimal branch and bound algorithms for feature subset selection: A comparative study , 2014, Pattern Recognit. Lett..

[26]  E. Petricoin,et al.  Use of proteomic patterns in serum to identify ovarian cancer , 2002, The Lancet.

[27]  Jose Miguel Puerta,et al.  Fast wrapper feature subset selection in high-dimensional datasets by means of filter re-ranking , 2012, Knowl. Based Syst..

[28]  Jinyan Li,et al.  Identifying good diagnostic gene groups from gene expression profiles using the concept of emerging patterns , 2002, Bioinform..

[29]  Huan Liu,et al.  Efficient Feature Selection via Analysis of Relevance and Redundancy , 2004, J. Mach. Learn. Res..

[30]  T. Poggio,et al.  Prediction of central nervous system embryonal tumour outcome based on gene expression , 2002, Nature.

[31]  Fuhui Long,et al.  Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy , 2003, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[32]  Verónica Bolón-Canedo,et al.  A review of microarray datasets and applied feature selection methods , 2014, Inf. Sci..

[33]  J. Mesirov,et al.  Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. , 1999, Science.