Hybrid Method Based on Information Gain and Support Vector Machine for Gene Selection in Cancer Classification

Achieving sufficient cancer classification accuracy with the entire set of genes remains a great challenge because gene expression data are high-dimensional, noisy, and drawn from small samples. In this study, we therefore propose a hybrid gene selection method, Information Gain-Support Vector Machine (IG-SVM). IG is first used to filter out irrelevant and redundant genes. An SVM is then used to remove further redundant genes, which eliminates noise in the datasets more effectively. Finally, the informative genes selected by IG-SVM serve as the input to the LIBSVM classifier. Evaluated on five cancer gene expression datasets, IG-SVM achieved the highest classification accuracy among the related algorithms it was compared with, while using only a few selected genes. For example, for colon cancer, which is difficult to classify accurately, IG-SVM achieved a classification accuracy of 90.32% using only three genes: CSRP1, MYL9, and GUCA2B.
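The first stage described above, ranking genes by information gain, can be illustrated with a minimal sketch. The abstract does not say how expression values are discretized, so the median split used here is an assumption, and the gene names (`gene_a`, `gene_b`), the toy expression values, and the function names are hypothetical. The SVM-based refinement and the LIBSVM classification stage are not shown.

```python
import math
from collections import Counter
from statistics import median


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Reduction in label entropy after splitting one gene at `threshold`."""
    low = [y for x, y in zip(values, labels) if x <= threshold]
    high = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    # Weighted entropy of the two partitions; empty partitions contribute nothing.
    conditional = sum(len(part) / n * entropy(part) for part in (low, high) if part)
    return entropy(labels) - conditional


def rank_genes_by_ig(expression, labels):
    """Score each gene by IG at its median (an assumed discretization) and rank."""
    scores = {
        gene: information_gain(values, labels, median(values))
        for gene, values in expression.items()
    }
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores


# Toy data: six samples, two hypothetical genes.
labels = ["tumor", "tumor", "tumor", "normal", "normal", "normal"]
expression = {
    "gene_a": [5.1, 4.8, 5.3, 1.2, 0.9, 1.1],  # separates the classes cleanly
    "gene_b": [2.0, 3.1, 2.2, 3.0, 2.1, 2.9],  # nearly uninformative
}

ranking, scores = rank_genes_by_ig(expression, labels)
for gene in ranking:
    print(f"{gene}: IG = {scores[gene]:.3f}")
```

In a filter step of this kind, genes whose score falls below a chosen cutoff would be discarded before any SVM-based selection; the cutoff itself is a tuning choice that the abstract does not specify.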
