Use of relevancy and complementary information for discriminatory gene selection from high-dimensional cancer data

With the advent of high-throughput technologies, the life sciences are generating vast amounts of biomolecular data. Global gene expression profiles provide a snapshot of which genes are transcribed in a cell or tissue at a particular moment under a particular condition. The high dimensionality of such gene expression data (i.e., a very large number of features/genes measured in far fewer samples) makes it difficult to identify de novo the key genes (biomarkers) that contribute most significantly to a particular phenotype or condition, such as cancer or another disease. As the number of genes increases, simple feature selection methods perform poorly both at selecting effective, informative features and at capturing biological information. To address these issues, we propose a mutual information-based gene selection method (MGS) for selecting informative genes, together with two methods for ranking the selected genes: one based on frequency (MGSf) and one based on Random Forest (MGSrf). We tested our methods on four real gene expression datasets derived from different studies on cancerous and normal samples. On these datasets, our methods achieved better classification rates than recently reported methods. Our methods could also detect the key relevant pathways with a causal relationship to the phenotype.
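To make the notion of mutual information-based relevance concrete, the following is a minimal sketch of a generic relevance filter, not the MGS procedure itself, whose details are not given in this abstract. It assumes equal-width binning of expression values and scores each gene by its mutual information with the class label; the function names, the toy data, and the choice of three bins are all hypothetical.

```python
import math
from collections import Counter


def discretize(values, bins=3):
    """Equal-width binning of one gene's expression values (an assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant gene carries no information; put everything in one bin.
        return [0] * len(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]


def mutual_information(xs, ys):
    """I(X;Y) in bits for two equal-length sequences of discrete symbols."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def rank_genes(expr, labels, bins=3):
    """Return (gene, relevance) pairs sorted by decreasing I(gene; class)."""
    scores = {
        gene: mutual_information(discretize(values, bins), labels)
        for gene, values in expr.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: three genes measured in six samples.
expr = {
    "gene_a": [0.1, 0.2, 0.15, 2.1, 2.3, 2.0],  # separates the two classes
    "gene_b": [1.0, 2.0, 1.0, 2.0, 1.0, 2.0],   # weakly related to the class
    "gene_c": [5.0, 5.0, 5.0, 5.0, 5.0, 5.0],   # constant, uninformative
}
labels = [0, 0, 0, 1, 1, 1]  # e.g. 0 = normal, 1 = cancerous

for gene, score in rank_genes(expr, labels):
    print(f"{gene}: {score:.3f} bits")
```

A filter of this kind scores each gene independently, so it cannot by itself account for redundancy or complementarity between genes; those would require additional terms involving pairs of genes, which this sketch deliberately leaves out.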
