Graph-based semi-supervised learning with genomic data integration using condition-responsive genes applied to phenotype classification

Objective: Data integration methods that combine data from different molecular levels, such as the genome, epigenome, and transcriptome, have received a great deal of interest in the past few years. The synergistic effects of different biological data types have been shown to boost learning capabilities and lead to a better understanding of the underlying interactions among molecular levels.

Methods: In this paper, we present a graph-based semi-supervised classification algorithm that incorporates latent biological knowledge, in the form of biological pathways, with gene expression and DNA methylation data. Graph construction from biological pathways is based on detecting condition-responsive genes, from which three gene sets are extracted: all condition-responsive genes, high-frequency condition-responsive genes, and P-value-filtered genes.

Results: The proposed approach is applied to ovarian cancer data downloaded from the Human Genome Atlas. Extensive numerical experiments demonstrate superior performance of the proposed approach compared to other state-of-the-art algorithms, including the latest graph-based classification techniques.

Conclusions: Simulation results demonstrate that integrating various data types enhances classification performance and leads to a better understanding of the interrelations between diverse omics data types. The proposed approach outperforms many state-of-the-art data integration algorithms.
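To make the general idea of graph-based semi-supervised classification over several data types concrete, the following is a minimal sketch, not the authors' algorithm. It assumes a Gaussian (RBF) similarity graph per data type, a fixed weighted average to merge the graphs, and label spreading with a symmetrically normalized graph. The abstract does not specify any of these choices, and the pathway-based selection of condition-responsive genes is not modeled here. All function names, parameters, and the toy data are hypothetical.

```python
import math

def rbf_graph(X, sigma):
    """Similarity graph: W[i][j] = exp(-||xi - xj||^2 / (2 sigma^2)), zero diagonal."""
    n = len(X)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
            W[i][j] = W[j][i] = math.exp(-d2 / (2.0 * sigma ** 2))
    return W

def combine_graphs(graphs, weights):
    """Weighted sum of per-data-type graphs (an assumed integration rule)."""
    n = len(graphs[0])
    return [[sum(w * G[i][j] for G, w in zip(graphs, weights))
             for j in range(n)] for i in range(n)]

def spread_labels(W, y, alpha=0.9, n_iter=200):
    """Iterate f <- alpha * S f + (1 - alpha) * y with S = D^-1/2 W D^-1/2.

    y[i] is +1 or -1 for labeled samples and 0 for unlabeled ones.
    """
    n = len(W)
    inv_sqrt = [1.0 / math.sqrt(sum(row)) if sum(row) > 0 else 0.0 for row in W]
    S = [[inv_sqrt[i] * W[i][j] * inv_sqrt[j] for j in range(n)] for i in range(n)]
    f = [float(v) for v in y]
    for _ in range(n_iter):
        f = [alpha * sum(S[i][j] * f[j] for j in range(n)) + (1 - alpha) * y[i]
             for i in range(n)]
    return f

# Toy data (fabricated for illustration only): six samples, two data types.
expression = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
              [3.0, 3.1], [3.2, 2.9], [2.9, 3.0]]
methylation = [[0.10], [0.05], [0.15], [0.90], [0.95], [0.85]]
labels = [1, 0, 0, -1, 0, 0]  # one labeled sample per class, the rest unlabeled

W = combine_graphs(
    [rbf_graph(expression, sigma=1.0), rbf_graph(methylation, sigma=0.3)],
    weights=[0.5, 0.5],
)
scores = spread_labels(W, labels)
predictions = [1 if s >= 0 else -1 for s in scores]
print(predictions)
```

In this sketch the unlabeled samples take the label of the labeled sample they are most strongly connected to in the merged graph. The equal weights given to the two graphs are arbitrary; a real integration method would need a principled way to choose or learn them.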
