PGCN: Disease gene prioritization by disease and gene embedding through graph convolutional neural networks

Motivation: Proper prioritization of candidate genes is essential to the genome-based diagnostics of a range of genetic diseases. However, it is a highly challenging task involving limited and noisy knowledge of genes, diseases and their associations. While a number of computational methods have been developed for the disease gene prioritization task, their performance is largely limited by manually crafted features, network topology, or pre-defined rules of data fusion.

Results: Here, we propose a novel graph convolutional network-based disease gene prioritization method, PGCN, through the systematic embedding of the heterogeneous network made by genes and diseases, as well as their individual features. The embedding learning model and the association prediction model are trained together in an end-to-end manner. We compared PGCN with five state-of-the-art methods on the Online Mendelian Inheritance in Man (OMIM) dataset for tasks to recover missing associations and discover associations between novel genes and diseases. Results show significant improvements of PGCN over the existing methods. We further demonstrate that our embedding has biological meaning and can capture functional groups of genes.

Availability: The main program and the data are available at https://github.com/lykaust15/Disease_gene_prioritization_GCN.
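To make the idea of embedding a gene-disease network and scoring associations concrete, the following is a minimal sketch and not the PGCN implementation. It assumes a single graph convolution layer with the symmetric normalization of Kipf and Welling, random toy features and weights, and a bilinear decoder; the decoder form, the helper names (`normalize_adjacency`, `gcn_layer`, `score_pairs`) and all sizes are illustrative assumptions that the abstract does not specify. No training loop is shown.

```python
import numpy as np

def normalize_adjacency(adj):
    """Return D^-1/2 (A + I) D^-1/2, the normalized adjacency with self-loops."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(a_norm, features, weight):
    """One graph convolution: aggregate neighbor features, project, apply ReLU."""
    return np.maximum(a_norm @ features @ weight, 0.0)

def score_pairs(gene_emb, disease_emb, relation):
    """Assumed bilinear decoder: sigmoid(g^T R d) for every gene-disease pair."""
    logits = gene_emb @ relation @ disease_emb.T
    return 1.0 / (1.0 + np.exp(-logits))

# Toy heterogeneous graph: 4 genes, 3 diseases, known associations as edges.
n_genes, n_diseases, n_feat, n_hidden = 4, 3, 5, 2
assoc = np.array([[1, 0, 0],
                  [0, 1, 0],
                  [0, 0, 1],
                  [1, 0, 1]], dtype=float)
adj = np.block([[np.zeros((n_genes, n_genes)), assoc],
                [assoc.T, np.zeros((n_diseases, n_diseases))]])

rng = np.random.default_rng(0)
features = rng.normal(size=(n_genes + n_diseases, n_feat))  # placeholder node features
weight = rng.normal(size=(n_feat, n_hidden))                # untrained layer weights
relation = rng.normal(size=(n_hidden, n_hidden))            # untrained decoder matrix

emb = gcn_layer(normalize_adjacency(adj), features, weight)
scores = score_pairs(emb[:n_genes], emb[n_genes:], relation)
print(scores.shape)  # one score per (gene, disease) pair
```

In an end-to-end setup, `weight` and `relation` would be optimized jointly against known associations, so that the embeddings are shaped directly by the prediction objective rather than learned in a separate step.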
