Benchmarking network-based gene prioritization methods for cerebral small vessel disease

Abstract Network-based gene prioritization algorithms rank candidate disease-associated genes from known ones using biological networks of protein interactions, gene–disease associations (GDAs) and other relationships between biological entities. Various algorithms have been developed based on different mechanisms, but it is not obvious which algorithm is optimal for a specific disease. To address this issue, we benchmarked multiple algorithms for their application in cerebral small vessel disease (cSVD). We curated protein–gene interactions (PGIs) and GDAs from databases and assembled PGI networks and disease–gene heterogeneous networks. Screening of the available algorithms yielded seven representative algorithms for benchmarking. Algorithm performance was assessed using both leave-one-out cross-validation (LOOCV) and external validation against the MEGASTROKE genome-wide association study (GWAS). Random walk with restart on the heterogeneous network (RWRH) showed the best LOOCV performance, with a median LOOCV rediscovery rank of 185.5 (out of 19 463 genes). The GenePANDA algorithm had the most GWAS-confirmable genes among its top 200 predictions, while RWRH gave the best ranks for small vessel stroke-associated genes confirmed in GWAS. In conclusion, RWRH performed better overall for application in cSVD, despite its susceptibility to bias caused by degree centrality. The choice of algorithm should be evaluated before it is applied to a specific disease. Current purely network-based gene prioritization algorithms are unlikely to find novel disease-associated genes that are not associated with known ones. The tools for implementing and benchmarking the algorithms have been made available and can be generalized to other diseases.
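To make the LOOCV rediscovery rank concrete, the following is a minimal sketch of random walk with restart on a toy gene network, written in pure Python. It is not the benchmarked implementation: it runs on a single homogeneous network rather than the heterogeneous disease–gene network used by RWRH, and the toy edges, the restart probability of 0.7, the function names and the exact ranking convention (rank of the held-out gene among all non-seed genes) are assumptions made for illustration.

```python
from statistics import median


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbours)} from an edge list."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def random_walk_with_restart(adjacency, seeds, restart=0.7, tol=1e-10, max_iter=1000):
    """Iterate p <- restart * p0 + (1 - restart) * W p until the change is below tol."""
    nodes = sorted(adjacency)
    # Restart vector: uniform probability over the seed (known) genes.
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            neighbours = adjacency[u]
            if not neighbours:
                continue  # isolated node: its walking mass is dropped in this sketch
            share = (1.0 - restart) * p[u] / len(neighbours)
            for v in neighbours:
                nxt[v] += share
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p


def loocv_rediscovery_ranks(adjacency, known_genes, restart=0.7):
    """Hold out each known gene in turn and record its rank among non-seed genes."""
    ranks = {}
    for held_out in sorted(known_genes):
        seeds = set(known_genes) - {held_out}
        scores = random_walk_with_restart(adjacency, seeds, restart)
        candidates = [n for n in adjacency if n not in seeds]
        # Higher score first; ties broken by name for reproducibility.
        candidates.sort(key=lambda n: (-scores[n], n))
        ranks[held_out] = candidates.index(held_out) + 1
    return ranks


# Hypothetical toy network; gene names are placeholders, not real cSVD genes.
toy_edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"),
             ("D", "E"), ("E", "F"), ("F", "G")]
toy_adjacency = build_adjacency(toy_edges)
toy_known = {"A", "B", "C"}

toy_scores = random_walk_with_restart(toy_adjacency, toy_known)
toy_ranks = loocv_rediscovery_ranks(toy_adjacency, toy_known)
toy_median_rank = median(toy_ranks.values())
```

A lower median rank means held-out known genes are recovered closer to the top of the candidate list. Because each node spreads its probability over its neighbours and highly connected nodes receive mass from many sources, scores of this kind tend to favour hub genes, which is one way to read the degree-centrality bias noted above.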
