An extensive analysis of disease-gene associations using network integration and fast kernel-based gene prioritization methods

Objective In the context of “network medicine”, gene prioritization methods represent one of the main tools to discover candidate disease genes by exploiting the large amount of data covering different types of functional relationships between genes. Several works proposed to integrate multiple sources of data to improve disease gene prioritization, but to our knowledge no systematic studies focused on the quantitative evaluation of the impact of network integration on gene prioritization. In this paper, we aim at providing an extensive analysis of gene-disease associations not limited to genetic disorders, and a systematic comparison of different network integration methods for gene prioritization. Materials and methods We collected nine different functional networks representing different functional relationships between genes, and we combined them through both unweighted and weighted network integration methods. We then prioritized genes with respect to each of the considered 708 medical subject headings (MeSH) diseases by applying classical guilt-by-association, random walk and random walk with restart algorithms, and the recently proposed kernelized score functions. Results The results obtained with classical random walk algorithms and the best single network achieved an average area under the curve (AUC) across the 708 MeSH diseases of about 0.82, while kernelized score functions and network integration boosted the average AUC to about 0.89. Weighted integration, by exploiting the different “informativeness” embedded in different functional networks, outperforms unweighted integration at 0.01 significance level, according to the Wilcoxon signed rank sum test. For each MeSH disease we provide the top-ranked unannotated candidate genes, available for further bio-medical investigation. Conclusions Network integration is necessary to boost the performances of gene prioritization methods. Moreover the methods based on kernelized score functions can further enhance disease gene ranking results, by adopting both local and global learning strategies, able to exploit the overall topology of the network.

[1]  P. Aloy,et al.  A network medicine approach to human disease , 2009, FEBS letters.

[2]  Jagdish Chandra Patra,et al.  Integration of multiple data sources to prioritize candidate genes using discounted rating system , 2010, BMC Bioinformatics.

[3]  László Lovász,et al.  Random Walks on Graphs: A Survey , 1993 .

[4]  中尾 光輝,et al.  KEGG(Kyoto Encyclopedia of Genes and Genomes)〔和文〕 (特集 ゲノム医学の現在と未来--基礎と臨床) -- (データベース) , 2000 .

[5]  Yingyao Zhou,et al.  In Silico Gene Prioritization by Integrating Multiple Data Sources , 2011, PloS one.

[6]  Y. Moreau,et al.  Computational tools for prioritizing candidate genes: boosting disease gene discovery , 2012, Nature Reviews Genetics.

[7]  Harm van Bakel,et al.  TEAM: a tool for the integration of expression, and linkage and association maps , 2004, European Journal of Human Genetics.

[8]  Nicolò Cesa-Bianchi,et al.  Synergy of multi-label hierarchical ensembles, data fusion, and cost-sensitive methods for gene functional inference , 2012, Machine Learning.

[9]  Giorgio Valentini,et al.  A Fast Ranking Algorithm for Predicting Gene Functions in Biomolecular Networks , 2012, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[10]  Giorgio Valentini,et al.  Large Scale Ranking and Repositioning of Drugs with Respect to DrugBank Therapeutic Categories , 2012, ISBRA.

[11]  Rui Jiang,et al.  Constructing a gene semantic similarity network for the inference of disease genes , 2011, BMC Systems Biology.

[12]  L. Asz Random Walks on Graphs: a Survey , 2022 .

[13]  Bassem A. Hassan,et al.  Gene prioritization through genomic data fusion , 2006, Nature Biotechnology.

[14]  Philip Resnik,et al.  Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language , 1999, J. Artif. Intell. Res..

[15]  A. Barabasi,et al.  Network medicine : a network-based approach to human disease , 2010 .

[16]  Jan Freudenberg,et al.  A similarity-based method for genome-wide prediction of disease-relevant human genes , 2002, ECCB.

[17]  Pericles A. Mitkas,et al.  Data Mining in Proteomics Using Grid Computing , 2009 .

[18]  Xiaowei Xu,et al.  Constructing a robust protein-protein interaction network by integrating multiple public databases , 2011, BMC Bioinformatics.

[19]  P. Robinson,et al.  Walking the interactome for prioritization of candidate disease genes. , 2008, American journal of human genetics.

[20]  Steven Salzberg,et al.  JIGSAW: integration of multiple sources of evidence for gene prediction , 2005, Bioinform..

[21]  Carolina Perez-Iratxeta,et al.  Linking genes to diseases: it's all in the data , 2009, Genome Medicine.

[22]  Tijl De Bie,et al.  Kernel-based data fusion for gene prioritization , 2007, ISMB/ECCB.

[23]  Michael Schroeder,et al.  Facts from text: can text mining help to scale-up high-quality manual curation of gene products with ontologies? , 2008, Briefings Bioinform..

[24]  Duc-Hau Le,et al.  Neighbor-favoring weight reinforcement to improve random walk-based disease gene prioritization , 2013, Comput. Biol. Chem..

[25]  Haixuan Yang,et al.  Improving GO semantic similarity measures by exploring the ontology beneath the terms and modelling uncertainty , 2012, Bioinform..

[26]  Yana Bromberg,et al.  Chapter 15: Disease Gene Prioritization , 2013, PLoS Comput. Biol..

[27]  Gregory D. Schuler,et al.  Database resources of the National Center for Biotechnology Information: update , 2004, Nucleic acids research.

[28]  Carl Kingsford,et al.  The power of protein interaction networks for associating genes with diseases , 2010, Bioinform..

[29]  Thomas Lengauer,et al.  Improving disease gene prioritization using the semantic similarity of Gene Ontology terms , 2010, Bioinform..

[30]  Bart De Moor,et al.  eXtasy: variant prioritization by genomic data fusion , 2013, Nature Methods.

[31]  Lewis Y. Geer,et al.  Database resources of the National Center for Biotechnology Information , 2014, Nucleic Acids Res..

[32]  Jing Chen,et al.  ToppGene Suite for gene list enrichment analysis and candidate gene prioritization , 2009, Nucleic Acids Res..

[33]  Bart De Moor,et al.  Comparison of vocabularies, representations and ranking algorithms for gene prioritization by text mining , 2008, ECCB.

[34]  E. Marcotte,et al.  Prioritizing candidate disease genes by network-based boosting of genome-wide association data. , 2011, Genome research.

[35]  Graciela Gonzalez,et al.  Towards Integrative Gene Prioritization in Alzheimer's Disease , 2011, Pacific Symposium on Biocomputing.

[36]  Sanghamitra Bandyopadhyay,et al.  A Weighted Power Framework for Integrating Multisource Information: Gene Function Prediction in Yeast , 2012, IEEE Transactions on Biomedical Engineering.

[37]  P. Bork,et al.  Association of genes to genetically inherited diseases using data mining , 2002, Nature Genetics.

[38]  Marylyn D. Ritchie,et al.  Pacific Symposium on Biocomputing 14:368-379 (2009) BIOFILTER: A KNOWLEDGE-INTEGRATION SYSTEM FOR THE MULTI-LOCUS ANALYSIS OF GENOME-WIDE ASSOCIATION STUDIES * , 2022 .

[39]  A. Valencia,et al.  Linking genes to literature: text mining, information extraction, and retrieval applications for biology , 2008, Genome Biology.

[40]  L. Stein,et al.  A human functional protein interaction network and its application to cancer data analysis , 2010, Genome Biology.

[41]  Christie S. Chang,et al.  The BioGRID interaction database: 2013 update , 2012, Nucleic Acids Res..

[42]  Zoubin Ghahramani,et al.  Gene function prediction from synthetic lethality networks via ranking on demand , 2010, Bioinform..

[43]  Giorgio Valentini,et al.  Ensemble methods : a review , 2012 .

[44]  Giorgio Valentini,et al.  Cancer module genes ranking using kernelized score functions , 2012, BMC Bioinformatics.

[45]  Thomas C. Wiegers,et al.  The Comparative Toxicogenomics Database: update 2013 , 2012, Nucleic Acids Res..

[46]  Bernhard Schölkopf,et al.  Kernel Methods in Computational Biology , 2005 .

[47]  Tianzi Jiang,et al.  Exploring candidate genes for human brain diseases from a brain-specific gene network. , 2006, Biochemical and biophysical research communications.

[48]  A. Barabasi,et al.  Interactome Networks and Human Disease , 2011, Cell.

[49]  Thomas C. Wiegers,et al.  The Comparative Toxicogenomics Database: update 2011 , 2010, Nucleic Acids Res..

[50]  A. Fraser,et al.  A probabilistic view of gene function , 2004, Nature Genetics.

[51]  Alfonso Valencia,et al.  Translational disease interpretation with molecular networks , 2009, Genome Biology.

[52]  Susumu Goto,et al.  KEGG: Kyoto Encyclopedia of Genes and Genomes , 2000, Nucleic Acids Res..

[53]  José Antonio Lozano,et al.  Sensitivity Analysis of k-Fold Cross Validation in Prediction Error Estimation , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[54]  M. Ashburner,et al.  Gene Ontology: tool for the unification of biology , 2000, Nature Genetics.

[55]  Mark Gerstein,et al.  Interpretation of Genomic Variants Using a Unified Biological Network Approach , 2013, PLoS Comput. Biol..

[56]  Masashi Sugiyama,et al.  Integration of Multiple Networks for Robust Label Propagation , 2008, SDM.

[57]  Hamid Bolouri,et al.  A data integration methodology for systems biology. , 2005, Proceedings of the National Academy of Sciences of the United States of America.

[58]  R. Piro,et al.  Computational approaches to disease‐gene prediction: rationale, classification and successes , 2012, The FEBS journal.

[59]  Alan F. Scott,et al.  McKusick's Online Mendelian Inheritance in Man (OMIM®) , 2008, Nucleic Acids Res..

[60]  Ashok N. Srivastava,et al.  Advances in Machine Learning and Data Mining for Astronomy , 2012 .

[61]  Tu-Bao Ho,et al.  Detecting disease genes based on semi-supervised learning and protein-protein interaction networks , 2012, Artif. Intell. Medicine.

[62]  Ting Chen,et al.  An Integrated Probabilistic Model for Functional Prediction of Proteins , 2004, J. Comput. Biol..

[63]  Quaid Morris,et al.  Fast integration of heterogeneous data sources for predicting gene function with limited annotation , 2010, Bioinform..

[64]  Giorgio Valentini,et al.  Network-Based Drug Ranking and Repositioning with Respect to DrugBank Therapeutic Categories , 2013, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[65]  Alexandre P. Francisco,et al.  Interactogeneous: Disease Gene Prioritization Using Heterogeneous Networks and Full Topology Scores , 2012, PloS one.

[66]  Alexander J. Smola,et al.  Kernels and Regularization on Graphs , 2003, COLT.

[67]  Giorgio Valentini,et al.  Simple ensemble methods are competitive with state-of-the-art data integration methods for gene function prediction , 2010, MLSB.

[68]  Jonathan D. G. Jones,et al.  Evidence for Network Evolution in an Arabidopsis Interactome Map , 2011, Science.

[69]  Frances S. Turner,et al.  POCUS: mining genomic sequence annotation to predict disease genes , 2003, Genome Biology.

[70]  D. Koller,et al.  A module map showing conditional activity of expression modules in cancer , 2004, Nature Genetics.

[71]  E. Schadt Molecular networks as sensors and drivers of common human diseases , 2009, Nature.

[72]  Ivan Molineris,et al.  An atlas of tissue-specific conserved coexpression for functional annotation and disease gene prediction , 2011, European Journal of Human Genetics.

[73]  C. Wijmenga,et al.  Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. , 2006, American journal of human genetics.

[74]  Maricel G. Kann,et al.  Protein interactions and disease: computational approaches to uncover the etiology of diseases , 2007, Briefings Bioinform..