Jumping across biomedical contexts using compressive data fusion

Motivation: The rapid growth of diverse biological data allows us to consider interactions between a variety of objects, such as genes, chemicals, molecular signatures, diseases, pathways and environmental exposures. Often, any pair of objects, such as a gene and a disease, can be related in different ways, for example, directly via gene–disease associations or indirectly via functional annotations, chemicals and pathways. Different ways of relating these objects carry different semantic meanings. However, traditional methods disregard these semantics and thus cannot fully exploit their value in data modeling.

Results: We present Medusa, an approach to detect size-k modules of objects that, taken together, appear most significant to another set of objects. Medusa operates on large-scale collections of heterogeneous datasets and explicitly distinguishes between diverse data semantics. It advances research along two dimensions: it builds on collective matrix factorization to derive different semantics, and it formulates the growing of the modules as a submodular optimization program. Medusa is flexible in choosing or combining semantic meanings and provides theoretical guarantees about detection quality. In a systematic study on 310 complex diseases, we show the effectiveness of Medusa in associating genes with diseases and detecting disease modules. We demonstrate that in predicting gene–disease associations Medusa compares favorably to methods that ignore diverse semantic meanings. We find that the utility of different semantics depends on disease categories and that, overall, Medusa recovers disease modules more accurately when combining different semantics.

Availability and implementation: Source code is at http://github.com/marinkaz/medusa

Contact: marinka@cs.stanford.edu, blaz.zupan@fri.uni-lj.si
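The idea of growing a size-k module by submodular optimization can be illustrated with a minimal greedy sketch. This is not Medusa's implementation: the objective below is a generic facility-location function chosen as a stand-in because it is monotone and submodular for nonnegative scores, and the names `scores`, `coverage_value` and `greedy_module` are hypothetical. The matrix `scores[i][j]` is assumed to hold a nonnegative relevance of candidate object i (for example, a gene) to pivot object j (for example, a disease), such as might be obtained from a factorized model.

```python
def coverage_value(scores, module):
    """Facility-location objective: for every pivot object (column), take the
    best score offered by any module member, then sum over pivots.
    This stand-in objective is monotone and submodular for nonnegative scores."""
    if not module:
        return 0.0
    n_pivots = len(scores[0])
    return sum(max(scores[i][j] for i in module) for j in range(n_pivots))


def greedy_module(scores, k):
    """Grow a module of at most k candidates, adding at each step the
    candidate with the largest marginal gain in the objective."""
    module = []
    current = 0.0
    for _ in range(min(k, len(scores))):
        best_gain, best_i = None, None
        for i in range(len(scores)):
            if i in module:
                continue
            gain = coverage_value(scores, module + [i]) - current
            if best_gain is None or gain > best_gain:
                best_gain, best_i = gain, i
        module.append(best_i)
        current += best_gain
    return module, current


# Toy relevance matrix (made-up numbers): 4 candidates x 3 pivots.
scores = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.7, 0.6],
    [0.1, 0.0, 0.5],
]
module, value = greedy_module(scores, k=2)
print(module, round(value, 3))
```

For monotone submodular objectives under a cardinality constraint, this kind of greedy selection is known to reach at least a (1 − 1/e) fraction of the optimal value, which is the sort of guarantee the abstract alludes to; whether Medusa's own objective and guarantee take exactly this form is not stated here.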
