Gemini: memory-efficient integration of hundreds of gene networks with high-order pooling

Motivation: The exponential growth of genomic sequencing data has created ever-expanding repositories of gene networks. Unsupervised network integration methods are critical for learning an informative representation of each gene, which is later used as a feature vector in downstream applications. However, these methods must scale to the increasing number of networks and remain robust to an uneven distribution of network types across hundreds of gene networks.

Results: To address these needs, we present Gemini, a novel network integration method that uses memory-efficient high-order pooling to represent each network and weight it according to its uniqueness. Gemini then mitigates the uneven distribution by mixing up existing networks to create many new networks. We find that Gemini leads to more than a 10% improvement in F1 score, a 14% improvement in micro-AUPRC, and a 71% improvement in macro-AUPRC for protein function prediction by integrating hundreds of networks from BioGRID. Gemini's performance also improves significantly as more networks are added to the input network collection, whereas the comparison approach's performance deteriorates. Gemini thereby enables memory-efficient and informative network integration for large gene networks, and can be used to massively integrate and analyze networks in other domains.

Availability: Gemini can be accessed at: https://github.com/MinxZ/Gemini.

Contact: addiewc@cs.washington.edu, swang@cs.washington.edu
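To make the "mixing up existing networks" step concrete, the following is a minimal sketch of mixup-style augmentation applied to adjacency matrices. It is not taken from the Gemini code base: the function name `mixup_networks`, the `alpha` parameter, the uniform choice of network pairs, and the toy random networks are all assumptions made for illustration. The abstract does not say how Gemini selects or weights the networks it mixes.

```python
import numpy as np


def mixup_networks(adjacencies, n_new, alpha=1.0, seed=0):
    """Create n_new networks as convex combinations of random pairs.

    Hypothetical helper: pairs are drawn uniformly at random here, which
    is an assumption and may differ from how Gemini picks networks.
    """
    rng = np.random.default_rng(seed)
    mixed = []
    for _ in range(n_new):
        # Pick two distinct input networks.
        i, j = rng.choice(len(adjacencies), size=2, replace=False)
        # Mixing coefficient in [0, 1], as in standard mixup.
        lam = rng.beta(alpha, alpha)
        mixed.append(lam * adjacencies[i] + (1.0 - lam) * adjacencies[j])
    return mixed


# Toy input: three random symmetric 0/1 networks over the same five genes.
rng = np.random.default_rng(42)
networks = []
for _ in range(3):
    upper = np.triu(rng.random((5, 5)) < 0.4, k=1).astype(float)
    networks.append(upper + upper.T)

new_networks = mixup_networks(networks, n_new=4)
print(len(new_networks), new_networks[0].shape)
```

Because each new network is a convex combination of two symmetric matrices defined over the same gene set, it stays symmetric and its edge weights stay within the range of the inputs.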
