New gene association measures by joint network embedding of multiple gene expression datasets

A large number of samples is required to construct a reliable gene co-expression network, and the samples from a single gene expression dataset are usually not enough. Combining datasets is not straightforward, however, because batch effects are common among datasets generated under different experimental conditions. We propose JEBIN (Joint Embedding of multiple BIpartite Networks), an algorithm that learns a low-dimensional representation vector for each gene by integrating multiple bipartite networks, each of which corresponds to one dataset. JEBIN has several inherent advantages: it is a nonlinear, global model; its time complexity is linear in the number of genes, datasets, or samples; and it can integrate datasets with different distributions. We verified the effectiveness and scalability of JEBIN through a series of simulation experiments, and showed that it performs better on real biological data than commonly used integration algorithms. In addition, we conducted a differential co-expression analysis of hepatocellular carcinoma between single-cell and bulk RNA-seq data, and a comparison between hepatocellular carcinoma and its adjacent samples using the bulk RNA-seq data. The results indicate that JEBIN can obtain comprehensive and stable gene co-expression networks by integrating multiple datasets, and that it has broad prospects in the functional annotation of unknown genes and the inference of regulatory mechanisms of target genes.
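To make the idea of jointly embedding several bipartite networks concrete, the following is a minimal sketch and not the authors' JEBIN implementation. It assumes that each dataset is a non-negative genes-by-samples matrix, that each matrix defines a bipartite gene-sample network whose edge weights are the expression values, and that gene vectors are shared across datasets while sample vectors are dataset-specific. The training objective is a swapped-in skip-gram objective with negative sampling in the style of LINE. The names `joint_embed` and `cosine_similarity` and all hyperparameters are hypothetical choices for illustration.

```python
import numpy as np


def _sigmoid(x):
    # Clip the argument to avoid overflow in exp for extreme scores.
    return 1.0 / (1.0 + np.exp(-np.clip(x, -30.0, 30.0)))


def joint_embed(datasets, dim=8, steps=300, batch=32, n_neg=3, lr=0.05, seed=0):
    """Learn one vector per gene from several genes-by-samples matrices.

    Assumption: all matrices share the same gene order along axis 0 and
    contain non-negative values that serve as bipartite edge weights.
    """
    rng = np.random.default_rng(seed)
    n_genes = datasets[0].shape[0]
    # Gene vectors are shared across datasets; sample vectors are per dataset.
    gene_vecs = rng.normal(scale=0.1, size=(n_genes, dim))
    sample_vecs = [rng.normal(scale=0.1, size=(d.shape[1], dim)) for d in datasets]

    # One weighted edge list (gene index, sample index, probability) per dataset.
    edges = []
    for d in datasets:
        g, s = np.nonzero(d)
        w = d[g, s].astype(float)
        edges.append((g, s, w / w.sum()))

    for step in range(steps):
        k = step % len(datasets)  # visit the datasets in turn
        g_all, s_all, prob = edges[k]
        pick = rng.choice(len(prob), size=batch, p=prob)  # weight-proportional edges
        g, s = g_all[pick], s_all[pick]
        u = gene_vecs[g]        # copies of the current gene vectors
        v = sample_vecs[k][s]   # copies of the current sample vectors

        # Positive edges: ascend the gradient of log sigmoid(u . v).
        coef = 1.0 - _sigmoid(np.sum(u * v, axis=1))
        grad_u = coef[:, None] * v
        np.add.at(sample_vecs[k], s, lr * coef[:, None] * u)

        # Negative samples: ascend the gradient of log sigmoid(-u . v_neg).
        for _ in range(n_neg):
            s_neg = rng.integers(0, sample_vecs[k].shape[0], size=batch)
            v_neg = sample_vecs[k][s_neg]
            coef_neg = -_sigmoid(np.sum(u * v_neg, axis=1))
            grad_u += coef_neg[:, None] * v_neg
            np.add.at(sample_vecs[k], s_neg, lr * coef_neg[:, None] * u)

        # np.add.at accumulates correctly when a gene appears twice in a batch.
        np.add.at(gene_vecs, g, lr * grad_u)
    return gene_vecs


def cosine_similarity(emb):
    """Pairwise cosine similarity between rows, usable as an association score."""
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.maximum(norms, 1e-12)
    return unit @ unit.T


# Toy usage: two simulated datasets on different scales (synthetic data only).
rng = np.random.default_rng(1)
data_a = rng.poisson(5.0, size=(20, 12)).astype(float)
data_b = rng.poisson(50.0, size=(20, 8)).astype(float)
emb = joint_embed([data_a, data_b], dim=8, steps=300)
sim = cosine_similarity(emb)
```

In this sketch, each update touches only one mini-batch of edges from one dataset, so the cost per step does not depend on the total number of genes or samples. Because edges are sampled in proportion to their weight within each dataset, the two toy datasets contribute on comparable terms despite their different scales. Whether this matches how JEBIN handles differently distributed datasets is not stated in the abstract.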
