Inferring gene-gene interactions and functional modules using sparse canonical correlation analysis

Networks are used across many scientific disciplines to analyze complex systems with interacting components. In particular, they are commonly used to model interactions between genes and to identify closely associated genes that form functional modules. In this paper, we focus on gene group interactions and infer these interactions using appropriate partial correlations between genes, that is, the conditional dependencies between genes after removing the influences of a set of other functionally related genes. We introduce a new method for estimating group interactions using sparse canonical correlation analysis (SCCA) coupled with repeated random partitioning and subsampling of the gene expression data set. By considering different subsets of genes and different ways of grouping them, our interaction measure can be viewed as an aggregated estimate of partial correlations of different orders. Our approach is unique in evaluating conditional dependencies when the correct dependent sets are unknown or only partially known. As a result, a gene network can be constructed using the interaction measures as edge weights, and gene functional groups can be inferred as tightly connected communities in the network. Comparisons with several popular approaches on simulated and real data show that our procedure improves both the statistical significance and the biological interpretability of the results. In addition to achieving considerably lower false positive rates, our procedure performs better at detecting important biological pathways.
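The general idea of combining SCCA with repeated random partitioning and subsampling can be illustrated with a minimal sketch. This is not the paper's procedure: the abstract does not give the algorithm, so the rank-one soft-thresholded solver, the pair weight `|rho| * |u_i * v_j|`, and all names and parameters (`sparse_cca`, `interaction_scores`, `frac`, `n_rep`, `sample_frac`) are assumptions chosen for illustration only.

```python
import numpy as np

def soft_threshold(a, lam):
    """Shrink entries of a toward zero by lam (L1 proximal operator)."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def sparse_cca(X, Y, frac=0.5, n_iter=50):
    """Rank-one sparse CCA via alternating soft-thresholded power iterations.

    Assumption: the threshold is a fraction `frac` of the largest absolute
    entry, a simple stand-in for a tuned sparsity penalty.
    """
    X = (X - X.mean(0)) / (X.std(0) + 1e-12)
    Y = (Y - Y.mean(0)) / (Y.std(0) + 1e-12)
    C = X.T @ Y / X.shape[0]                      # cross-correlation matrix
    v = np.linalg.svd(C, full_matrices=False)[2][0]  # leading right singular vector
    u = np.zeros(X.shape[1])
    for _ in range(n_iter):
        a = C @ v
        u = soft_threshold(a, frac * np.abs(a).max())
        if not u.any():
            return u, np.zeros(Y.shape[1]), 0.0
        u /= np.linalg.norm(u)
        b = C.T @ u
        v = soft_threshold(b, frac * np.abs(b).max())
        if not v.any():
            return u, v, 0.0
        v /= np.linalg.norm(v)
    rho = np.corrcoef(X @ u, Y @ v)[0, 1]         # canonical correlation
    return u, v, rho

def interaction_scores(expr, n_rep=200, sample_frac=0.8, seed=0):
    """Average SCCA-based pair weights over random subsamples and partitions.

    expr: (samples x genes) expression matrix.
    Returns a symmetric (genes x genes) matrix of hypothetical edge weights.
    """
    rng = np.random.default_rng(seed)
    n, p = expr.shape
    total = np.zeros((p, p))
    count = np.zeros((p, p))
    for _ in range(n_rep):
        rows = rng.choice(n, size=int(sample_frac * n), replace=False)
        perm = rng.permutation(p)
        g1, g2 = perm[: p // 2], perm[p // 2:]    # random two-group partition
        u, v, rho = sparse_cca(expr[np.ix_(rows, g1)], expr[np.ix_(rows, g2)])
        w = abs(rho) * np.abs(np.outer(u, v))     # assumed pair weight
        total[np.ix_(g1, g2)] += w
        total[np.ix_(g2, g1)] += w.T
        count[np.ix_(g1, g2)] += 1
        count[np.ix_(g2, g1)] += 1
    # Average only over repetitions in which the two genes fell in different groups.
    return np.divide(total, count, out=np.zeros_like(total), where=count > 0)

# Usage on simulated data: genes 0-3 share a latent factor, genes 4-11 are noise.
rng = np.random.default_rng(1)
latent = rng.normal(size=(100, 1))
expr = rng.normal(size=(100, 12))
expr[:, :4] = latent + 0.3 * rng.normal(size=(100, 4))
scores = interaction_scores(expr)
```

The resulting `scores` matrix could serve as edge weights for a gene network, on which a community detection method would then look for tightly connected groups. Averaging only over repetitions where two genes land in different groups is one possible design choice; other aggregation rules are equally plausible.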
