Learning Transcriptional Regulatory Relationships Using Sparse Graphical Models

Understanding the organization and function of transcriptional regulatory networks by analyzing high-throughput gene expression profiles is a key problem in computational biology. The challenges in this work are 1) incomplete knowledge of the regulatory relationships between regulators and their associated genes, 2) the potential for spurious associations due to confounding factors, and 3) a number of parameters to learn that usually exceeds the number of available microarray experiments. We present a sparse (L1-regularized) graphical model to address these challenges. Our model incorporates known transcription factors and introduces hidden variables to represent possible unknown transcription factors and confounding factors. The expression level of a gene is modeled as a linear combination of the expression levels of known transcription factors and hidden factors. Using gene expression data covering 39,296 oligonucleotide probes from 1,109 human liver samples, we demonstrate that our model predicts out-of-sample data better than a model with no hidden variables. We also show that some of the gene sets associated with hidden variables are strongly correlated with Gene Ontology categories. The software, including source code, is available at http://grnl1.codeplex.com.
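
The model structure described above (gene expression as a sparse linear combination of known transcription factor levels plus a few hidden factors) can be illustrated with a small NumPy sketch. This is not the authors' algorithm or released software: the function names, the alternating scheme, the use of proximal gradient (ISTA) for the L1 step, and the truncated SVD for the hidden factors are all assumptions chosen for brevity. In particular, the hidden-factor loadings here are not L1-penalized.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of t * ||x||_1 (element-wise shrinkage toward zero)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_ista(X, Y, lam, n_iter=500):
    """Minimise 0.5/n * ||Y - X W||_F^2 + lam * ||W||_1 by proximal gradient.

    X: (samples, known TFs), Y: (samples, genes). Returns W: (known TFs, genes).
    """
    n = X.shape[0]
    W = np.zeros((X.shape[1], Y.shape[1]))
    # Step size 1/L, with L the Lipschitz constant of the smooth term's gradient.
    L = np.linalg.norm(X, 2) ** 2 / n
    for _ in range(n_iter):
        grad = X.T @ (X @ W - Y) / n
        W = soft_threshold(W - grad / L, lam / L)
    return W

def fit_sparse_hidden(X, Y, n_hidden, lam, n_outer=10):
    """Hypothetical alternating fit of Y ~ X W + H V.

    W: sparse weights on known TFs (L1 step).
    H: (samples, n_hidden) hidden factor levels, V: (n_hidden, genes) loadings,
    both taken from a truncated SVD of the current residual.
    """
    n, g = Y.shape
    H = np.zeros((n, n_hidden))
    V = np.zeros((n_hidden, g))
    for _ in range(n_outer):
        # Fit known-TF weights to the part of Y not explained by hidden factors.
        W = lasso_ista(X, Y - H @ V, lam)
        # Best rank-n_hidden approximation of what the known TFs leave unexplained.
        R = Y - X @ W
        U, s, Vt = np.linalg.svd(R, full_matrices=False)
        H = U[:, :n_hidden] * s[:n_hidden]
        V = Vt[:n_hidden]
    return W, H, V
```

A call such as `W, H, V = fit_sparse_hidden(X, Y, n_hidden=2, lam=0.05)` returns exact zeros in `W` wherever the L1 penalty removes a regulator-gene edge. The values of `lam` and `n_hidden` are illustrative and would normally be chosen by held-out prediction error.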
