Bayesian Correlation Analysis for Sequence Count Data

Evaluating the similarity of different measured variables is a fundamental task of statistics, and a key part of many bioinformatics algorithms. Here we propose a Bayesian scheme for estimating the correlation between different entities’ measurements based on high-throughput sequencing data. These entities could be different genes or miRNAs whose expression is measured by RNA-seq, different transcription factors or histone marks whose expression is measured by ChIP-seq, or even combinations of different types of entities. Our Bayesian formulation accounts for both measured signal levels and uncertainty in those levels, due to varying sequencing depth in different experiments and to varying absolute levels of individual entities, both of which affect the precision of the measurements. In comparison with a traditional Pearson correlation analysis, we show that our Bayesian correlation analysis retains high correlations when measurement confidence is high, but suppresses correlations when measurement confidence is low—especially for entities with low signal levels. In addition, we consider the influence of priors on the Bayesian correlation estimate. Perhaps surprisingly, we show that naive, uniform priors on entities’ signal levels can lead to highly biased correlation estimates, particularly when different experiments have widely varying sequencing depths. However, we propose two alternative priors that provably mitigate this problem. We also prove that, like traditional Pearson correlation, our Bayesian correlation calculation constitutes a kernel in the machine learning sense, and thus can be used as a similarity measure in any kernel-based machine learning algorithm. We demonstrate our approach on two RNA-seq datasets and one miRNA-seq dataset.

[1]  W. Huber,et al.  Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2 , 2014, Genome Biology.

[2]  J. D. Engel,et al.  Developmental transcriptome analysis of human erythropoiesis. , 2014, Human molecular genetics.

[3]  Joshua M. Stuart,et al.  The Cancer Genome Atlas Pan-Cancer analysis project , 2013, Nature Genetics.

[4]  Ning Leng,et al.  EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments , 2013, Bioinform..

[5]  David G Hendrickson,et al.  Differential analysis of gene regulation at transcript resolution with RNA-seq , 2012, Nature Biotechnology.

[6]  Alyssa C. Frazee,et al.  ReCount: A multi-experiment resource of analysis-ready RNA-seq gene count datasets , 2011, BMC Bioinformatics.

[7]  Raymond K. Auerbach,et al.  A User's Guide to the Encyclopedia of DNA Elements (ENCODE) , 2011, PLoS biology.

[8]  M. Gerstein,et al.  Unlocking the secrets of the genome , 2009, Nature.

[9]  K. Voelkerding,et al.  Next-generation sequencing: from basic research to diagnostics. , 2009, Clinical chemistry.

[10]  Eric T. Wang,et al.  Alternative Isoform Regulation in Human Tissue Transcriptomes , 2008, Nature.

[11]  X. Chen,et al.  Characterization of microRNAs in serum: a novel class of biomarkers for diagnosis of cancer and other diseases , 2008, Cell Research.

[12]  Hanlee P. Ji,et al.  Next-generation DNA sequencing , 2008, Nature Biotechnology.

[13]  E. Mardis The impact of next-generation sequencing technology on genetics. , 2008, Trends in genetics : TIG.

[14]  T. Mikkelsen,et al.  Genome-wide maps of chromatin state in pluripotent and lineage-committed cells , 2007, Nature.

[15]  C. Sander,et al.  A Mammalian microRNA Expression Atlas Based on Small RNA Library Sequencing , 2007, Cell.

[16]  C. Benz,et al.  Optimized high-throughput microRNA expression profiling provides novel biomarker assessment of clinical prostate and breast cancer biopsies , 2006, Molecular Cancer.

[17]  Chris Wiggins,et al.  ARACNE: An Algorithm for the Reconstruction of Gene Regulatory Networks in a Mammalian Cellular Context , 2004, BMC Bioinformatics.

[18]  Paul T. Groth,et al.  The ENCODE (ENCyclopedia Of DNA Elements) Project , 2004, Science.

[19]  Homin K. Lee,et al.  Coexpression analysis of human genes across many microarray data sets. , 2004, Genome research.

[20]  Joshua M. Stuart,et al.  A Gene-Coexpression Network for Global Discovery of Conserved Genetic Modules , 2003, Science.

[21]  A. Butte,et al.  Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks. , 2000, Proceedings of the National Academy of Sciences of the United States of America.

[22]  I S Kohane,et al.  Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. , 1999, Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing.

[23]  J. Claverie,et al.  The significance of digital gene expression profiles. , 1997, Genome research.

[24]  Michael Rabadi,et al.  Kernel Methods for Machine Learning , 2015 .

[25]  Yuan Qi,et al.  Integrated Genomic Analysis Identifies Clinically Relevant Subtypes of Glioblastoma Characterized by Abnormalities in PDGFRA , IDH 1 , EGFR , and NF 1 Citation Verhaak , 2010 .

[26]  Bernhard Schölkopf,et al.  Kernel Methods in Computational Biology , 2005 .

[27]  N. Cristianini,et al.  Kernel Methods for Pattern Analysis , 2004 .