Bias correction and Bayesian analysis of aggregate counts in SAGE libraries

Background

Tag-based techniques, such as SAGE, are commonly used to sample the mRNA pool of an organism's transcriptome. Incomplete digestion during the tag formation process may allow multiple tags to be generated from a given mRNA transcript, and the probability of forming a tag varies with its relative location on the transcript. As a result, the observed tag counts represent a biased sample of the actual transcript pool. In SAGE this bias can be avoided by ignoring all but the 3'-most tag, but doing so discards a large fraction of the observed data. Taking the bias into account instead should allow more of the available data to be used, leading to increased statistical power.

Results

Three new hierarchical models, which directly embed a model for the variation in tag formation probability, are proposed and their associated Bayesian inference algorithms are developed. These models may be applied to libraries at both the tag and aggregate level. Simulation experiments and analysis of real data are used to contrast the accuracy of the various methods. The consequences of tag formation bias are discussed in the context of testing for differential expression, and a description is given of how these algorithms can be applied in that context.

Conclusions

Several Bayesian inference algorithms that account for tag formation effects are compared with the DPB algorithm, providing clear evidence of superior performance. The accuracy of inferences when using a particular non-informative prior is found to depend on the expression level of a given gene. The multivariate nature of the approach easily allows both univariate and joint tests of differential expression. Calculations demonstrate the potential for false positive and false negative findings when testing for differential expression, due to variation in tag formation probabilities across samples.
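
The position dependence of tag formation described in the Background can be illustrated with a small simulation. This is a minimal sketch under a simplifying assumption that is not stated in the abstract: each candidate site is cleaved independently with the same probability p, and the tag comes from the 3'-most cleaved site, so site i (counted from the 3' end) yields a tag with probability p(1-p)^(i-1). The function names and the values of p and the site count are hypothetical and chosen only for illustration; they are not the models proposed here.

```python
import random
from collections import Counter


def tag_position_probs(n_sites, p):
    """Probability that the tag comes from site i (1 = 3'-most site).

    Assumes independent cleavage with probability p at every site and that
    the tag is formed at the 3'-most cleaved site (an illustrative model).
    """
    return [p * (1 - p) ** (i - 1) for i in range(1, n_sites + 1)]


def simulate_tags(n_transcripts, n_sites, p, seed=0):
    """Simulate tag formation and count tags observed at each site."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_transcripts):
        # Walk from the 3' end; the first cleaved site produces the tag.
        for site in range(1, n_sites + 1):
            if rng.random() < p:
                counts[site] += 1
                break
    return counts


probs = tag_position_probs(n_sites=3, p=0.8)
counts = simulate_tags(n_transcripts=10000, n_sites=3, p=0.8)
for site, prob in enumerate(probs, start=1):
    print(f"site {site}: expected fraction {prob:.3f}, simulated count {counts[site]}")
```

Under this assumed model, keeping only the 3'-most tag throws away every tag formed at sites 2 and beyond, and the share of such tags grows as p falls, which is the data loss the Background refers to.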
