Robust adjustment of sequence tag abundance

MOTIVATION: Most next-generation sequencing technologies sample small amounts of DNA or RNA that are amplified (i.e. copied) before sequencing. Amplification is imperfect and can introduce extreme bias in sequenced read counts. We present a novel procedure to account for amplification bias and demonstrate its effectiveness in mitigating gene length dependence when estimating true gene expression.

RESULTS: We tested the proposed method on simulated and real data. Simulations indicated that our method captures true gene expression more effectively than classic censoring-based approaches and leads to power gains in differential expression testing, particularly for shorter genes with high transcription rates. We also applied our method to an unreplicated Arabidopsis RNA-seq dataset; the subsequent gene set enrichment analyses yielded disparate gene ontologies.

AVAILABILITY AND IMPLEMENTATION: R code to perform the RASTA procedures is freely available on the web at www.stat.purdue.edu/~doerge/.
