Generalised empirical Bayesian methods for discovery of differential data in high-throughput biology

Motivation: High-throughput data are now commonplace in biological research. Rapidly changing technologies and applications mean that novel methods for detecting differential behaviour in a ‘large P, small n’ setting are required at an increasing rate. Such methods are generally developed on an ad hoc basis, which requires repeated development cycles and leads to a lack of standardization between analyses.

Results: We present a generalised method for identifying differential behaviour within high-throughput biological data through empirical Bayesian methods. The approach builds on our baySeq algorithm, which identifies differential expression in RNA-seq data using a negative binomial distribution and in paired data using a beta-binomial distribution. We show how the same empirical Bayesian approach can be applied to any parametric distribution, removing the need for lengthy development of novel methods for differently distributed data. Comparisons with existing methods developed to address specific problems in high-throughput biological data show that these generic methods can achieve equivalent or better performance. We also present a number of enhancements to the basic algorithm that increase flexibility and reduce computational costs.

Availability: The methods are implemented in the R baySeq (v2) package, available on Bioconductor at http://www.bioconductor.org/packages/release/bioc/html/baySeq.html.

Contact: tjh48@cam.ac.uk
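The claim that one empirical Bayesian procedure can serve any parametric distribution can be illustrated with a minimal sketch. This is not the baySeq implementation or its API: it is written in Python rather than R, and every function name, the Poisson density, the prior samples, and the model priors are assumptions chosen for illustration. It assumes that a model is a grouping of samples that share a parameter, that the empirical prior is a set of parameter values estimated from the data, and that each marginal likelihood is approximated by averaging the likelihood over those values.

```python
import math

def log_poisson(x, rate):
    """Log density of a Poisson count; a stand-in for any parametric density."""
    return x * math.log(rate) - rate - math.lgamma(x + 1)

def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def log_marginal(data, groups, prior_samples, log_density):
    """Approximate log P(data | model) for a single feature.

    `groups` gives one label per sample; samples with the same label share
    a parameter. Each group's likelihood is averaged over the parameter
    values in `prior_samples` (the assumed empirical prior).
    """
    total = 0.0
    for label in set(groups):
        obs = [x for x, g in zip(data, groups) if g == label]
        per_theta = [sum(log_density(x, th) for x in obs) for th in prior_samples]
        total += log_mean_exp(per_theta)
    return total

def posterior_probs(data, models, model_priors, prior_samples, log_density):
    """Posterior probability of each model via Bayes' rule, in log space."""
    logs = {
        name: math.log(model_priors[name])
        + log_marginal(data, groups, prior_samples, log_density)
        for name, groups in models.items()
    }
    m = max(logs.values())
    log_norm = m + math.log(sum(math.exp(v - m) for v in logs.values()))
    return {name: math.exp(v - log_norm) for name, v in logs.items()}

# Hypothetical inputs: four samples, two candidate models.
prior_samples = [2.0, 5.0, 10.0, 20.0, 40.0]   # assumed empirical prior draws
models = {"NDE": [0, 0, 0, 0],                 # all samples share one parameter
          "DE": [0, 0, 1, 1]}                  # two groups, two parameters
model_priors = {"NDE": 0.9, "DE": 0.1}         # assumed prior model weights

post = posterior_probs([4, 6, 38, 42], models, model_priors,
                       prior_samples, log_poisson)
```

Because the density is passed in as `log_density`, swapping the Poisson for another parametric family changes one argument and leaves the posterior calculation untouched, which is the sense in which the approach is generic. In this toy example the counts differ sharply between the two groups, so the "DE" model receives nearly all of the posterior mass.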
