MAnorm: a robust model for quantitative comparison of ChIP-Seq data sets

High-throughput sequencing assays such as RNA-Seq, ChIP-Seq, or barcode counting provide quantitative readouts in the form of count data. Inferring differential signal in such data correctly and with good statistical power requires an estimate of data variability throughout the dynamic range and a suitable error model. We propose a method based on the negative binomial distribution, with variance and mean linked by local regression, and present an implementation, DESeq, as an R/Bioconductor package.
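To make the error model concrete, the following minimal sketch shows how a negative binomial distribution can be specified by a mean and a variance, which is the form needed when the variance is estimated as a function of the mean. It is an illustration only, not the DESeq implementation (which is an R/Bioconductor package); the function names `nb_params`, `nb_log_pmf`, and `nb_pmf` are hypothetical, and the local regression step that would supply the variance is assumed to have been done elsewhere.

```python
import math


def nb_params(mean, variance):
    """Convert (mean, variance) into the (r, p) form of the negative binomial.

    Hypothetical helper for illustration. With
    P(K = k) = Gamma(k + r) / (k! * Gamma(r)) * p**r * (1 - p)**k,
    the mean is r(1 - p)/p and the variance is r(1 - p)/p**2.
    """
    if not (variance > mean > 0):
        # The negative binomial only models overdispersed counts.
        raise ValueError("negative binomial requires variance > mean > 0")
    p = mean / variance                    # success probability
    r = mean * mean / (variance - mean)    # size parameter (need not be an integer)
    return r, p


def nb_log_pmf(k, mean, variance):
    """Log-probability of observing count k under the mean/variance form."""
    r, p = nb_params(mean, variance)
    return (
        math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
        + r * math.log(p) + k * math.log1p(-p)
    )


def nb_pmf(k, mean, variance):
    """Probability of observing count k under the mean/variance form."""
    return math.exp(nb_log_pmf(k, mean, variance))


if __name__ == "__main__":
    # Example: a gene with expected count 10 and variance 30 (assumed values).
    for k in (0, 5, 10, 20):
        print(k, nb_pmf(k, mean=10.0, variance=30.0))
```

Working on the log scale with `math.lgamma` keeps the computation stable for large counts. In a full analysis the `variance` argument would come from a smooth fit of per-gene variance against per-gene mean rather than being supplied by hand.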
