Segmentor3IsBack: an R package for the fast and exact segmentation of Seq-data

Background: Change point problems arise in many genomic analyses, such as the detection of copy number variations or the detection of transcribed regions. The expanding Next Generation Sequencing technologies now make it possible to locate change points at nucleotide resolution.

Results: Because its complexity is almost linear in the sequence length when the maximal number of segments is constant, and because its performance has been acknowledged on microarray data, we propose to use the Pruned Dynamic Programming (PDP) algorithm for Seq-experiment outputs. This requires adapting the algorithm to the negative binomial distribution with which we model the data. We show that the PDP algorithm can be used if the dispersion in the signal is known, and we provide an estimator for this dispersion. We describe a compression framework which reduces the time complexity without modifying the accuracy of the segmentation. We propose to estimate the number of segments via a penalized likelihood criterion. We illustrate the performance of the proposed methodology on RNA-Seq data.

Conclusions: We illustrate the results of our approach on a real dataset and show its good performance. Our algorithm is available as an R package on the CRAN repository.
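To make the segmentation problem concrete, the following is a minimal Python sketch of exact segmentation of count data under a negative binomial model with a known dispersion. It uses the standard, unpruned dynamic programming recursion (quadratic in the sequence length), not the pruned algorithm of the package, and it is not the package's API. The names `segment_cost`, `segment_counts`, `select_num_segments`, the parameter `phi`, and the linear penalty are all assumptions introduced for illustration; in particular, the linear penalty is a placeholder and not the penalized likelihood criterion proposed in the paper.

```python
import math


def segment_cost(n, s, phi):
    """Negative log-likelihood of one segment, up to terms constant in p.

    The segment holds n counts summing to s, modelled as NB(phi, p) with a
    known dispersion phi; p is replaced by its maximum likelihood estimate.
    """
    if s == 0:
        return 0.0  # p_hat = 1, so the likelihood term vanishes
    p_hat = n * phi / (n * phi + s)
    return -(n * phi * math.log(p_hat) + s * math.log(1.0 - p_hat))


def segment_counts(y, max_segments, phi):
    """Best segmentation of y into k segments, for k = 1..max_segments.

    Returns {k: (cost, segment_ends)}, where segment_ends are exclusive end
    indices. Assumes max_segments <= len(y).
    """
    n = len(y)
    prefix = [0]
    for value in y:
        prefix.append(prefix[-1] + value)

    inf = float("inf")
    # cost[k][t]: best cost of y[:t] split into exactly k segments
    cost = [[inf] * (n + 1) for _ in range(max_segments + 1)]
    # back[k][t]: start index of the last segment in that best split
    back = [[0] * (n + 1) for _ in range(max_segments + 1)]
    cost[0][0] = 0.0

    for k in range(1, max_segments + 1):
        for t in range(k, n + 1):
            for start in range(k - 1, t):
                c = cost[k - 1][start] + segment_cost(
                    t - start, prefix[t] - prefix[start], phi
                )
                if c < cost[k][t]:
                    cost[k][t] = c
                    back[k][t] = start

    results = {}
    for k in range(1, max_segments + 1):
        ends, t = [], n
        for j in range(k, 0, -1):  # walk the back-pointers from the end
            ends.append(t)
            t = back[j][t]
        results[k] = (cost[k][n], sorted(ends))
    return results


def select_num_segments(results, penalty_per_segment):
    """Pick k minimising cost + penalty_per_segment * k (placeholder penalty)."""
    return min(results, key=lambda k: results[k][0] + penalty_per_segment * k)


# Hypothetical toy profile: low, high, then low coverage.
counts = [1, 0, 2, 1, 0, 1, 2, 0, 40, 38, 45, 41, 39, 42, 3, 1, 0, 2, 1, 2]
toy_results = segment_counts(counts, max_segments=4, phi=1.0)
```

The triple loop visits every candidate start of the last segment, which is what makes this version quadratic in the sequence length for each number of segments; the pruning described in the abstract is precisely what avoids this cost.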
