A modified cross entropy method for detecting multiple change points in DNA count data

We model DNA count data as a multiple change point problem, in which the data are divided into segments by an unknown number of change points. Each segment is assumed to be generated by a process with its own distributional characteristics. In this paper, we propose a modified version of the Cross-Entropy (CE) method that uses the Beta distribution to simulate the locations of change points. Several stopping criteria are also discussed. The proposed CE method is applied to over-dispersed count data, in which the observations are modelled as independent Negative Binomial random variables. Furthermore, we incorporate the Bayesian Information Criterion within the CE method to identify the optimal number of change points, without fixing the maximum number of change points in the data sequence in advance. We obtain estimates for artificial data using the modified CE method and compare the results with those of the general CE method, which uses the normal distribution to simulate the locations of the change points. The methods are applied to a real DNA count data set to illustrate the usefulness of the proposed modified CE method.
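The following is a minimal Python sketch of the general idea, not the authors' implementation. Change point locations are drawn from Beta distributions, each candidate segmentation is scored with a Negative Binomial log-likelihood, the Beta parameters are refitted to the elite samples, and the number of change points is grown until the BIC stops improving. All function names, the moment-based parameter fits, the smoothing constant, the single stopping rule (elite spread below `tol` positions), the BIC parameter count, and the toy data are assumptions made for illustration.

```python
import math
import random

def nb_loglik(seg):
    """Negative Binomial log-likelihood of one segment; (r, p) fitted by moments."""
    n = len(seg)
    mean = sum(seg) / n
    if mean == 0:
        return 0.0  # an all-zero segment is fitted exactly
    var = sum((x - mean) ** 2 for x in seg) / n
    var = max(var, 1.01 * mean)  # assumption: force slight over-dispersion
    r = mean * mean / (var - mean)
    p = mean / var
    return sum(math.lgamma(x + r) - math.lgamma(r) - math.lgamma(x + 1)
               + r * math.log(p) + x * math.log(1 - p) for x in seg)

def score(data, cps):
    """Total log-likelihood of the segmentation defined by sorted change points."""
    bounds = [0] + list(cps) + [len(data)]
    return sum(nb_loglik(data[s:e]) for s, e in zip(bounds, bounds[1:]))

def bic(data, cps):
    # Assumption: two parameters per segment plus one per change point.
    n_params = 2 * (len(cps) + 1) + len(cps)
    return -2 * score(data, cps) + n_params * math.log(len(data))

def ce_beta(data, k, n_samples=150, n_elite=15, smooth=0.7,
            max_iter=30, tol=1.0, rng=random):
    """CE search for k change points; returns the best list found, or None."""
    n = len(data)
    a, b = [1.0] * k, [1.0] * k  # Beta(1, 1) is uniform over locations
    best_cps, best_score = None, -math.inf
    for _ in range(max_iter):
        scored, attempts = [], 0
        while len(scored) < n_samples and attempts < 20 * n_samples:
            attempts += 1
            cps = sorted(min(n - 1, 1 + int(rng.betavariate(a[j], b[j]) * (n - 1)))
                         for j in range(k))
            if len(set(cps)) == k:  # keep only distinct locations
                scored.append((score(data, cps), cps))
        if len(scored) < n_elite:
            break
        scored.sort(key=lambda t: t[0], reverse=True)
        if scored[0][0] > best_score:
            best_score, best_cps = scored[0]
        elite = [cps for _, cps in scored[:n_elite]]
        spread = 0.0
        for j in range(k):
            u = [cps[j] / n for cps in elite]
            m = sum(u) / len(u)
            v = sum((x - m) ** 2 for x in u) / len(u)
            spread = max(spread, math.sqrt(v) * n)
            v = max(v, 1e-4 * m * (1 - m))  # keep the Beta fit finite
            common = max(m * (1 - m) / v - 1, 1e-3)
            # Smoothed update of the Beta parameters from the elite moments.
            a[j] = smooth * m * common + (1 - smooth) * a[j]
            b[j] = smooth * (1 - m) * common + (1 - smooth) * b[j]
        if spread < tol:  # stop once the elite samples agree
            break
    return best_cps

def select_change_points(data, rng=random):
    """Increase k until the BIC no longer improves (no preset maximum)."""
    best_cps, best_bic = [], bic(data, [])
    for k in range(1, len(data)):
        cps = ce_beta(data, k, rng=rng)
        if cps is None or bic(data, cps) >= best_bic:
            break
        best_cps, best_bic = cps, bic(data, cps)
    return best_cps

# Toy data (assumed): low counts followed by high counts, change at index 60.
rng = random.Random(0)
data = [rng.randint(0, 6) for _ in range(60)] + [rng.randint(20, 60) for _ in range(60)]
found = select_change_points(data, rng)
print(found)
```

Replacing the Beta draws with truncated normal draws would give a sketch of the general CE variant that the abstract uses as a comparison. The method-of-moments fit could be swapped for maximum likelihood estimation at the cost of a longer inner loop.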
