Multiple Break-Points Detection in Array CGH Data via the Cross-Entropy Method

Array comparative genomic hybridization (aCGH) is a widely used technique for detecting copy number variations in a genome at high resolution. Knowing the number of break-points and their locations in genomic sequences serves several biological needs. Primarily, it helps to identify disease-causing genes that are functionally important in characterizing genome-wide diseases. For human autosomes the normal copy number is two, whereas it increases at the sites of oncogenes (gain of DNA) and decreases at tumour suppressor genes (loss of DNA). Most current detection methods are deterministic in their set-up and use dynamic programming or various smoothing techniques to estimate copy number variations. The assumptions built into these approaches limit the search space of the problem, and the methods do not represent the uncertainty associated with the unknown break-points in genomic sequences. We propose the Cross-Entropy method, a model-based stochastic optimization technique, as an exact search method for estimating both the number and the locations of the break-points in aCGH data. We model the continuous-scale log-ratio data obtained by the aCGH technique as a multiple break-point problem. The proposed methodology is compared with well-established, publicly available methods using both artificially generated and real data. Results show that the proposed procedure is an effective way of estimating the number and, especially, the locations of break-points with a high level of precision. Availability: The methods described in this article are implemented in the new R package breakpoint, which is available from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=breakpoint.
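To make the idea concrete, the following is a minimal Python sketch of a Cross-Entropy search for break-points in a piecewise-constant signal. It is not the implementation in the breakpoint package. The function names (`segment_rss`, `sample_cuts`, `ce_breakpoints`, `select_breakpoints`, `make_demo_data`), the rounded-normal sampling distribution, the residual-sum-of-squares score, the BIC-style penalty used to choose the number of break-points, and all tuning values are assumptions made for illustration only.

```python
import math
import random


def segment_rss(y, cuts):
    """Residual sum of squares when y is split at the sorted cut indices."""
    bounds = [0, *cuts, len(y)]
    rss = 0.0
    for a, b in zip(bounds[:-1], bounds[1:]):
        seg = y[a:b]
        mean = sum(seg) / len(seg)
        rss += sum((v - mean) ** 2 for v in seg)
    return rss


def sample_cuts(mu, sigma, n, rng):
    """Draw sorted cut indices in 1..n-1; return None if any coincide."""
    cuts = sorted(min(n - 1, max(1, round(rng.gauss(m, s))))
                  for m, s in zip(mu, sigma))
    return tuple(cuts) if len(set(cuts)) == len(cuts) else None


def ce_breakpoints(y, k, n_samples=200, elite_frac=0.1, n_iter=30, seed=0):
    """Cross-Entropy search for k break-points (hypothetical settings)."""
    n = len(y)
    assert 0 <= k < n
    if k == 0:
        return (), segment_rss(y, ())
    rng = random.Random(seed)
    mu = [n * (j + 1) / (k + 1) for j in range(k)]  # evenly spaced start
    sigma = [n / 4.0] * k                           # wide initial spread
    n_elite = max(2, int(elite_frac * n_samples))
    best = None
    for _ in range(n_iter):
        scored = []
        while len(scored) < n_samples:
            cuts = sample_cuts(mu, sigma, n, rng)
            if cuts is not None:
                scored.append((segment_rss(y, cuts), cuts))
        scored.sort()
        if best is None or scored[0][0] < best[0]:
            best = scored[0]
        # Refit the sampling distribution to the elite (lowest-RSS) samples.
        elite = [c for _, c in scored[:n_elite]]
        for j in range(k):
            vals = [c[j] for c in elite]
            mu[j] = sum(vals) / len(vals)
            sigma[j] = math.sqrt(sum((v - mu[j]) ** 2 for v in vals) / len(vals))
        if max(sigma) < 0.5:  # sampling distribution has degenerated
            break
    return best[1], best[0]


def select_breakpoints(y, k_max=4):
    """Pick the number of break-points with a BIC-style penalty (assumption)."""
    n = len(y)
    best = None
    for k in range(k_max + 1):
        cuts, rss = ce_breakpoints(y, k)
        score = n * math.log(rss / n) + (2 * k + 2) * math.log(n)
        if best is None or score < best[0]:
            best = (score, cuts)
    return best[1]


def make_demo_data(seed=1):
    """Synthetic log-ratio-like signal with true break-points at 30 and 75."""
    rng = random.Random(seed)
    means = [0.0] * 30 + [1.0] * 45 + [0.0] * 45
    return [m + rng.gauss(0.0, 0.2) for m in means]


y = make_demo_data()
print(ce_breakpoints(y, 2)[0])
print(select_breakpoints(y))
```

Each iteration samples candidate break-point vectors, keeps the best-scoring fraction, and refits the sampling distribution to those elite samples, so the distribution concentrates around good locations. A plain BIC penalty is used here only because it is short to write; it may overestimate the number of break-points on noisy data.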
