Optimal Sparse Segment Identification With Application in Copy Number Variation Analysis

Motivated by DNA copy number variation (CNV) analysis based on high-density single nucleotide polymorphism (SNP) data, we consider the problem of detecting and identifying sparse short segments in a long one-dimensional sequence of data with additive Gaussian white noise, where the number, length, and location of the segments are unknown. We present a statistical characterization of the identifiable region of a segment, that is, the region where the segment can be reliably separated from noise. We develop an efficient likelihood ratio selection (LRS) procedure for identifying the segments and establish its asymptotic optimality, in the sense that the LRS can separate the signal segments from the noise as long as the signal segments are in the identifiable regions. The proposed method is demonstrated with simulations and with the analysis of a real dataset on the identification of copy number variants based on high-density SNP data. The results show that the LRS procedure can achieve greater power for detecting the true segments than some standard signal identification methods.
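
To make the idea of likelihood ratio selection concrete, the following is a minimal sketch of a scan-and-select procedure of this general kind, not the authors' implementation. It rests on assumptions the abstract does not state: the noise has unit variance, candidate intervals are limited to a hypothetical maximum length `max_len`, each interval is scored by its sum divided by the square root of its length, and the threshold `sqrt(2 log n)` is used as a common choice for Gaussian maxima. The function name `lrs_sketch` and all parameters are illustrative.

```python
import math
import random


def lrs_sketch(x, max_len, threshold):
    """Greedy interval selection (illustrative sketch, not the paper's code).

    Scores every interval of length <= max_len by |sum| / sqrt(length),
    keeps those above `threshold`, then picks intervals in decreasing
    score order while skipping any that overlap an already chosen one.
    Returns a sorted list of half-open (start, end) index pairs.
    """
    n = len(x)
    # Prefix sums let each interval sum be computed in O(1).
    prefix = [0.0]
    for value in x:
        prefix.append(prefix[-1] + value)

    candidates = []
    for start in range(n):
        for length in range(1, min(max_len, n - start) + 1):
            end = start + length  # exclusive end index
            stat = (prefix[end] - prefix[start]) / math.sqrt(length)
            if abs(stat) > threshold:
                candidates.append((abs(stat), start, end))

    # Highest-scoring intervals first; overlapping ones are discarded.
    candidates.sort(reverse=True)
    selected = []
    for _, start, end in candidates:
        if all(end <= s or start >= e for s, e in selected):
            selected.append((start, end))
    return sorted(selected)


if __name__ == "__main__":
    # Simulated example: unit-variance Gaussian noise plus one short segment.
    random.seed(0)
    n = 500
    x = [random.gauss(0.0, 1.0) for _ in range(n)]
    for i in range(200, 210):
        x[i] += 3.0  # assumed segment mean shift, chosen for illustration
    print(lrs_sketch(x, max_len=20, threshold=math.sqrt(2 * math.log(n))))
```

The prefix-sum scan costs on the order of `n * max_len` operations, which is why bounding the interval length matters when segments are short and the sequence is long.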
