Robust detection and identification of sparse segments in ultrahigh dimensional data analysis

Copy number variants (CNVs) are alternations of DNA of a genome that results in the cell having a less or more than two copies of segments of the DNA. CNVs correspond to relatively large regions of the genome, ranging from about one kilobase to several megabases, that are deleted or duplicated. Motivated by CNV analysis based on next generation sequencing data, we consider the problem of detecting and identifying sparse short segments hidden in a long linear sequence of data with an unspecified noise distribution. We propose a computationally efficient method that provides a robust and near-optimal solution for segment identification over a wide range of noise distributions. We theoretically quantify the conditions for detecting the segment signals and show that the method near-optimally estimates the signal segments whenever it is possible to detect their existence. Simulation studies are carried out to demonstrate the efficiency of the method under different noise distributions. We present results from a CNV analysis of a HapMap Yoruban sample to further illustrate the theory and the methods.

[1]  Hanlee P. Ji,et al.  Next-generation DNA sequencing , 2008, Nature Biotechnology.

[2]  Hongzhe Li,et al.  Optimal Sparse Segment Identification With Application in Copy Number Variation Analysis , 2010, Journal of the American Statistical Association.

[3]  Thomas W. Mühleisen,et al.  Large recurrent microdeletions associated with schizophrenia , 2008, Nature.

[4]  S. Mccarroll,et al.  Copy-number variation and association studies of human disease , 2007, Nature Genetics.

[5]  Huanming Yang,et al.  Structural variation in two human genomes mapped at single-nucleotide resolution by whole genome de novo assembly , 2011, Nature Biotechnology.

[6]  Jiashun Jin,et al.  Optimal detection of heterogeneous and heteroscedastic mixtures , 2011 .

[7]  Simon Tavaré,et al.  CNAseg - a novel framework for identification of copy number changes in cancer from second-generation sequencing data , 2010, Bioinform..

[8]  J. Lupski,et al.  Schizophrenia: Incriminating genomic evidence , 2008, Nature.

[9]  Tom Walsh,et al.  Accurate and exact CNV identification from targeted high-throughput sequence data , 2011, BMC Genomics.

[10]  Harrison H. Zhou,et al.  Robust nonparametric estimation via wavelet median regression , 2008, 0810.4802.

[11]  Christopher A. Miller,et al.  ReadDepth: A Parallel R Package for Detecting Copy Number Alterations from Short Sequencing Reads , 2011, PloS one.

[12]  M. Hurles,et al.  Copy number variation in human health, disease, and evolution. , 2009, Annual review of genomics and human genetics.

[13]  Kenny Q. Ye,et al.  Large-Scale Copy Number Polymorphism in the Human Genome , 2004, Science.

[14]  Xiaoming Huo,et al.  Near-optimal detection of geometric objects by fast multiscale methods , 2005, IEEE Transactions on Information Theory.

[15]  J. Lupski Structural variation in the human genome. , 2007, The New England journal of medicine.

[16]  Peter J. Park,et al.  rSW-seq: Algorithm for detection of copy number alterations in deep sequencing data , 2010, BMC Bioinformatics.

[17]  Derek Y. Chiang,et al.  High-resolution mapping of copy-number alterations with massively parallel sequencing , 2009, Nature Methods.

[18]  Sharon J. Diskin,et al.  Copy number variation at 1q21.1 associated with neuroblastoma , 2009, Nature.

[19]  J. Ahringer,et al.  Systematic bias in high-throughput sequencing data and its correction by BEADS , 2011, Nucleic acids research.

[20]  Harrison H. Zhou A Note on Quantile Coupling Inequalities and Their Applications , 2006 .

[21]  M. Gerstein,et al.  CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. , 2011, Genome research.

[22]  R. Wilson,et al.  BreakDancer: An algorithm for high resolution mapping of genomic structural variation , 2009, Nature Methods.

[23]  P. Visscher,et al.  Rare chromosomal deletions and duplications increase risk of schizophrenia , 2008, Nature.

[24]  Kenny Q. Ye,et al.  Sensitive and accurate detection of copy number variants using read depth of coverage. , 2009, Genome research.

[25]  Alexander Eckehart Urban,et al.  High-resolution mapping of DNA copy alterations in human chromosome 22 using high-density tiling oligonucleotide arrays. , 2006, Proceedings of the National Academy of Sciences of the United States of America.

[26]  T. Tony Cai,et al.  Asymptotic Equivalence and Adaptive Estimation for Robust Nonparametric Regression , 2009, 0909.0343.

[27]  W. Wong,et al.  Modeling non-uniformity in short-read rates in RNA-Seq data , 2010, Genome Biology.

[28]  G. Walther Optimal and fast detection of spatial clusters with scan statistics , 2010, 1002.4770.

[29]  Chao Xie,et al.  CNV-seq, a new method to detect copy number variation using high-throughput sequencing , 2009, BMC Bioinformatics.

[30]  M. Wigler,et al.  Circular binary segmentation for the analysis of array-based DNA copy number data. , 2004, Biostatistics.

[31]  Héctor Corrada Bravo,et al.  Model-based quality assessment and base-calling for second-generation sequencing data. , 2010, Biometrics.

[32]  A. Singleton,et al.  Rare Structural Variants Disrupt Multiple Genes in Neurodevelopmental Pathways in Schizophrenia , 2008, Science.

[33]  Kenny Q. Ye,et al.  Mapping copy number variation by population scale genome sequencing , 2010, Nature.

[34]  Bradley P. Coe,et al.  Genome structural variation discovery and genotyping , 2011, Nature Reviews Genetics.

[35]  John Quackenbush Microarray data normalization and transformation , 2002, Nature Genetics.

[36]  D. Conrad,et al.  Global variation in copy number in the human genome , 2006, Nature.

[37]  Paul Medvedev,et al.  Computational methods for discovering structural variation with next-generation sequencing , 2009, Nature Methods.