Detection copy number variants profile by multiple constrained optimization

Copy number variation (CNV), caused by genome rearrangement, generally refers to an increase or decrease in the copy number of large genome segments longer than 1 kb. Such variations mainly appear as submicroscopic deletions and duplications. CNV is an important component of genome structural variation and is one of the pathogenic factors of human disease. Next-generation sequencing (NGS) is a popular technology for CNV detection and has been widely used in many fields of life science research; it offers high throughput at low cost. By tailoring NGS technology, it is possible to sequence individual cells. Such single-cell sequencing can reveal the gene expression status and genomic variation profile of a single cell, and it is promising for the study of tumors, developmental biology, neuroscience, and other fields. However, CNV detection from NGS data faces two challenging problems. First, single-cell sequencing requires a special genome amplification step to accumulate enough material, and this step introduces a large amount of bias, which makes calling copy number variants difficult. Many popular copy number calling methods were designed for bulk sequencing; their performance is inconsistent, and they cannot be applied directly to single-cell sequencing data. Second, genomic data from multiple samples must be analyzed simultaneously so that similar cells can be grouped accurately and efficiently. The high level of noise in single-cell sequencing data reduces the reliability of sequence reads and leads to inaccurate patterns of variation. To find CNVs in NGS data reliably, in this thesis we first establish a workflow for analyzing NGS and single-cell sequencing data. CNV identification is formulated as a quadratic optimization problem.
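The abstract does not state the objective or the constraints of its quadratic optimization problem, so the following is only a minimal sketch of what a quadratic formulation of copy number estimation can look like, not the method of the thesis. It assumes a hypothetical vector of normalized read depths per genomic bin, `y`, and estimates a smoothed profile `x` by minimizing `sum((y[i] - x[i])**2) + lam * sum((x[i+1] - x[i])**2)`. The function name `smooth_read_depth`, the penalty weight `lam`, and the example values are all illustrative assumptions.

```python
def smooth_read_depth(y, lam):
    """Minimize sum((y - x)^2) + lam * sum((x[i+1] - x[i])^2) over x.

    Setting the gradient to zero gives the linear system (I + lam * D^T D) x = y,
    where D is the first-difference operator. The matrix is tridiagonal,
    so it is solved here with the Thomas algorithm in O(n) time.
    """
    n = len(y)
    if n < 2:
        return [float(v) for v in y]

    # Main diagonal of I + lam * D^T D: the end bins have one neighbour, the others two.
    diag = [1.0 + lam * (1.0 if i in (0, n - 1) else 2.0) for i in range(n)]
    off = -lam  # constant sub- and super-diagonal entry

    # Forward elimination.
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = off / diag[0]
    d[0] = y[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - off * c[i - 1]
        c[i] = off / denom
        d[i] = (y[i] - off * d[i - 1]) / denom

    # Back substitution.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


# Hypothetical normalized read depths: a gain spanning bins 4 to 6.
depths = [2.1, 1.9, 2.0, 2.2, 3.9, 4.1, 4.0, 2.0, 1.8]
profile = smooth_read_depth(depths, lam=2.0)
```

A purely quadratic penalty like this one blurs breakpoints. Methods that aim for piecewise-constant profiles usually replace or supplement it with sparsity or total-variation terms, which this sketch leaves out.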
