De novo detection of copy number variation by co-assembly

MOTIVATION: Until now, genomes of individual organisms have mostly been compared from next-generation sequencing data by mapping reads to a reference genome. This is challenging when the reference is distant, and it introduces bias towards the exact sequence present in the reference. Recent improvements in both sequencing read length and the efficiency of assembly algorithms have brought direct comparison of individual genomes by de novo assembly, rather than through a reference genome, within reach.

RESULTS: Here, we develop and test an algorithm, named Magnolya, that uses a Poisson mixture model to estimate the copy number of contigs assembled from sequencing data. We combine this with co-assembly to allow de novo detection of copy number variation (CNV) between two individual genomes, without mapping reads to a reference genome. In co-assembly, multiple sequencing samples are combined to generate a single contig graph in which the traversal counts of nodes and edges differ between the samples. In the resulting 'coloured' graph, the contigs have integer copy numbers; this removes the need to segment genomic regions based on depth of coverage, as mapping-based detection methods require. Magnolya then assigns an integer copy number to each contig in each sample, after which CNV probabilities are easily inferred. The copy number estimator and CNV detector perform well on simulated data. Application of the algorithms to hybrid yeast genomes showed allotriploid content of different origins in the wine yeast Y12, and extensive CNV in aneuploid brewing yeast genomes. Integer CNV was also accurately detected in a short-term laboratory-evolved yeast strain.
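The idea of a Poisson mixture over integer copy numbers can be illustrated with a minimal sketch. This is not Magnolya's implementation: it assumes that the number of reads starting in a contig of length L and copy number c is Poisson-distributed with mean c·λ·L, fits λ and the mixture weights by expectation-maximisation, and assumes that the two samples' posteriors are independent when computing a CNV probability. The function names `fit_copy_numbers` and `cnv_probability`, the cap `max_cn`, and the toy counts are all hypothetical.

```python
import math


def log_poisson(k, mu):
    """Log of the Poisson probability mass function at count k with mean mu."""
    return k * math.log(mu) - mu - math.lgamma(k + 1)


def fit_copy_numbers(counts, lengths, max_cn=4, n_iter=50):
    """EM for a Poisson mixture whose component means are c * lam * length.

    Assumption (not taken from the paper): most contigs are single copy,
    so the median read-start rate is a reasonable initial guess for lam.
    Returns (lam, weights, posteriors, calls).
    """
    rates = sorted(k / L for k, L in zip(counts, lengths))
    lam = rates[len(rates) // 2]
    weights = [1.0 / max_cn] * max_cn
    post = []
    for _ in range(n_iter):
        # E-step: posterior over copy numbers 1..max_cn for each contig.
        post = []
        for k, L in zip(counts, lengths):
            logp = [math.log(weights[c - 1]) + log_poisson(k, c * lam * L)
                    for c in range(1, max_cn + 1)]
            top = max(logp)
            p = [math.exp(x - top) for x in logp]
            total = sum(p)
            post.append([x / total for x in p])
        # M-step: closed-form updates; weights are floored to avoid log(0).
        n = len(counts)
        weights = [max(sum(r[c] for r in post) / n, 1e-12)
                   for c in range(max_cn)]
        expected_copies = sum(r[c] * (c + 1) * L
                              for r, L in zip(post, lengths)
                              for c in range(max_cn))
        lam = sum(counts) / expected_copies
    calls = [max(range(max_cn), key=lambda c: r[c]) + 1 for r in post]
    return lam, weights, post, calls


def cnv_probability(post_a, post_b):
    """P(copy numbers differ), assuming the two posteriors are independent."""
    return 1.0 - sum(a * b for a, b in zip(post_a, post_b))


# Toy data (hypothetical): read-start counts on eight contigs of 1000 bp.
counts = [48, 52, 50, 51, 49, 100, 103, 149]
lengths = [1000] * 8
lam, weights, post, calls = fit_copy_numbers(counts, lengths)
print(calls)
```

In a co-assembly setting, the same contig would carry one count per sample; fitting each sample separately and passing the two posteriors for a contig to `cnv_probability` gives a per-contig CNV score under the independence assumption stated above.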
