Misassembly detection using paired-end sequence reads and optical mapping data

Motivation: A crucial problem in genome assembly is the discovery and correction of misassembly errors in draft genomes. We develop a method called misSEQuel that enhances the quality of draft genomes by identifying misassembly errors and their breakpoints using paired-end sequence reads and optical mapping data. Our method also fulfills the critical need for open-source computational methods for analyzing optical mapping data. We apply our method to various assemblies of the loblolly pine, Francisella tularensis, rice and budgerigar genomes. We generated and used simulated optical mapping data for loblolly pine and F. tularensis, and used real optical mapping data for rice and budgerigar.

Results: Our results demonstrate that we detect more than 54% of extensively misassembled contigs and more than 60% of locally misassembled contigs in assemblies of F. tularensis, and between 31% and 100% of extensively misassembled contigs and between 57% and 73% of locally misassembled contigs in assemblies of loblolly pine. Using the real optical mapping data, we correctly identified 75% of extensively misassembled contigs and 100% of locally misassembled contigs in rice, and 77% of extensively misassembled contigs and 80% of locally misassembled contigs in budgerigar.

Availability and implementation: misSEQuel can be used as a post-processing step in combination with any genome assembler and is freely available at http://www.cs.colostate.edu/seq/.

Contact: muggli@cs.colostate.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
