Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads

The Illumina DNA sequencing platform generates accurate but short reads, which can be used to produce accurate but fragmented genome assemblies. Pacific Biosciences and Oxford Nanopore Technologies DNA sequencing platforms generate long reads that can produce complete genome assemblies, but the sequencing is more expensive and error-prone. There is significant interest in combining data from these complementary sequencing technologies to generate more accurate “hybrid” assemblies. However, few tools exist that truly leverage the benefits of both types of data, namely the accuracy of short reads and the structural resolving power of long reads. Here we present Unicycler, a new tool for assembling bacterial genomes from a combination of short and long reads, which produces assemblies that are accurate, complete and cost-effective. Unicycler builds an initial assembly graph from short reads using the de novo assembler SPAdes and then simplifies the graph using information from short and long reads. Unicycler uses a novel semi-global aligner to align long reads to the assembly graph. Tests on both synthetic and real reads show Unicycler can assemble larger contigs with fewer misassemblies than other hybrid assemblers, even when long-read depth and accuracy are low. Unicycler is open source (GPLv3) and available at github.com/rrwick/Unicycler.
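
To illustrate what semi-global alignment means in general, the following is a minimal textbook sketch, not Unicycler's aligner. It is a plain dynamic-programming scorer in which gaps at the start and end of either sequence are free, so a read can sit inside a longer sequence or overlap its end without penalty. The function name `semi_global_score` and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration, and alignment here is against a linear string rather than an assembly graph.

```python
def semi_global_score(read, ref, match=1, mismatch=-1, gap=-1):
    """Best semi-global alignment score between two strings.

    Leading and trailing gaps are free in both sequences: the first
    row and column start at 0, and the result is the best value found
    anywhere in the last row or last column of the DP matrix.
    """
    n, m = len(read), len(ref)
    prev = [0] * (m + 1)          # row 0: skipping a prefix of ref is free
    best = prev[m]                # last-column cell of row 0
    for i in range(1, n + 1):
        curr = [0] * (m + 1)      # column 0: skipping a prefix of read is free
        for j in range(1, m + 1):
            sub = match if read[i - 1] == ref[j - 1] else mismatch
            curr[j] = max(
                prev[j - 1] + sub,   # align read[i-1] with ref[j-1]
                prev[j] + gap,       # gap in ref
                curr[j - 1] + gap,   # gap in read
            )
        best = max(best, curr[m])    # alignment may end when ref is exhausted
        prev = curr
    return max(best, max(prev))      # or when read is exhausted (last row)


if __name__ == "__main__":
    # Read contained within a longer sequence: flanking bases cost nothing.
    print(semi_global_score("ACGT", "TTACGTTT"))
    # Read suffix overlapping the start of the sequence.
    print(semi_global_score("GGGACGT", "ACGTCCC"))
```

A global alignment would penalise the unaligned flanks in both examples, and a local alignment could stop anywhere; the semi-global form only lets an alignment stop where one of the two sequences runs out.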
