Graph analysis of fragmented long-read bacterial genome assemblies

MOTIVATION: Long-read genome assembly tools are expected to reconstruct bacterial genomes nearly perfectly; however, they still produce fragmented assemblies in some cases. It would be beneficial to understand whether these cases are intrinsically impossible to resolve or whether the assemblers are at fault, which would imply that genomes could be refined or even finished at little to no additional experimental cost.

RESULTS: We propose a set of computational techniques to assist the inspection of fragmented bacterial genome assemblies through careful analysis of assembly graphs. By finding paths of overlapping raw reads between pairs of contigs, we recover potential short-range connections between contigs that were lost during the assembly process. We show that our procedure recovers 45% of missing contig adjacencies in fragmented Canu assemblies of samples from the NCTC bacterial sequencing project. We also observe that a simple procedure based on enumerating weighted Hamiltonian cycles can suggest likely contig orderings. In our tests, the correct contig order is ranked first in half of the cases and within the top three predictions in nearly all evaluated cases, providing a direction for finishing fragmented long-read assemblies.

AVAILABILITY: https://gitlab.inria.fr/pmarijon/knot

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
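The idea of ranking contig orderings by enumerating weighted Hamiltonian cycles can be illustrated with a minimal Python sketch. This is not the KNOT implementation: the function name `ranked_hamiltonian_cycles`, the contig names, and the weights are hypothetical. It also simplifies the problem by treating each contig as a single node (ignoring orientation and contig extremities) and assumes that a lower weight, for example a shorter path of overlapping reads between two contigs, indicates a more plausible adjacency.

```python
from itertools import permutations


def ranked_hamiltonian_cycles(nodes, weight):
    """Enumerate Hamiltonian cycles of an undirected weighted graph.

    nodes  : list of contig names (hypothetical identifiers)
    weight : dict mapping frozenset({a, b}) to the cost of joining a and b;
             a missing key means no connection was found between a and b
    Returns a list of (total_weight, order) sorted by increasing weight.
    """
    first, rest = nodes[0], nodes[1:]
    cycles = []
    # Fixing the first node removes rotations of the same cycle.
    for perm in permutations(rest):
        # An undirected cycle and its mirror image are the same ordering;
        # keep only the one whose second node is not greater than its last.
        if len(perm) > 1 and perm[0] > perm[-1]:
            continue
        order = (first,) + perm
        total = 0
        valid = True
        # Walk consecutive pairs, closing the cycle back to the first node.
        for a, b in zip(order, order[1:] + (first,)):
            w = weight.get(frozenset((a, b)))
            if w is None:
                valid = False
                break
            total += w
        if valid:
            cycles.append((total, order))
    cycles.sort()
    return cycles


# Toy example with four contigs; all weights are made up for illustration.
weights = {
    frozenset(("c1", "c2")): 10,
    frozenset(("c2", "c3")): 20,
    frozenset(("c3", "c4")): 30,
    frozenset(("c4", "c1")): 5,
    frozenset(("c1", "c3")): 500,
    frozenset(("c2", "c4")): 800,
}
ranking = ranked_hamiltonian_cycles(["c1", "c2", "c3", "c4"], weights)
for total, order in ranking:
    print(total, " -> ".join(order))
```

Brute-force enumeration grows factorially with the number of contigs, so a sketch like this is only reasonable for the small contig counts typical of a fragmented bacterial assembly.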
