TGS-GapCloser: A fast and accurate gap closer for large genomes with low coverage of error-prone long reads

Abstract

Background: Analyses that use genome assemblies are critically affected by the contiguity, completeness, and accuracy of those assemblies. In recent years, single-molecule sequencing techniques that generate long reads have become available and have enabled substantial improvements in contig length and genome completeness, especially for large genomes (>100 Mb). Bioinformatic tools for these applications, however, remain limited.

Findings: We developed TGS-GapCloser, a software tool that closes sequence gaps in genome assemblies using low-depth (∼10×) long single-molecule reads. The algorithm extracts reads that bridge the gap region between 2 contigs within a scaffold, error-corrects only those candidate reads, and assigns the best sequence to each gap. As a demonstration, we used TGS-GapCloser to improve the scaftig NG50 of 3 human genome assemblies by 24-fold on average with only ∼10× coverage of Oxford Nanopore or Pacific Biosciences reads, filling up to 94.8% of gaps with sequence data at a positive predictive value of 97.7%. Despite the high raw error rate of single-molecule reads, the improved assemblies achieve 99.998% (Q46) single-base accuracy, and the final inserted sequences have 99.97% (Q35) accuracy. This enables high-quality downstream analyses, including up to a 31-fold increase in the scaftig NGA50 and up to 13.1% more complete BUSCO genes. Additionally, we show that even in ultra-large genome assemblies, such as that of the ginkgo (∼12 Gb), TGS-GapCloser can fill 71.6% of gaps with sequence data.

Conclusions: TGS-GapCloser can close gaps in large genome assemblies using raw long reads quickly and cost-effectively. The final assemblies generated by TGS-GapCloser have improved contiguity and completeness while maintaining high accuracy. The software is available at https://github.com/BGI-Qingdao/TGS-GapCloser.
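The scaftig NG50 reported in the Findings can be illustrated with a minimal sketch. It assumes the common definitions: a scaftig is a gap-free stretch of a scaffold (split at runs of N), and NG50 is the length of the scaftig at which the cumulative length, counted from the longest scaftig down, first reaches half of an estimated genome size. The function names `scaftigs` and `ng50` and the toy sequences are hypothetical and are not taken from the TGS-GapCloser code.

```python
import re

def scaftigs(scaffold, min_gap=1):
    """Split a scaffold into gap-free pieces at runs of >= min_gap N characters."""
    pattern = r"[Nn]{%d,}" % min_gap
    return [piece for piece in re.split(pattern, scaffold) if piece]

def ng50(lengths, genome_size):
    """Length of the piece at which the cumulative length, taken from the
    longest piece down, first reaches half of genome_size (0 if never)."""
    covered = 0
    for length in sorted(lengths, reverse=True):
        covered += length
        if covered * 2 >= genome_size:
            return length
    return 0

# Toy scaffold: three scaftigs (20, 8, and 4 bases) separated by two gaps.
scaffold = "ACGT" * 5 + "N" * 10 + "ACGT" * 2 + "N" * 3 + "A" * 4
before = ng50([len(s) for s in scaftigs(scaffold)], genome_size=40)

# Pretend the first gap was closed with a 4-base insert (illustrative only).
closed = scaffold.replace("N" * 10, "ACGT")
after = ng50([len(s) for s in scaftigs(closed)], genome_size=40)
```

In this toy case, closing one gap merges the 20-base and 8-base scaftigs with the insert into a single 32-base scaftig, so the scaftig NG50 rises from 20 to 32. Real assemblies would be read from FASTA files, and the genome size would come from an independent estimate.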
