Genome-scale de novo assembly using ALGA

Abstract

Motivation: Few methods for de novo genome assembly are based on the overlap graph approach. This approach is considered to give more exact results than the de Bruijn graph approach, but at the cost of much longer run time and much higher memory usage. Assembly methods based on the overlap graph model often cannot process larger datasets, mainly because of the memory limits of the computer. For this reason, most assemblers developed in recent decades are de Bruijn-based, fast and fairly accurate. However, these methods can fail for longer or more repetitive genomes, because they decompose reads into shorter fragments and lose part of the information. An efficient assembler that uses the overlap graph model and can process big datasets is still needed.

Results: We propose a new genome-scale de novo assembler based on the overlap graph approach, designed for short-read sequencing data. The method, ALGA, incorporates several new ideas that yield more exact contigs in a short time. These ideas include the creation of a sparse but still informative graph, a graph reduction that includes a procedure related to the minimum spanning tree problem on a local subgraph, and a graph traversal combined with simultaneous analysis of the contigs stored so far. Unusually for genome assembly, the algorithm is almost parameter-free, with only one optional parameter to be set by the user. ALGA was compared with nine state-of-the-art assemblers in tests on genome-scale sequencing data obtained from real experiments on six organisms that differ in size, coverage, GC content and repetition rate. ALGA produced the best results in terms of overall quality of genome reconstruction, understood as a good balance between genome coverage, accuracy and length of the resulting sequences. The algorithm is one of the tools used for data processing in the ongoing national project Genomic Map of Poland.

Availability and implementation: ALGA is available at http://alga.put.poznan.pl.

Supplementary information: Supplementary data are available at Bioinformatics online.
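
To make the contrast between the two graph models concrete, the following toy Python sketch builds a naive overlap graph from three short reads and, separately, decomposes the same reads into k-mers as a de Bruijn-based method would. It is an illustration only and not ALGA's implementation: the function names (`longest_overlap`, `overlap_graph`, `kmers`), the example reads, and the minimum overlap length are all assumptions made here, and the quadratic all-pairs comparison is exactly the kind of cost that a genome-scale assembler has to avoid.

```python
from itertools import permutations


def longest_overlap(a, b, min_len):
    """Length of the longest proper suffix of `a` that equals a prefix of `b`.

    Returns 0 when no overlap of at least `min_len` characters exists.
    (Hypothetical helper for this sketch.)
    """
    for k in range(min(len(a), len(b)) - 1, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def overlap_graph(reads, min_len):
    """Naive overlap graph: one vertex per read, one arc per suffix-prefix overlap.

    Arcs are returned as a dict mapping (read_a, read_b) to the overlap length.
    Whole reads stay intact as vertices, so no read-level information is lost.
    """
    graph = {}
    for a, b in permutations(reads, 2):
        k = longest_overlap(a, b, min_len)
        if k:
            graph[(a, b)] = k
    return graph


def kmers(read, k):
    """Set of k-mers of a read, as used when building a de Bruijn graph."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


# Assumed toy data: three reads sampled from the sequence ACGTACGGTT.
reads = ["ACGTAC", "GTACGG", "ACGGTT"]
graph = overlap_graph(reads, min_len=3)

for (a, b), k in graph.items():
    print(f"{a} -> {b} (overlap {k})")

# The same reads decomposed into 3-mers: the k-mer ACG occurs in all three
# reads, and once reads are split this way the graph no longer records
# which read each occurrence came from.
shared = kmers(reads[0], 3) & kmers(reads[1], 3) & kmers(reads[2], 3)
print("3-mers shared by all reads:", sorted(shared))
```

In the overlap graph each arc links two complete reads, whereas the k-mer decomposition merges identical fragments from different reads into one vertex. This is the loss of information that the abstract attributes to de Bruijn-based methods on repetitive genomes.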
