ViQUF: De Novo Viral Quasispecies Reconstruction Using Unitig-Based Flow Networks

During viral infection, intrahost mutation and recombination can lead to significant evolution, resulting in a population of viruses that harbors multiple haplotypes. The task of reconstructing these haplotypes from short-read sequencing data is called viral quasispecies assembly, and it can be categorized as a multiassembly problem. We consider the de novo version of the problem, where no reference genome is available. We present ViQUF, a de novo viral quasispecies assembler that addresses both haplotype assembly and quantification. ViQUF first obtains a draft assembly graph from a de Bruijn graph. Next, for each pair of adjacent vertices, it builds a flow network from their paired-end information and solves a min-cost flow problem over it. This produces an approximate paired assembly graph whose edge labels are suggested frequency values, which serve as the first frequency estimate. Finally, the original haplotypes are recovered by a greedy path reconstruction guided by a min-cost flow solution on the approximate paired assembly graph. ViQUF outputs the contigs together with their frequency estimates. Results on real and simulated data show that ViQUF is at least four times faster than previous methods and uses at most half of their memory, while matching, and in some cases exceeding, the assembly and frequency-estimation quality of overlap graph-based methods, which are known to be more accurate but slower than de Bruijn graph-based approaches.

Availability: ViQUF is freely available at https://github.com/borjaf696/ViQUF
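The final step described above, turning a flow on the assembly graph into haplotype paths with frequencies, can be illustrated with a small sketch. This is not ViQUF's implementation: it assumes the flow values are already given (as if a min-cost flow solver had produced them), it uses a generic greedy widest-path decomposition, and the function names `widest_path` and `greedy_decompose`, the toy graph, and its node labels are all hypothetical.

```python
import heapq


def widest_path(flow, source, sink):
    """Return (bottleneck, path) for the source-to-sink path whose smallest
    edge flow is as large as possible; (0, []) if no such path exists."""
    # Adjacency list restricted to edges that still carry positive flow.
    adj = {}
    for (u, v), f in flow.items():
        if f > 0:
            adj.setdefault(u, []).append((v, f))

    best = {source: float("inf")}  # best bottleneck known per node
    parent = {}
    heap = [(-best[source], source)]  # max-heap via negated keys
    while heap:
        neg_b, u = heapq.heappop(heap)
        b = -neg_b
        if b < best.get(u, 0):
            continue  # stale heap entry
        if u == sink:
            break
        for v, f in adj.get(u, []):
            cand = min(b, f)
            if cand > best.get(v, 0):
                best[v] = cand
                parent[v] = u
                heapq.heappush(heap, (-cand, v))

    if sink not in best:
        return 0, []
    path = [sink]
    while path[-1] != source:
        path.append(parent[path[-1]])
    return best[sink], path[::-1]


def greedy_decompose(flow, source, sink):
    """Greedily peel off the widest remaining path until no flow is left."""
    residual = dict(flow)
    paths = []
    while True:
        bottleneck, path = widest_path(residual, source, sink)
        if bottleneck <= 0:
            break
        for u, v in zip(path, path[1:]):
            residual[(u, v)] -= bottleneck
        paths.append((path, bottleneck))
    return paths


# Toy graph (hypothetical): two haplotypes that share the middle node "C".
# The edge values stand in for a flow solution and satisfy flow conservation.
toy_flow = {
    ("S", "A"): 70, ("S", "B"): 30,
    ("A", "C"): 70, ("B", "C"): 30,
    ("C", "D"): 70, ("C", "E"): 30,
    ("D", "T"): 70, ("E", "T"): 30,
}

haplotypes = greedy_decompose(toy_flow, "S", "T")
total = sum(weight for _, weight in haplotypes)
for path, weight in haplotypes:
    print("-".join(path), weight / total)
```

On this toy input the sketch recovers the paths S-A-C-D-T and S-B-C-E-T with weights 70 and 30. Here the shared node "C" is resolved purely by abundance; the abstract indicates that ViQUF additionally relies on paired-end information when building its approximate paired assembly graph, which this sketch does not model.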
