Inferring viral quasispecies spectra from 454 pyrosequencing reads

Background: RNA viruses infecting a host usually exist as a set of closely related sequences, referred to as quasispecies. The genomic diversity of viral quasispecies is a subject of great interest, particularly for chronic infections, since it can lead to resistance to existing therapies. High-throughput sequencing is a promising approach to characterizing viral diversity, but standard assembly software was designed for single-genome assembly and cannot simultaneously assemble multiple closely related quasispecies sequences and estimate their abundances.

Results: In this paper, we introduce a new Viral Spectrum Assembler (ViSpA) method for quasispecies spectrum reconstruction and compare it with the state-of-the-art ShoRAH tool on both simulated and real 454 pyrosequencing shotgun reads from HCV and HIV quasispecies. Experimental results show that ViSpA outperforms ShoRAH on simulated error-free reads, correctly assembling 10 out of 10 quasispecies sequences and 29 out of 40 quasispecies sequences. While ShoRAH has a significant advantage over ViSpA on reads simulated with sequencing errors, owing to its advanced error correction algorithm, ViSpA is better at assembling the simulated reads after they have been corrected by ShoRAH. ViSpA also outperforms ShoRAH on real 454 reads. The 7 most frequent sequences reconstructed by ViSpA from a real HCV dataset are viable (they contain no internal stop codons), and the most frequent sequence was within 1% of the actual open reading frame obtained by cloning and Sanger sequencing. In contrast, only one of the sequences reconstructed by ShoRAH is viable. On a real HIV dataset, ShoRAH correctly inferred only 2 quasispecies sequences with at most 4 mismatches, whereas ViSpA correctly reconstructed 5 quasispecies sequences with at most 2 mismatches, 2 of them without any mismatches. ViSpA source code is available at http://alla.cs.gsu.edu/~software/VISPA/vispa.html.

Conclusions: ViSpA enables accurate viral quasispecies spectrum reconstruction from 454 pyrosequencing reads. We are currently exploring extensions applicable to the analysis of high-throughput sequencing data from bacterial metagenomic samples and ecological samples of eukaryote populations.
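The two validation criteria quoted in the results, viability (no internal stop codons) and the number of mismatches against a known sequence, can be illustrated with a minimal Python sketch. This is not ViSpA code: the function names, the assumption of the standard genetic code, the assumption that the sequence is already in the correct reading frame, and the choice to treat the last complete codon as the permitted terminal codon are all made here purely for illustration.

```python
# Illustrative sketch only; names and conventions are assumptions, not ViSpA's API.

# Stop codons of the standard genetic code (DNA alphabet).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def internal_stop_positions(seq, frame=0):
    """Return 0-based start positions of stop codons that occur before the
    last complete codon of `seq`, read in the given frame."""
    seq = seq.upper().replace("U", "T")  # accept RNA input as well
    starts = list(range(frame, len(seq) - 2, 3))
    # The last complete codon is allowed to be a stop, so it is skipped.
    return [i for i in starts[:-1] if seq[i:i + 3] in STOP_CODONS]


def is_viable(seq, frame=0):
    """A sequence counts as viable here if it has no internal stop codon."""
    return not internal_stop_positions(seq, frame)


def count_mismatches(a, b):
    """Number of differing positions between two equal-length, aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    print(is_viable("ATGAAATAA"))                   # only a terminal stop codon
    print(internal_stop_positions("ATGTAAAAATGA"))  # stop codon inside the frame
    print(count_mismatches("ACGT", "ACGA"))
```

In practice a reconstructed sequence would first need to be aligned to the reference and placed in frame; the sketch deliberately leaves that step out.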
