V-Phaser 2: variant inference for viral

Background: Massively parallel sequencing could revolutionize the study of viral populations by providing ultra-deep sequencing (tens to hundreds of thousands-fold coverage) of complete viral genomes. However, differentiating true low-frequency variants from sequencing errors remains challenging.

Results: We developed a software package, V-Phaser 2, for inferring intrahost diversity within viral populations. The program adds three major new methodologies to the state of the art: a technique to efficiently use paired-end read data for calling phased variants, a new strategy to represent and infer length polymorphisms, and an inline filter for erroneous calls arising from systematic sequencing artifacts. We have also heavily optimized memory and run-time performance. This combination of algorithmic and technical advances allows V-Phaser 2 to fully exploit extremely deep paired-end sequencing data (such as that generated by Illumina sequencers) and to accurately infer low-frequency intrahost variants in viral populations in reasonable time on a standard desktop computer. V-Phaser 2 was validated and compared to both QuRe and the original V-Phaser on three datasets obtained from two viral populations: a mixture of eight known strains of West Nile Virus (WNV), sequenced on both 454 Titanium and Illumina MiSeq, and a mixture of twenty-four known strains of WNV, sequenced only on 454 Titanium. V-Phaser 2 outperformed the other two programs in both sensitivity and specificity while using more than fivefold less time and memory.

Conclusions: V-Phaser 2 is a publicly available software tool that enables the efficient analysis of ultra-deep sequencing data produced by common next-generation sequencing platforms for viral populations. It can be accessed at http://www.broadinstitute.org/scientific-community/science/projects/viral-genomics/v-phaser-2 and is freely available for academic use.
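To make the central problem concrete, the following minimal sketch shows one generic way to ask whether a low-frequency variant at a single site is distinguishable from sequencing error. It is not V-Phaser 2's model: it assumes a single uniform per-base error rate and a plain binomial tail test with a Bonferroni correction, and the names `binomial_tail` and `is_likely_real`, the default error rate of 0.001, and the threshold `alpha` are all hypothetical choices made for illustration.

```python
import math


def binomial_tail(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1.

    Terms are summed in log space so that deep coverage (large n)
    does not overflow or underflow.
    """
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    log_p, log_q = math.log(p), math.log1p(-p)
    log_terms = [
        math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
        + i * log_p + (n - i) * log_q
        for i in range(k, n + 1)
    ]
    peak = max(log_terms)
    total = math.exp(peak) * sum(math.exp(t - peak) for t in log_terms)
    return min(1.0, total)


def is_likely_real(variant_reads: int, depth: int,
                   error_rate: float = 0.001,  # assumed uniform error rate
                   alpha: float = 0.05,        # assumed significance level
                   n_tests: int = 1) -> bool:
    """Flag a variant whose read count is unlikely under errors alone.

    n_tests applies a Bonferroni correction for the number of
    site/allele combinations examined.
    """
    p_value = binomial_tail(variant_reads, depth, error_rate)
    return p_value * n_tests < alpha


if __name__ == "__main__":
    depth = 10_000  # reads covering the site
    # About 10 error reads are expected at this depth and error rate.
    print(is_likely_real(12, depth))                 # close to the error expectation
    print(is_likely_real(50, depth, n_tests=30_000)) # far above the error expectation
```

At a coverage of 10,000 with the assumed error rate, roughly ten reads per site are expected to carry any given error, so a variant supported by twelve reads is not separable from noise under this simple model, while one supported by fifty reads is. Real data violate the uniform-error assumption, which is why systematic artifacts need separate handling.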
