Measurement error and variant-calling in deep Illumina sequencing of HIV

Motivation: Next-generation deep sequencing of viral genomes, particularly on the Illumina platform, is increasingly applied in HIV research. Yet there is no standard protocol or method used by the research community to account for measurement errors that arise during sample preparation and sequencing. Correctly calling high- and low-frequency variants while controlling for erroneous variant calls is an important precursor to downstream interpretation, such as studying the emergence of HIV drug-resistance mutations, which in turn has clinical applications and can improve patient care.

Results: We developed a new variant-calling pipeline, hivmmer, for Illumina sequences from HIV viral genomes. First, we validated hivmmer by comparing it to other variant-calling pipelines on real HIV plasmid data sets, which have known sequences. We found that hivmmer achieves a lower rate of erroneous variant calls, and that all methods agree on the frequency of correctly called variants. Next, we compared the methods on an HIV plasmid data set that was sequenced using an amplicon-tagging protocol called Primer ID, which is designed to reduce errors and amplification bias during library preparation. We show that the Primer ID consensus does indeed have fewer erroneous variant calls than the variant-calling pipelines, and that hivmmer approaches this low error rate more closely than the other pipelines do. Surprisingly, the frequency estimates from the Primer ID consensus do not differ significantly from those of the variant-calling pipelines. Finally, we built a predictive model for classifying errors in the hivmmer alignment, and show that it achieves high accuracy for identifying erroneous variant calls.

Availability: hivmmer is freely available for non-commercial use from https://github.com/mhowison/hivmmer.

Contact: mhowison@brown.edu
