Sequence Accuracy in Primary Databases: A Case Study on HIV-1B

This chapter reviews the history of sequencing methods and their advancement, focusing on the accuracy of sequences deposited in primary public databases. It discusses the sources and frequency of errors, including those introduced during sequencing and sequence assembly, as well as sequence quality. The quality of sequencing pipelines and the error rates of next-generation sequencing (NGS) data are reviewed, along with some tools and techniques for overcoming errors. Sequence uncertainties in primary public databases are examined using HIV-1B sequences as a case study, and sequence ambiguities are highlighted with annotations based on the reference genome (HXB2). Sequences produced by different sequencing technologies contain ambiguities, and it is very difficult to distinguish true variants from errors. This is a cause for concern for data collection efforts and for inferences derived from error-prone DNA-sequencing technologies. Future studies should handle such sequences with caution, especially when analyzing mutations to understand pathogenesis, drug resistance, and geographical variation.
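
As a rough illustration of what a sequence ambiguity looks like in practice, the following Python sketch counts IUPAC ambiguity codes in a nucleotide sequence and reports where they occur. It is not the chapter's method: the function name `ambiguity_report` and the example sequence are hypothetical, and positions are reported relative to the input sequence, not as HXB2 coordinates (mapping to HXB2 would require an alignment step that is not shown).

```python
from collections import Counter

# Standard IUPAC nucleotide ambiguity codes and the bases each may represent.
IUPAC_AMBIGUITY = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def ambiguity_report(sequence):
    """Summarize ambiguous bases in a nucleotide sequence.

    Hypothetical helper for illustration only. Positions are 1-based and
    relative to the input sequence, not to a reference genome.
    """
    seq = sequence.upper().replace("U", "T")  # treat RNA input as DNA
    positions = [i + 1 for i, base in enumerate(seq) if base in IUPAC_AMBIGUITY]
    counts = Counter(seq[p - 1] for p in positions)
    fraction = len(positions) / len(seq) if seq else 0.0
    return {"counts": dict(counts), "positions": positions, "fraction": fraction}


if __name__ == "__main__":
    # Made-up fragment for demonstration; not a real database record.
    example = "ATGGGTGCGAGRGCGTCAGTNTTAAGCGGYGGA"
    report = ambiguity_report(example)
    print(report["counts"], report["positions"], round(report["fraction"], 3))
```

A count like this only flags positions where the depositor recorded uncertainty. It cannot tell whether an unambiguous base is a true variant or an uncorrected sequencing error, which is the harder problem the chapter raises.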
