Illumina error profiles: resolving fine-scale variation in metagenomic sequencing data

Background: Illumina’s sequencing platforms are currently the most widely used sequencing systems worldwide. The technology has evolved rapidly over recent years and provides high throughput at low cost, with increasing read lengths and true paired-end reads. However, data from any sequencing technology contain noise, and our understanding of the peculiarities and sequencing errors encountered in Illumina data has lagged behind this rapid development.

Results: We conducted a systematic investigation of errors and biases in Illumina data based on the largest collection of in vitro metagenomic data sets to date. We evaluated the Genome Analyzer II, HiSeq and MiSeq and tested state-of-the-art low-input library preparation methods. Analysing in vitro metagenomic sequencing data allowed us to determine biases directly associated with the actual sequencing process. The position- and nucleotide-specific analysis revealed a substantial bias related to motifs (3mers preceding errors) ending in “GG”. On average, the top three motifs were linked to 16 % of all substitution errors. Furthermore, a preferential incorporation of ddGTPs was recorded. We hypothesise that all of these biases are related to the engineered polymerase and ddNTPs, which are intrinsic to any sequencing-by-synthesis method. We show that quality-score-based error removal strategies can on average remove 69 % of the substitution errors; however, the motif bias remains.

Conclusion: Single-nucleotide polymorphism changes in bacterial genomes can cause significant changes in phenotype, including antibiotic resistance and virulence; detecting them within metagenomes is therefore vital. Current error removal techniques are not designed to target the peculiarities encountered in data from Illumina and other sequencing-by-synthesis methods, so biases persist and can affect any conclusions drawn from the data. In order to develop effective diagnostic and therapeutic approaches, we need to be able to identify systematic sequencing errors and distinguish them from true genetic variation.
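The motif analysis described in the Results (3mers preceding substitution errors, and the share of errors linked to the top motifs) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names `preceding_motifs` and `top_motif_share` are hypothetical, and the sketch assumes an ungapped alignment in which the read and the reference have equal length and the motif is taken from the reference bases immediately upstream of the mismatch.

```python
from collections import Counter


def preceding_motifs(reference, read, k=3):
    """Count the k-mers that immediately precede each substitution.

    Assumption: `reference` and `read` are already aligned without gaps,
    so position i in one corresponds to position i in the other.
    """
    if len(reference) != len(read):
        raise ValueError("reference and read must have equal length")
    motifs = Counter()
    for i, (ref_base, read_base) in enumerate(zip(reference, read)):
        # A substitution is a position where the read disagrees with the
        # reference; positions without k upstream bases are skipped.
        if ref_base != read_base and i >= k:
            motifs[reference[i - k:i]] += 1
    return motifs


def top_motif_share(motifs, n=3):
    """Fraction of counted substitutions preceded by the n most common motifs."""
    total = sum(motifs.values())
    if total == 0:
        return 0.0
    return sum(count for _, count in motifs.most_common(n)) / total


# Toy example with made-up sequences (not data from the study).
reference = "ACGGTACGGTTTGGA"
read = "ACGGAACGGATTGGC"
motifs = preceding_motifs(reference, read)
share = top_motif_share(motifs, n=1)
```

On real data, the same counts would be accumulated over all aligned reads before the share is computed, and indels would need to be handled through the alignment records rather than by position-wise comparison.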
