Non-biological synthetic spike-in controls and the AMPtk software pipeline improve fungal high throughput amplicon sequencing data

High throughput amplicon sequencing (HTAS) of conserved regions of DNA has emerged as a powerful technique to characterize environmental communities. Depending on the taxa of interest, several molecular markers have been developed for use as HTAS templates: small subunit (16S) rRNA, large subunit (LSU) rRNA, internal transcribed spacer (ITS) rRNA, or mitochondrial cytochrome oxidase 1 (mtCO1). A major challenge in analyzing HTAS experimental data is differentiating sequencing errors from real biological sequence variation. One strategy that has been employed is the inclusion of spike-in mock communities as a tool to measure the accuracy of the sequencing platform and subsequent data processing. HTAS for identification of fungi via the ITS region requires a pipeline that can accurately deal with amplicons of variable length, variable GC content, and large homopolymer stretches. To assess the performance of sequencing platforms and data processing pipelines on fungal ITS amplicons, we created two fungal ITS spike-in control mock communities composed of single copy plasmid DNA: the biological mock community (BioMock) consists of cloned ITS sequences, whereas the synthetic mock community (SynMock) consists of non-biological ITS-like sequences. Using these spike-in controls, we show that pre-clustering steps for variable length amplicons are critically important. Additionally, a major source of bias is attributable to the primary DNA sequence of each community member as well as to the initial PCR reaction, indicating that read abundances are not representative of fungal communities. These data indicate that using non-biological ITS-like sequences as a spike-in control allows the accuracy of data processing to be estimated and index-bleed (index-hopping or barcode crossover) between samples to be identified.
Finally, we developed AMPtk (amplicon toolkit), a versatile software solution that handles variable length amplicons and includes a built-in method to quality filter HTAS data based on spike-in controls. While we describe herein a non-biological (synthetic) mock community for ITS sequences, the concept can be widely applied to any HTAS dataset.
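
The idea of filtering on a spike-in control can be illustrated with a minimal Python sketch. It is not AMPtk's implementation; the function names, the table layout (sample name mapped to per-OTU read counts), the example counts, and the per-sample cutoff rule are all assumptions made for illustration. Because synthetic sequences cannot occur in real samples, any non-synthetic reads found in the spike-in sample are taken here as an estimate of the index-bleed rate, and counts at or below that fraction of a sample's reads are set to zero.

```python
def index_bleed_rate(mock_counts, mock_otus):
    """Fraction of reads in the spike-in sample assigned to non-spike-in OTUs."""
    total = sum(mock_counts.values())
    if total == 0:
        return 0.0
    stray = sum(n for otu, n in mock_counts.items() if otu not in mock_otus)
    return stray / total


def filter_table(table, rate):
    """Zero every count that does not exceed rate * (total reads in that sample)."""
    filtered = {}
    for sample, counts in table.items():
        cutoff = rate * sum(counts.values())
        filtered[sample] = {
            otu: (n if n > cutoff else 0) for otu, n in counts.items()
        }
    return filtered


# Hypothetical OTU table: sample -> {OTU -> read count}
table = {
    "synmock": {"mock1": 4950, "mock2": 5000, "OTU1": 50},
    "soil": {"mock1": 5, "OTU1": 8000, "OTU2": 1995},
}
mock_otus = {"mock1", "mock2"}  # OTUs known to be synthetic spike-ins

rate = index_bleed_rate(table["synmock"], mock_otus)
filtered = filter_table(table, rate)
print(f"estimated index-bleed rate: {rate:.4f}")
print(filtered["soil"])
```

In this toy table the five synthetic reads that appear in the soil sample fall below the cutoff and are removed, while the genuine soil OTUs are retained.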
