Non-biological synthetic spike-in controls and the AMPtk software pipeline improve mycobiome data

High-throughput amplicon sequencing (HTAS) of conserved DNA regions is a powerful technique to characterize microbial communities. Recently, spike-in mock communities have been used to measure the accuracy of sequencing platforms and data analysis pipelines. To assess the performance of sequencing platforms and data processing pipelines on fungal internal transcribed spacer (ITS) amplicons, we created two ITS spike-in control mock communities composed of cloned DNA in plasmids: a biological mock community, consisting of ITS sequences from fungal taxa, and a synthetic mock community (SynMock), consisting of non-biological ITS-like sequences. Using these spike-in controls we show that: (1) a non-biological synthetic control (e.g., SynMock) is the best solution for parameterizing bioinformatics pipelines, (2) pre-clustering steps for variable length amplicons are critically important, and (3) a major source of bias is attributed to the initial polymerase chain reaction (PCR), and thus HTAS read abundances are typically not representative of starting values. We developed AMPtk, a versatile software solution equipped to deal with variable length amplicons and to quality filter HTAS data based on spike-in controls. While we describe herein a non-biological SynMock community for ITS sequences, the concept and the AMPtk software can be widely applied to any HTAS dataset to improve data quality.
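As a rough illustration of what quality filtering based on a spike-in control can look like, the following minimal Python sketch estimates how many reads land in the wrong sample and uses that rate as a per-sample cutoff. It is not AMPtk's implementation: the function names (`index_bleed_rate`, `filter_table`), the nested-dictionary table layout, and the rule of zeroing counts at or below the cutoff are all assumptions made for this example.

```python
def index_bleed_rate(table, mock_sample, mock_otus):
    """Estimate the fraction of reads assigned to the wrong sample.

    table: {sample name: {OTU name: read count}} (hypothetical layout)
    mock_sample: name of the sample that contains only the spike-in control
    mock_otus: set of OTU names belonging to the spike-in control
    """
    mock_counts = table[mock_sample]

    # Bleed into the control: reads in the mock sample that are not mock OTUs.
    total_in_mock = sum(mock_counts.values())
    foreign = sum(c for otu, c in mock_counts.items() if otu not in mock_otus)
    bleed_in = foreign / total_in_mock if total_in_mock else 0.0

    # Bleed out of the control: mock OTU reads found in any other sample.
    all_mock_reads = sum(
        counts.get(otu, 0) for counts in table.values() for otu in mock_otus
    )
    kept = sum(mock_counts.get(otu, 0) for otu in mock_otus)
    bleed_out = (all_mock_reads - kept) / all_mock_reads if all_mock_reads else 0.0

    # Use the larger of the two estimates as a conservative rate.
    return max(bleed_in, bleed_out)


def filter_table(table, rate):
    """Zero every count that does not exceed rate * (reads in that sample)."""
    filtered = {}
    for sample, counts in table.items():
        cutoff = rate * sum(counts.values())
        filtered[sample] = {
            otu: (c if c > cutoff else 0) for otu, c in counts.items()
        }
    return filtered


# Toy data, invented for illustration only.
toy_table = {
    "mock": {"syn1": 490, "syn2": 500, "otuA": 6, "otuB": 4},
    "soil": {"syn1": 5, "otuA": 900, "otuB": 95},
}
rate = index_bleed_rate(toy_table, "mock", {"syn1", "syn2"})
cleaned = filter_table(toy_table, rate)
```

Because the synthetic sequences cannot occur in real samples, any synthetic read found outside the control sample can be attributed to cross-sample artifacts rather than to biology, which is what makes this kind of estimate possible.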
