CATCh, an Ensemble Classifier for Chimera Detection in 16S rRNA Sequencing Studies

ABSTRACT In ecological studies, microbial diversity is now mostly assessed by detecting phylogenetic marker genes, such as the 16S rRNA gene. However, PCR amplification of these marker genes produces a significant number of artificial sequences, often referred to as chimeras. Different algorithms have been developed to remove these chimeras, but efforts to combine different methodologies are limited. We therefore developed two machine learning classifiers (reference-based and de novo CATCh) that integrate the output of existing chimera detection tools into a new, more powerful method. Compared with existing tools in either the reference-based or de novo mode, our ensemble method performed better on a wide range of sequencing data, including simulated, 454 pyrosequencing, and Illumina MiSeq data sets. Because our algorithm combines the advantages of the individual chimera detection tools, it produces more robust results when challenged with chimeric sequences that have a low parent divergence, a short chimeric range, and various numbers of parents. Additionally, integrating CATCh into the preprocessing pipeline improved the quality of the clustering into operational taxonomic units.
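
The ensemble idea described above, feeding the outputs of several existing chimera detectors into one learned classifier, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three score columns stand for unnamed, hypothetical detectors, the training values are invented, and a plain logistic regression is swapped in as the learner because the abstract does not specify which classifier CATCh uses.

```python
import math

# Hypothetical scores from three existing detectors (one column each) with an
# invented label: 1 = chimeric, 0 = non-chimeric. These are not real data.
TRAIN = [
    ((0.9, 0.8, 0.7), 1),
    ((0.8, 0.2, 0.9), 1),
    ((0.3, 0.9, 0.8), 1),
    ((0.7, 0.7, 0.1), 1),
    ((0.1, 0.2, 0.1), 0),
    ((0.2, 0.1, 0.3), 0),
    ((0.6, 0.1, 0.2), 0),
    ((0.1, 0.3, 0.6), 0),
]


def sigmoid(z):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train(rows, epochs=2000, lr=0.5):
    """Fit logistic-regression weights by stochastic gradient descent."""
    weights = [0.0] * len(rows[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in rows:
            pred = sigmoid(bias + sum(w * x for w, x in zip(weights, features)))
            err = pred - label  # gradient of the log loss w.r.t. the linear term
            weights = [w - lr * err * x for w, x in zip(weights, features)]
            bias -= lr * err
    return weights, bias


def predict_chimera(model, features, threshold=0.5):
    """Return True when the combined evidence exceeds the threshold."""
    weights, bias = model
    p = sigmoid(bias + sum(w * x for w, x in zip(weights, features)))
    return p >= threshold


MODEL = train(TRAIN)

if __name__ == "__main__":
    # One detector is unsure, but the other two agree strongly.
    print(predict_chimera(MODEL, (0.9, 0.4, 0.9)))
```

The point of the sketch is the design choice rather than the learner: each individual tool contributes one feature, so a sequence that a single detector misses can still be flagged when the others provide enough evidence.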
