FlowClus: efficiently filtering and denoising pyrosequenced amplicons

Background: Reducing the effects of sequencing errors and PCR artifacts has emerged as an essential component in amplicon-based metagenomic studies. Denoising algorithms have been designed that can reduce error rates in mock community data, but they change the sequence data in a manner that can be inconsistent with the process of removing errors in studies of real communities. In addition, they are limited by the size of the dataset and the sequencing technology used.

Results: FlowClus uses a systematic approach to filter and denoise reads efficiently. When denoising real datasets, FlowClus provides feedback about the process that can be used as the basis to adjust the parameters of the algorithm to suit the particular dataset. When used to analyze a mock community dataset, FlowClus produced a lower error rate compared to other denoising algorithms, while retaining significantly more sequence information. Among its other attributes, FlowClus can analyze longer reads being generated from all stages of 454 sequencing technology, as well as from Ion Torrent. It has processed a large dataset of 2.2 million GS-FLX Titanium reads in twelve hours; using its more efficient (but less precise) trie analysis option, this time was further reduced, to seven minutes.

Conclusions: Many of the amplicon-based metagenomics datasets generated over the last several years have been processed through a denoising pipeline that likely caused deleterious effects on the raw data. By using FlowClus, one can avoid such negative outcomes while maintaining control over the filtering and denoising processes. Because of its efficiency, FlowClus can be used to re-analyze multiple large datasets together, thereby leading to more standardized conclusions. FlowClus is freely available on GitHub (jsh58/FlowClus); it is written in C and supported on Linux.
