Error filtering, pair assembly and error correction for next-generation sequencing reads

Motivation: Next-generation sequencing produces vast amounts of data with errors that are difficult to distinguish from true biological variation when coverage is low.

Results: We demonstrate large reductions in error frequencies, especially for high-error-rate reads, by three independent means: (i) filtering reads according to their expected number of errors, (ii) assembling overlapping read pairs and (iii) for amplicon reads, by exploiting unique sequence abundances to perform error correction. We also show that most published paired read assemblers calculate incorrect posterior quality scores.

Availability and implementation: These methods are implemented in the USEARCH package. Binaries are freely available at http://drive5.com/usearch.

Contact: robert@drive5.com

Supplementary information: Supplementary data are available at Bioinformatics online.
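As an illustration of method (i), the expected number of errors in a read can be taken as the sum of its per-base error probabilities, where a Phred score Q corresponds to an error probability of 10^(-Q/10). The following is a minimal sketch under stated assumptions, not the USEARCH implementation: the function names, the Phred+33 quality encoding and the threshold of 1.0 expected errors are all illustrative choices that the abstract does not specify.

```python
def phred33_to_quals(qual_string):
    """Decode a FASTQ quality string, assuming Phred+33 (ASCII offset 33)."""
    return [ord(c) - 33 for c in qual_string]


def expected_errors(quals):
    """Sum of per-base error probabilities, P = 10^(-Q/10) for Phred score Q."""
    return sum(10 ** (-q / 10) for q in quals)


def passes_filter(qual_string, max_ee=1.0):
    """Keep a read if its expected error count is at most max_ee.

    max_ee=1.0 is an assumed threshold chosen only for illustration.
    """
    return expected_errors(phred33_to_quals(qual_string)) <= max_ee


if __name__ == "__main__":
    high_quality = "I" * 100  # Q40 at every base, about 0.01 expected errors
    low_quality = "+" * 12    # Q10 at every base, about 1.2 expected errors
    print(passes_filter(high_quality))
    print(passes_filter(low_quality))
```

Because the criterion sums probabilities over the whole read, it penalizes long runs of mediocre bases that a filter based on average quality score could let through.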
