MeFiT: merging and filtering tool for Illumina paired-end reads for 16S rRNA amplicon sequencing

Background: Recent advances in next-generation sequencing have revolutionized genomic research. 16S rRNA amplicon sequencing using paired-end reads on the Illumina MiSeq platform is being used to characterize the composition and dynamics of extremely complex and diverse microbial communities. On the Illumina platform, merging and quality filtering of paired-end reads are essential first steps that ensure the accuracy and reliability of downstream analysis.

Results: We have developed the Merging and Filtering Tool (MeFiT) to combine these pre-processing steps into one simple, intuitive pipeline. MeFiT invokes CASPER (context-aware scheme for paired-end reads) to merge paired-end reads and gives users the option to quality filter the merged reads using either the traditional average Q-score metric or a maximum expected error cut-off threshold.

Conclusions: MeFiT provides an open-source solution that permits users to merge and filter paired-end Illumina reads. The tool is implemented in Python, and the source code is freely available at https://github.com/nisheth/MeFiT.
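The two filtering metrics named in the abstract can behave differently on the same read. The following is a minimal sketch and is not taken from the MeFiT source. It assumes Phred+33 quality encoding, and the function names and thresholds (`min_mean_q`, `max_ee`) are hypothetical. The expected error of a read is taken to be the sum of its per-base error probabilities, 10^(-Q/10).

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    Assumes Phred+33 (Sanger / Illumina 1.8+) encoding.
    """
    return [ord(ch) - offset for ch in quality]


def mean_q(quality):
    """Average Q-score over all bases of a read."""
    scores = phred_scores(quality)
    if not scores:
        raise ValueError("empty quality string")
    return sum(scores) / len(scores)


def expected_errors(quality):
    """Expected number of errors: sum of per-base probabilities 10^(-Q/10)."""
    return sum(10 ** (-q / 10) for q in phred_scores(quality))


def passes_filter(quality, min_mean_q=None, max_ee=None):
    """Apply whichever cut-offs are given (both thresholds are illustrative)."""
    if min_mean_q is not None and mean_q(quality) < min_mean_q:
        return False
    if max_ee is not None and expected_errors(quality) > max_ee:
        return False
    return True


if __name__ == "__main__":
    # Eight high-quality bases (Q40, 'I') followed by two poor ones (Q2, '#').
    qual = "IIIIIIII##"
    print("mean Q:", round(mean_q(qual), 2))
    print("expected errors:", round(expected_errors(qual), 4))
    print("passes mean Q >= 20:", passes_filter(qual, min_mean_q=20))
    print("passes maxEE <= 1.0:", passes_filter(qual, max_ee=1.0))
```

In this example the read has a high average Q-score, yet its two low-quality bases alone contribute more than one expected error, so it passes the average-Q cut-off and fails the expected-error cut-off. An average can hide a few very poor bases in this way, whereas the expected-error sum counts each of them in full.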
