Using QC-Blind for Quality Control and Contamination Screening of Bacteria DNA Sequencing Data Without Reference Genome

Quality control in next-generation sequencing has become increasingly important as the technique has come into wide use. Existing tools can filter likely contaminants from the sequencing data of species with a known reference genome. However, these tools require reference genomes for all the species involved, including the contaminants. This excludes many real-life samples for which no complete genome of the target species is available and which are contaminated with unknown microbial species. In this work we propose QC-Blind, a novel quality control pipeline for removing contaminants without any use of reference genomes. The pipeline requires only a small amount of information from the marker genes of the target species. It consists of four steps: unsupervised read assembly, contig binning, read clustering, and marker gene assignment. When evaluated on in silico, ab initio and in vivo datasets, QC-Blind proved effective in removing unknown contaminants with high specificity and accuracy, while preserving most of the genomic information of the target bacterial species. Therefore, QC-Blind could serve well in situations where limited information is available for both target and contaminant species.

IMPORTANCE At present, many sequencing projects are still performed on potentially contaminated samples, which brings their accuracy into question. However, current reference-based quality control methods are limited, as they need the genome of either the target species or the contaminants. In this work we propose QC-Blind, a novel quality control pipeline for removing contaminants without any use of reference genomes. When evaluated on in silico, ab initio and in vivo datasets, QC-Blind proved effective in removing unknown contaminants with high specificity and accuracy, while preserving most of the genomic information of the target bacterial species. Therefore, QC-Blind is suitable for real-life samples where limited information is available for both target and contaminant species.
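
The last stage of the pipeline, marker gene assignment, can be illustrated with a toy sketch. It is not the QC-Blind implementation: it assumes the earlier stages have already produced a contig-to-bin table and a read-to-contig table, and it simply picks the bin with the most target marker gene hits. All names (`pick_target_bin`, `keep_target_reads`, the dictionaries and their contents) are hypothetical.

```python
from collections import Counter


def pick_target_bin(bin_of_contig, marker_hits):
    """Return the bin whose contigs carry the most target marker gene hits.

    bin_of_contig: dict mapping contig id -> bin id (assumed output of binning)
    marker_hits:   dict mapping contig id -> number of marker gene hits
    """
    score = Counter()
    for contig, hits in marker_hits.items():
        bin_id = bin_of_contig.get(contig)
        if bin_id is not None:
            score[bin_id] += hits
    if not score:
        return None  # no marker gene landed in any bin
    return max(score, key=score.get)


def keep_target_reads(contig_of_read, bin_of_contig, target_bin):
    """Return the sorted ids of reads whose contig belongs to the target bin."""
    if target_bin is None:
        return []
    return sorted(
        read
        for read, contig in contig_of_read.items()
        if bin_of_contig.get(contig) == target_bin
    )


# Hypothetical toy data: two bins, marker genes found mostly in bin "A".
bin_of_contig = {"c1": "A", "c2": "A", "c3": "B"}
marker_hits = {"c1": 3, "c3": 1}
contig_of_read = {"r1": "c1", "r2": "c3", "r3": "c2", "r4": None}

target = pick_target_bin(bin_of_contig, marker_hits)
kept = keep_target_reads(contig_of_read, bin_of_contig, target)
print(target, kept)
```

In this toy case, reads in the contaminant bin and reads that map to no contig are both discarded; a real pipeline would need a policy for unassembled reads, which the abstract does not specify.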
