SAG-QC: quality control of single amplified genome information by subtracting non-target sequences based on sequence compositions

Background
Whole genome amplification (WGA) techniques have enabled the analysis of unexplored genomic information through sequencing of single amplified genomes (SAGs). However, WGA of single bacteria remains challenging because contamination often occurs during experimental processes. To increase confidence in the analysis of sequenced SAGs, bioinformatics approaches that identify and exclude non-target sequences from SAGs are therefore required. Because currently reported approaches rely on sequence information in public databases, they are limited when new strains are the targets of interest. Here, we developed SAG-QC, a software tool that identifies and excludes non-target sequences independently of any database.

Results
Our method uses “no template control” sequences acquired during WGA. We calculated the probability that a sequence was derived from contaminants by comparing its k-mer composition with that of the no template control sequences. In tests using simulated SAG datasets, our method predicted non-target sequences more accurately than currently reported techniques. Subsequently, we applied our tool to actual SAG datasets and evaluated the accuracy of the predictions.

Conclusions
Our method distinguishes SAGs from non-target sequences without relying on public sequence information. It will be effective when applied to SAG sequences of unexplored strains, and we anticipate that it will contribute to the correct interpretation of SAGs.
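To make the k-mer comparison described in the Results concrete, the following is a minimal sketch and not the SAG-QC implementation. It assumes tetranucleotides (k = 4), builds a pseudocount-smoothed k-mer profile from no template control sequences, and scores a query by its average log-likelihood ratio against a uniform background. The function names, the choice of k, the pseudocount, and the log-ratio score are all illustrative assumptions; the abstract does not specify how the probability is actually computed.

```python
from collections import Counter
from itertools import product
import math

K = 4  # assumed k-mer length; the abstract does not state the value used


def kmer_counts(seq, k=K):
    """Count k-mers in a sequence, skipping any k-mer with a non-ACGT base."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def kmer_profile(seqs, k=K, pseudocount=1.0):
    """Pooled k-mer frequencies over a set of sequences (hypothetical helper).

    A pseudocount keeps every k-mer probability strictly positive.
    """
    total = Counter()
    for s in seqs:
        total.update(kmer_counts(s, k))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    denom = sum(total.values()) + pseudocount * len(kmers)
    return {km: (total[km] + pseudocount) / denom for km in kmers}


def contaminant_score(seq, ntc_profile, k=K):
    """Mean per-k-mer log-likelihood ratio: NTC profile vs. uniform background.

    Positive values mean the query's composition resembles the no template
    control; negative values mean it does not. This is a stand-in score,
    not the probability defined in the paper.
    """
    counts = kmer_counts(seq, k)
    n = sum(counts.values())
    if n == 0:
        return 0.0
    uniform = 1.0 / 4 ** k
    return sum(c * math.log(ntc_profile[km] / uniform)
               for km, c in counts.items()) / n


# Toy example with made-up sequences (not real data).
ntc_sequences = ["AT" * 50]
profile = kmer_profile(ntc_sequences)
print(contaminant_score("ATATATATATAT", profile))  # resembles the control
print(contaminant_score("GCGCGCGCGCGC", profile))  # does not resemble it
```

In practice a threshold on such a score would have to be calibrated, for example on simulated SAG datasets like those mentioned in the Results.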
