Contaminations in (meta)genome data: An open issue for the scientific community

In recent years, the high throughput and low cost of next-generation sequencing (NGS) technologies have led to an increase in the amount of (meta)genomic data, revolutionizing genomic research. However, the quality of sequencing data can be compromised by experimental errors arising from flawed methods and protocols. This is a serious problem for the scientific community because it undermines the correctness of studies that rely on genomic sequence analysis. As a countermeasure, several alignment and taxonomic classification tools have been developed to uncover and correct such errors. This critical review reports some of the integrated software tools and pipelines used to detect contamination in reference genome databases and sequenced samples. In particular, it examines case studies of bacterial contamination, contamination of human origin, mitochondrial contamination of ancient DNA, and cross-contamination.
