Contaminant DNA in bacterial sequencing experiments is a major source of false genetic variability

Background: Contaminant DNA is a well-known confounding factor in molecular biology and in genomic repositories. Strikingly, analysis workflows for whole-genome sequencing (WGS) data commonly do not account for errors introduced by contamination, which can lead to incorrect estimates of allele frequencies in both basic and clinical research.

Results: We used a taxonomic filter to remove contaminant reads from more than 4000 bacterial samples from 20 different studies and performed a comprehensive evaluation of the extent and impact of contaminant DNA in WGS. We found that contamination is pervasive and can introduce large biases in variant analysis. These biases can result in hundreds of false-positive and false-negative SNPs, even for samples with slight contamination. Studies that investigate complex biological traits from sequencing data can be completely biased if contamination is neglected during the bioinformatic analysis, and we demonstrate that removing contaminant reads with a taxonomic classifier permits more accurate variant calling. We used both real and simulated data to evaluate and implement reliable, contamination-aware analysis pipelines.

Conclusion: As sequencing technologies consolidate as precision tools that are increasingly adopted in research and clinical contexts, our results call for the implementation of contamination-aware analysis pipelines. Taxonomic classifiers are a powerful tool for implementing such pipelines.
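As an illustration of what a taxonomic read filter might look like, the following is a minimal Python sketch and not the pipeline used in the study. It assumes per-read classifier output in a Kraken-style tab-separated layout (classification status `C`/`U`, read ID, assigned taxon ID, and further columns) and a user-supplied set of accepted taxon IDs; the function names `reads_to_keep` and `filter_fastq` are hypothetical. Only reads assigned to an accepted taxon are passed on to variant calling.

```python
from typing import Iterable, Iterator, Set, Tuple


def reads_to_keep(classifier_lines: Iterable[str], target_taxids: Set[str]) -> Set[str]:
    """Collect IDs of reads classified to one of the accepted taxon IDs.

    Assumes tab-separated lines: status (C/U), read ID, taxon ID, ...
    The caller must list every accepted taxon ID explicitly (for example
    the target species and its strains); no taxonomy tree is consulted.
    """
    keep: Set[str] = set()
    for line in classifier_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed or empty lines
        status, read_id, taxid = fields[0], fields[1], fields[2]
        if status == "C" and taxid in target_taxids:
            keep.add(read_id)
    return keep


def filter_fastq(fastq_lines: Iterable[str], keep: Set[str]) -> Iterator[Tuple[str, ...]]:
    """Yield 4-line FASTQ records whose read ID is in `keep`."""
    record = []
    for line in fastq_lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            # Header looks like "@read_id optional description"
            read_id = record[0][1:].split()[0]
            if read_id in keep:
                yield tuple(record)
            record = []


# Toy example: taxon "1773" stands in for the target organism.
classified = [
    "C\tr1\t1773\t100\t1773:70",
    "C\tr2\t9606\t100\t9606:70",  # assigned to a non-target taxon
    "U\tr3\t0\t100\t0:70",        # unclassified
]
fastq = ["@r1 sample", "ACGT", "+", "IIII", "@r2", "TTTT", "+", "IIII"]
kept_ids = reads_to_keep(classified, {"1773"})
kept_records = list(filter_fastq(fastq, kept_ids))
```

In this sketch unclassified reads are discarded along with contaminant reads; whether to keep them is a design choice that trades sensitivity against the risk of retaining contamination.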
