Evaluation of methods for detecting human reads in microbial sequencing datasets

Sequencing data from host-associated microbes can often be contaminated with DNA from the body of the investigator or research subject. Human DNA is typically removed from microbial read sets either by subtractive alignment (dropping all reads that map to the human genome) or by using a read classification tool to predict which reads are of human origin and then discarding them. To inform best practice guidelines, we benchmarked eight alignment-based and two classification-based methods of human read detection using simulated data from 10 clinically prevalent bacteria and three viruses, to which contaminating human reads had been added. While the majority of methods successfully detected >99 % of the human reads, they could be distinguished by the variance in their performance. The most precise methods, with negligible variance, were Bowtie2 and SNAP, both of which misidentified few, if any, bacterial reads (and no viral reads) as human. While correctly detecting a similar number of human reads, methods based on taxonomic classification, such as Kraken2 and Centrifuge, could misclassify bacterial reads as human, although the extent of this was species-specific. BWA was among the most sensitive methods of human read detection, although it also made the greatest number of false positive classifications. Across all methods, the human reads not identified as such, although often representing <0.1 % of the total reads, were non-randomly distributed along the human genome, with many originating from the repeat-rich sex chromosomes. For viral reads and longer (>300 bp) bacterial reads, the highest performing approaches were classification-based, using Kraken2 or Centrifuge. For shorter (c. 150 bp) bacterial reads, combining multiple methods of human read detection maximized the recovery of human reads from contaminated short read datasets without being compromised by false positives.
A particularly high-performance approach with shorter bacterial reads was a two-stage classification using Bowtie2 followed by SNAP. Using this approach, we re-examined 11 577 publicly archived bacterial read sets for hitherto undetected human contamination. We were able to extract a sufficient number of reads to call known human SNPs, including those with clinical significance, in 6 % of the samples. These results show that phenotypically distinct human sequence is detectable in publicly archived microbial read datasets.
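
The two-stage idea described above, in which reads that survive a first detector are passed to a second and each method is scored against the known origin of simulated reads, can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the function names `two_stage_detect` and `score` are hypothetical, and each detector is assumed to be a callable that takes read IDs and returns the subset it flags as human (standing in for, say, reads mapped by Bowtie2 or SNAP).

```python
from typing import Callable, Iterable, Set, Tuple

# Hypothetical detector type: takes read IDs, returns the IDs it flags as human.
Detector = Callable[[Iterable[str]], Set[str]]


def two_stage_detect(read_ids: Iterable[str],
                     stage_one: Detector,
                     stage_two: Detector) -> Set[str]:
    """Flag reads as human using two detectors applied in sequence.

    Only reads NOT flagged by the first detector are shown to the second,
    mirroring subtractive filtering followed by a second pass.
    """
    read_ids = list(read_ids)
    flagged_first = set(stage_one(read_ids))
    survivors = [r for r in read_ids if r not in flagged_first]
    flagged_second = set(stage_two(survivors))
    return flagged_first | flagged_second


def score(predicted_human: Set[str], true_human: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) of human read detection.

    With simulated data the true origin of every read is known, so
    false positives are microbial reads wrongly flagged as human.
    """
    tp = len(predicted_human & true_human)
    fp = len(predicted_human - true_human)
    fn = len(true_human - predicted_human)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy data (invented for illustration): three human and two bacterial reads.
    reads = ["h1", "h2", "h3", "b1", "b2"]
    truth = {"h1", "h2", "h3"}
    first = lambda ids: {r for r in ids if r in {"h1", "h2"}}
    second = lambda ids: {r for r in ids if r == "h3"}
    flagged = two_stage_detect(reads, first, second)
    print(sorted(flagged), score(flagged, truth))
```

In this toy case the second stage recovers the one human read missed by the first, and no bacterial read is flagged. With real tools, each detector would wrap the parsing of that tool's output, which is outside the scope of this sketch.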
