Human contamination in bacterial genomes has created thousands of spurious proteins

Contaminant sequences that appear in published genomes can cause numerous problems for downstream analyses, particularly for evolutionary studies and metagenomics projects. Our large-scale scan of complete and draft bacterial and archaeal genomes in the NCBI RefSeq database reveals that 2250 genomes are contaminated by human sequence. The contaminant sequences derive primarily from high-copy human repeat regions, which themselves are not adequately represented in the current human reference genome, GRCh38. The absence of the sequences from the human assembly offers a likely explanation for their presence in bacterial assemblies. In some cases, the contaminating contigs have been erroneously annotated as containing protein-coding sequences, which over time have propagated to create spurious protein "families" across multiple prokaryotic and eukaryotic genomes. As a result, 3437 spurious protein entries are currently present in the widely used nr and TrEMBL protein databases. We report here an extensive list of contaminant sequences in bacterial genome assemblies and the proteins associated with them. We found that nearly all contaminants occurred in small contigs in draft genomes, which suggests that filtering out small contigs from draft genome assemblies may mitigate the issue of contamination while still keeping nearly all of the genuine genomic sequences.
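The small-contig filter suggested above can be illustrated with a minimal sketch. It is not the procedure used in this study. The function names and the `min_length` cutoff are hypothetical, and an appropriate threshold would need to be chosen for each assembly. The sketch reads FASTA records, splits contigs at the cutoff, and reports the fraction of bases retained, so that the trade-off between removing likely contaminants and keeping genuine sequence can be checked.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_small_contigs(
    records: Iterable[Record], min_length: int = 1000
) -> Tuple[List[Record], List[Record]]:
    """Split contigs into (kept, dropped) by length.

    min_length is an illustrative assumption, not a value from the paper.
    """
    kept: List[Record] = []
    dropped: List[Record] = []
    for header, seq in records:
        if len(seq) >= min_length:
            kept.append((header, seq))
        else:
            dropped.append((header, seq))
    return kept, dropped


def retained_fraction(kept: List[Record], dropped: List[Record]) -> float:
    """Fraction of all bases that remain after filtering."""
    kept_bp = sum(len(seq) for _, seq in kept)
    total_bp = kept_bp + sum(len(seq) for _, seq in dropped)
    return kept_bp / total_bp if total_bp else 0.0


if __name__ == "__main__":
    # Toy assembly: one 20 bp contig and one 4 bp contig.
    toy = ">contig_1\nACGTACGTAC\nGTACGTACGT\n>contig_2\nACGT\n"
    kept, dropped = filter_small_contigs(parse_fasta(toy.splitlines()), min_length=10)
    print([h for h, _ in kept], [h for h, _ in dropped])
    print(round(retained_fraction(kept, dropped), 3))
```

Contigs dropped by such a filter could be screened against human repeat sequences before they are discarded. That would help separate true contaminants from short but genuine genomic fragments.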
