Symbiont-Screener: a reference-free filter to automatically separate host sequences and contaminants for long reads or co-barcoded reads by unsupervised clustering

Decontamination is necessary to eliminate the effect of foreign genomes on symbiont studies and biomedical discoveries. However, directly extracting host sequencing reads without references remains challenging. Here, we present a trio-based method that classifies error-prone long reads or sparse co-barcoded reads from the host prior to assembly, without any alignment against DNA or protein references. The method first identifies high-confidence host reads using haplotype-specific k-mers inherited from the parents, and then groups the remaining host reads by unsupervised clustering. On artificially mixed data, this approach correctly classified up to 97.38% of the host human long reads with a precision of 99.9999%, and 79.95% of the host co-barcoded reads with a precision of 98.36%. The tool also performed well in decontaminating real algae data. The purified reads reconstructed two haplotypes and improved the assembly, yielding a larger contig NGA50 and fewer misassemblies. Symbiont-Screener can be freely downloaded at https://github.com/BGI-Qingdao/Symbiont-Screener.
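
The two-stage idea described above (seed host reads with parent-specific k-mers, then recover the remaining host reads by unsupervised clustering) can be illustrated with a minimal Python sketch. This is not the Symbiont-Screener implementation: the function names, the toy sequences, the k-mer size, and the `min_hits` threshold are all hypothetical, and a tiny one-dimensional two-means clustering on GC content stands in for whatever features and clustering model the tool actually uses.

```python
from typing import Dict, List, Set, Tuple


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_specific_kmers(paternal: List[str], maternal: List[str],
                          k: int) -> Tuple[Set[str], Set[str]]:
    """K-mers seen in only one parent (a simple proxy for haplotype-specific k-mers)."""
    pat = set().union(*(kmers(s, k) for s in paternal))
    mat = set().union(*(kmers(s, k) for s in maternal))
    return pat - mat, mat - pat


def seed_host_reads(reads: Dict[str, str], markers: Set[str],
                    k: int, min_hits: int = 1) -> Set[str]:
    """Names of reads carrying at least min_hits marker k-mers (assumed threshold)."""
    return {name for name, seq in reads.items()
            if len(kmers(seq, k) & markers) >= min_hits}


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in seq."""
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def two_means_1d(values: List[float], iters: int = 20) -> List[int]:
    """Tiny 1-D k-means with k=2 (stand-in clustering); returns a 0/1 label per value."""
    if not values:
        return []
    centers = [min(values), max(values)]
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
                  for v in values]
        for c in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centers[c] = sum(members) / len(members)
    return labels


def screen(reads: Dict[str, str], paternal: List[str], maternal: List[str],
           k: int = 5, min_hits: int = 1) -> Set[str]:
    """Return names of reads classified as host in this toy model."""
    pat_only, mat_only = parent_specific_kmers(paternal, maternal, k)
    seeds = seed_host_reads(reads, pat_only | mat_only, k, min_hits)
    names = list(reads)
    labels = two_means_1d([gc_content(reads[n]) for n in names])
    # The cluster holding the most seed reads is taken as the host cluster.
    votes = [0, 0]
    for name, lab in zip(names, labels):
        if name in seeds:
            votes[lab] += 1
    host_label = 0 if votes[0] >= votes[1] else 1
    return seeds | {n for n, lab in zip(names, labels) if lab == host_label}


# Toy data: the parents differ at one base; contaminant reads are GC-rich.
paternal = ["ATTATAATTAGATTATA"]
maternal = ["ATTATAATTACATTATA"]
reads = {
    "h1": "AATTAGATTAT",   # host read carrying a paternal-specific k-mer
    "h2": "ATTATAATTATA",  # host read with no marker k-mer
    "c1": "GCGCGGCCGCGC",  # contaminant
    "c2": "GGCCGCGGCGCC",  # contaminant
}
host = screen(reads, paternal, maternal, k=5)
print(sorted(host))
```

In this toy case only `h1` is seeded by a marker k-mer, while `h2` is recovered because it falls in the same cluster as the seed. Real data would need a much larger k, error-tolerant thresholds, and richer features than GC content alone.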
