An encoding of genome content for machine learning

Abstract: An ever-growing number of metagenomes can be used for biomining and the study of microbial functions. The use of learning algorithms in this context has been hindered because they often need input in the form of low-dimensional, dense vectors of numbers. We propose such a representation for genomes, called nanotext, that scales to very large data sets. The underlying model is learned from a corpus of nearly 150 thousand genomes spanning 750 million protein domains. We treat the protein domains in a genome like words in a document, assuming that protein domains that occur in similar contexts have a similar “meaning”. A neural network distributes this meaning over a vector of numbers. The resulting vectors efficiently encode function, preserve known phylogeny, capture subtle functional relationships and are robust against genome incompleteness. The “functional” distance between two vectors complements nucleotide-based distance, so that genomes can be identified as similar even when their nucleotide identity is low. nanotext can thus encode (meta)genomes for direct use in downstream machine learning tasks. We show this by predicting plausible culture media for metagenome-assembled genomes (MAGs) from the Tara Oceans Expedition using only their genome content. nanotext is freely released under a BSD licence (https://github.com/phiweger/nanotext).
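The idea of reading a genome as a document of protein-domain "words" and comparing genomes by a functional distance can be illustrated with a minimal sketch. It is not the nanotext model: in place of the learned neural embedding it uses plain domain-count vectors and cosine distance, which is only a crude stand-in. The genome names and the domain identifiers are hypothetical and chosen for illustration.

```python
import math
from collections import Counter

# Hypothetical toy genomes: each is a list of Pfam-style domain IDs,
# read like the words of a document. IDs are illustrative only.
genomes = {
    "genome_a": ["PF00005", "PF00072", "PF00486", "PF00005"],
    "genome_b": ["PF00005", "PF00072", "PF00486", "PF00528"],
    "genome_c": ["PF00124", "PF00421", "PF00223"],
}


def count_vector(domains):
    """Sparse count vector: domain ID -> number of occurrences."""
    return Counter(domains)


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # treat an empty genome as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


vectors = {name: count_vector(doms) for name, doms in genomes.items()}

# Genomes sharing most domains are close; genomes sharing none are far apart.
d_ab = cosine_distance(vectors["genome_a"], vectors["genome_b"])
d_ac = cosine_distance(vectors["genome_a"], vectors["genome_c"])
print(f"a vs b: {d_ab:.3f}")
print(f"a vs c: {d_ac:.3f}")
```

Count vectors are sparse and as wide as the domain vocabulary, which is the limitation the abstract points to; a learned embedding would replace `count_vector` with a low-dimensional dense vector while the distance computation stays the same.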
