16S rRNA sequence embeddings: Meaningful numeric feature representations of nucleotide sequences that are convenient for downstream analyses

Advances in high-throughput sequencing have increased the availability of microbiome sequencing data that can be exploited to characterize microbiome community structure in situ. We explore word and sentence embedding approaches for nucleotide sequences because they may provide a suitable numerical representation for downstream machine learning applications (especially deep learning). This work involves first encoding (“embedding”) each sequence into a dense, low-dimensional, numeric vector space. Here, we use Skip-Gram word2vec to embed k-mers obtained from 16S rRNA amplicon surveys, and then leverage an existing sentence embedding technique to embed all sequences belonging to specific body sites or samples. We demonstrate that these representations are meaningful, and hence that the embedding space can be exploited as a form of feature extraction for exploratory analysis. We show that sequence embeddings preserve relevant information about the sequencing data, such as k-mer context, sequence taxonomy, and sample class. Specifically, the sequence embedding space resolved differences among phyla, as well as differences among genera within the same family. Distances between sequence embeddings had similar qualities to distances between alignment identities, and embedding multiple sequences can be thought of as generating a consensus sequence. In addition, embeddings are versatile features that can be used for many downstream tasks, such as taxonomic and sample classification. Using sample embeddings for body site classification resulted in negligible performance loss compared to using OTU abundance data, and clustering embeddings yielded high-fidelity species clusters. Lastly, the k-mer embedding space captured distinct k-mer profiles that mapped to specific regions of the 16S rRNA gene and corresponded with particular body sites.
Together, our results show that embedding sequences results in meaningful representations that can be used for exploratory analyses or for downstream machine learning applications that require numeric data. Moreover, because the embeddings are trained in an unsupervised manner, unlabeled data can be embedded and used to bolster supervised machine learning tasks.
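
As a rough illustration of the pipeline described above, the following minimal Python sketch splits reads into overlapping k-mers and combines per-k-mer vectors into a single sequence embedding. It is a sketch under stated assumptions, not the authors' implementation: the k-mer vectors are random placeholders standing in for trained Skip-Gram word2vec vectors, the frequency-based weighting `a / (a + freq)` is one common sentence-embedding choice that the abstract does not confirm, and the names `kmers`, `embed_sequence`, `K`, `DIM`, and the toy reads are hypothetical.

```python
import random
from collections import Counter


def kmers(seq, k):
    """Return the overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def embed_sequence(seq, k, vectors, freqs, a=1e-3):
    """Weighted mean of k-mer vectors; frequent k-mers get less weight.

    The weight a / (a + freq) is an assumed weighting scheme, chosen
    only for illustration.
    """
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for km in kmers(seq, k):
        if km not in vectors:
            continue  # skip k-mers without a vector
        w = a / (a + freqs[km])
        for j, x in enumerate(vectors[km]):
            total[j] += w * x
        weight_sum += w
    if weight_sum == 0.0:
        return total  # no known k-mers: zero vector
    return [t / weight_sum for t in total]


# Toy reads (hypothetical, far shorter than real 16S rRNA amplicons).
reads = ["ACGTACGTAC", "TTGACGTACG"]
K, DIM = 4, 8

# Relative k-mer frequencies across the toy corpus.
counts = Counter(km for r in reads for km in kmers(r, K))
n_total = sum(counts.values())
freqs = {km: c / n_total for km, c in counts.items()}

# Placeholder vectors: in practice these would come from a trained
# Skip-Gram word2vec model, not from a random number generator.
rng = random.Random(0)
vectors = {km: [rng.uniform(-1.0, 1.0) for _ in range(DIM)]
           for km in sorted(counts)}

embeddings = [embed_sequence(r, K, vectors, freqs) for r in reads]
```

Each entry of `embeddings` is a fixed-length numeric vector regardless of read length. Applying the same weighted mean over all k-mers from the reads of one sample or body site would give a sample-level embedding in the same way.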
