MicroPheno: predicting environments and host phenotypes from 16S rRNA gene sequencing using a k-mer based representation of shallow sub-samples

Motivation: Microbial communities play important roles in the function and maintenance of various biosystems, ranging from the human body to the environment. A major challenge in microbiome research is the classification of microbial communities according to their environments or host phenotypes. The most common and cost-effective approach for such studies to date is 16S rRNA gene sequencing. Recent reductions in sequencing costs have increased the demand for simple, efficient and accurate methods for rapid detection or diagnosis, with proven applications in medicine, agriculture and forensic science. We describe a reference- and alignment-free approach for predicting environments and host phenotypes from 16S rRNA gene sequencing. The approach is based on k-mer representations and uses a bootstrapping framework to investigate the sufficiency of shallow sub-samples. Both deep learning methods and classical approaches were explored for the prediction tasks.

Results: A k-mer distribution of shallow sub-samples outperformed Operational Taxonomic Unit (OTU) features in the tasks of body-site identification and Crohn's disease prediction. Aside from being more accurate, using k-mer features in shallow sub-samples (i) allows skipping the computationally costly sequence alignments required in OTU-picking and (ii) provides a proof of concept for the sufficiency of shallow and short-length 16S rRNA sequencing for phenotype prediction. In addition, k-mer features predicted representative 16S rRNA gene sequences of 18 ecological environments and 5 organismal environments with high macro-F1 scores of 0.88 and 0.87, respectively. For large datasets, deep learning outperformed classical methods such as Random Forest and Support Vector Machine.

Availability and implementation: The software and datasets are available at https://llp.berkeley.edu/micropheno.
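As an illustration of the idea summarized above, the following minimal Python sketch builds a normalized k-mer distribution from a shallow random sub-sample of reads and repeats the sub-sampling several times to see how close the sub-samples come to the full-sample distribution. It is not the MicroPheno implementation: the function names, the choice of sampling without replacement, and the use of an L1 distance as the comparison measure are all assumptions made for this example.

```python
import itertools
import random
from collections import Counter


def kmer_vocabulary(k, alphabet="ACGT"):
    """All 4**k possible k-mers over the nucleotide alphabet, in a fixed order."""
    return ["".join(p) for p in itertools.product(alphabet, repeat=k)]


def kmer_distribution(reads, k, index):
    """Normalized k-mer frequency vector for a collection of reads.

    k-mers containing characters outside the vocabulary (e.g. 'N') are skipped.
    """
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in index:
                counts[kmer] += 1
    total = sum(counts.values())
    vec = [0.0] * len(index)
    if total == 0:
        return vec
    for kmer, c in counts.items():
        vec[index[kmer]] = c / total
    return vec


def l1_distance(p, q):
    """Sum of absolute differences between two frequency vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


def subsample_distances(reads, k, sample_size, n_resamples=10, seed=0):
    """Distance of each shallow sub-sample's k-mer distribution to the full one.

    Hypothetical helper: draws `n_resamples` random sub-samples of
    `sample_size` reads (without replacement) and compares each to the
    distribution computed from all reads. Small distances suggest that the
    sub-sample size is sufficient to represent the sample.
    """
    rng = random.Random(seed)
    index = {kmer: i for i, kmer in enumerate(kmer_vocabulary(k))}
    full = kmer_distribution(reads, k, index)
    n = min(sample_size, len(reads))
    return [
        l1_distance(kmer_distribution(rng.sample(reads, n), k, index), full)
        for _ in range(n_resamples)
    ]


if __name__ == "__main__":
    # Toy reads, invented for illustration only.
    toy_reads = ["ACGTACGTAC", "TTGACCGTAA", "ACGTTTGACC", "GGGACGTACG"] * 25
    for size in (5, 20, 80):
        d = subsample_distances(toy_reads, k=3, sample_size=size)
        print(size, round(sum(d) / len(d), 4))
```

The resulting fixed-length frequency vectors could then be passed to any standard classifier, such as a Random Forest, a Support Vector Machine or a neural network, without an alignment or OTU-picking step.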
