Taxonomy-aware feature engineering for microbiome classification

Background: What is a healthy microbiome? The pursuit of this and many related questions, especially in light of the recently recognized microbial component in a wide range of diseases, has sparked a surge in metagenomic studies. Such diseases are often not attributable to a single pathogen but rather result from complex ecological processes. Relatedly, the increasing DNA sequencing depth and number of samples in metagenomic case-control studies have enabled the application of powerful statistical methods, e.g. Machine Learning approaches. For the latter, the feature space is typically shaped by the relative abundances of operational taxonomic units, as determined by cost-effective phylogenetic marker gene profiles. While a substantial body of microbiome/microbiota research involves unsupervised and supervised Machine Learning, very little attention has been paid to feature selection and engineering.

Results: We here propose the first algorithm to exploit phylogenetic hierarchy (i.e. an all-encompassing taxonomy) in feature engineering for microbiota classification. The rationale is to exploit the often mono- or oligophyletic distribution of relevant (but hidden) traits by means of taxonomic abstraction: a trait shared by the members of one or a few clades can be captured by a feature at the level of those clades rather than by many individual operational taxonomic units. The algorithm is embedded in a comprehensive microbiota classification pipeline, which we applied to a diverse range of datasets, distinguishing healthy from diseased microbiota samples.

Conclusion: We demonstrate substantial improvements in classification accuracy over state-of-the-art microbiota classification tools, regardless of the Machine Learning technique used, while working with drastically reduced feature spaces. Moreover, the generalized features have great explanatory value: they provide a concise description of conditions and thus help to provide pathophysiological insights. Indeed, the automatically and reproducibly derived features are consistent with previously published domain expert analyses.
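The idea of taxonomic abstraction can be illustrated with a minimal sketch. It is not the algorithm proposed here, only one plausible building block: summing the relative abundances of operational taxonomic units into every clade that contains them, so that each clade becomes a candidate generalized feature. The function name `aggregate_by_taxonomy`, the OTU identifiers, and the toy lineages are all hypothetical, and how candidate clades would then be selected is not shown.

```python
from collections import defaultdict


def aggregate_by_taxonomy(abundances, lineages):
    """Sum OTU relative abundances into every ancestor clade of each OTU.

    abundances: dict mapping OTU id -> relative abundance in one sample
    lineages:   dict mapping OTU id -> tuple of taxon names, root to leaf
    Returns a dict mapping each lineage prefix (a tuple) to its summed abundance.
    """
    clade_totals = defaultdict(float)
    for otu, abundance in abundances.items():
        lineage = lineages[otu]
        # Every prefix of the lineage is one clade that contains this OTU.
        for depth in range(1, len(lineage) + 1):
            clade_totals[lineage[:depth]] += abundance
    return dict(clade_totals)


# Hypothetical toy data: three OTUs in one sample (illustrative only).
lineages = {
    "otu1": ("Bacteria", "Firmicutes", "Clostridia"),
    "otu2": ("Bacteria", "Firmicutes", "Bacilli"),
    "otu3": ("Bacteria", "Bacteroidetes", "Bacteroidia"),
}
sample = {"otu1": 0.25, "otu2": 0.25, "otu3": 0.5}

features = aggregate_by_taxonomy(sample, lineages)
for clade, total in sorted(features.items()):
    print(" > ".join(clade), total)
```

In this toy sample, the two Firmicutes OTUs collapse into a single phylum-level feature. A classifier could use that feature in place of the two individual OTUs if the relevant trait is shared across the phylum.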
