Phylogeny-based classification of microbial communities

MOTIVATION: Next-generation sequencing coupled with metagenomics has led to the rapid growth of sequence databases and enabled a new branch of microbiology called comparative metagenomics. Comparative metagenomic analysis studies compositional patterns within and between different environments, providing deep insight into the structure and function of complex microbial communities. It is a fast-growing field that requires novel supervised learning techniques to address challenges associated with metagenomic data, such as sensitivity to the choice of sequence similarity cutoff used to define operational taxonomic units (OTUs) and the high dimensionality and sparsity of the data. On the other hand, the natural properties of microbial community data may provide useful information about the structure of the data. For example, the similarity between species encoded by a phylogenetic tree captures the relationships between OTUs and may be useful for the analysis of complex microbial datasets whose diversity patterns comprise features at multiple taxonomic levels. Although learning algorithms in the literature have addressed some of these challenges, none of the available methods takes advantage of these inherent properties of metagenomic data.

RESULTS: We propose a novel supervised classification method for microbial community samples, in which each sample is represented as a set of OTU frequencies. The method takes advantage of the natural structure in microbial community data encoded by a phylogenetic tree, which allows it to exploit environment-specific compositional patterns that may contain features at multiple granularity levels. It is based on the multinomial logistic regression model with a tree-guided penalty function. Additionally, we propose a new simulation framework for generating 16S ribosomal RNA gene read counts that may be useful in comparative metagenomics research. Our experimental results on simulated and real data show that the phylogenetic information used in our method improves classification accuracy.

AVAILABILITY AND IMPLEMENTATION: http://www.cs.ucr.edu/~tanaseio/metaphyl.htm.
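
As a rough illustration of how a tree-guided penalty can be combined with a multinomial logistic regression loss, the sketch below builds one group of OTUs per node of a phylogenetic tree and penalizes the L2 norm of the coefficients in each group. It is not the authors' implementation: the nested-tuple tree encoding, the uniform node weights, the regularization parameter `lam`, and all function names are assumptions made for this example, and the actual method may weight nodes differently.

```python
import math

def leaf_groups(tree):
    """Return (leaves, groups) for a tree given as nested tuples of OTU indices.

    Every node (leaf or internal) contributes one group: the OTU indices
    of all leaves below it. This encoding is an assumption of the sketch.
    """
    if isinstance(tree, int):
        return [tree], [[tree]]
    leaves, groups = [], []
    for child in tree:
        child_leaves, child_groups = leaf_groups(child)
        leaves.extend(child_leaves)
        groups.extend(child_groups)
    groups.append(list(leaves))  # group for this internal node
    return leaves, groups

def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def neg_log_likelihood(W, X, y):
    """Mean multinomial logistic loss.

    W[k][j]: coefficient of OTU j for class k.
    X[i][j]: frequency of OTU j in sample i; y[i]: class index of sample i.
    """
    loss = 0.0
    for x, label in zip(X, y):
        scores = [sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in W]
        loss -= math.log(softmax(scores)[label])
    return loss / len(X)

def tree_penalty(W, groups):
    """Sum over tree nodes of the L2 norm of that node's coefficients
    (all classes, all OTUs below the node), with uniform node weights."""
    return sum(
        math.sqrt(sum(W[k][j] ** 2 for k in range(len(W)) for j in g))
        for g in groups
    )

def objective(W, X, y, groups, lam):
    """Penalized objective: data-fit term plus lam times the tree penalty."""
    return neg_log_likelihood(W, X, y) + lam * tree_penalty(W, groups)

# Toy example: three OTUs, where OTUs 0 and 1 are sister taxa.
tree = ((0, 1), 2)
_, groups = leaf_groups(tree)
W = [[3.0, 4.0, 0.0], [0.0, 0.0, 0.0]]   # two classes, three OTUs
X = [[0.5, 0.5, 0.0], [0.0, 0.2, 0.8]]   # two samples of OTU frequencies
y = [0, 1]
value = objective(W, X, y, groups, lam=0.1)
```

Because internal nodes group related OTUs, this kind of penalty can zero out a whole clade at once, while the single-leaf groups still allow individual OTUs to be selected. That is one way to capture features at several taxonomic levels.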
