GrammR: graphical representation and modeling of count data with application in metagenomics

MOTIVATION: The composition of the microbiota has important implications for human health, including obesity and other conditions. It is therefore important to cluster samples or taxa in order to visualize and discover community substructures. Graphical representation of metagenomic count data relies on two components: a measure of dissimilarity between samples or taxa, and an algorithm that estimates coordinates from those dissimilarities. UniFrac is a dissimilarity measure commonly used in metagenomic research, but it requires a phylogenetic tree. Principal coordinate analysis (PCoA) is a popular algorithm for estimating two-dimensional (2D) coordinates for graphical representation, although alternative and higher-dimensional representations may reveal underlying community substructures that are invisible in 2D representations.

RESULTS: We adapt a new measure of dissimilarity, penalized Kendall's τ-distance, which does not depend on a phylogenetic tree and is hence more readily applicable to a wider class of problems. Further, we propose metric multidimensional scaling (MDS) as an alternative to PCoA for graphical representation. We then devise a novel procedure, termed mPAM, for determining the number of clusters in conjunction with partitioning around medoids (PAM). We show superior performance with higher-dimensional representations. We further demonstrate the utility of mPAM for accurate clustering analysis, especially with higher-dimensional MDS models. Applications to two human microbiota datasets illustrate that a higher-dimensional analysis gives greater insight into the subcommunity structure.
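The two components described above, a tree-free rank-based dissimilarity and a metric MDS embedding, can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not define the penalty, so the sketch assumes a Kendall-type distance in which a discordant pair of taxa costes 1 and a pair tied in exactly one of the two samples costs an assumed penalty `p = 0.5`. Metric MDS is fitted here with SMACOF (iterated Guttman transforms), which is one standard choice and may differ from the paper's fitting procedure. All function names and the toy count matrix are hypothetical.

```python
import numpy as np


def penalized_kendall_distance(x, y, p=0.5):
    """Normalized Kendall-type distance between two count vectors.

    Assumption: discordant pairs cost 1 and pairs tied in exactly one
    vector cost p. The paper's exact penalty may differ.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    i, j = np.triu_indices(len(x), k=1)      # all unordered pairs of taxa
    sx = np.sign(x[i] - x[j])
    sy = np.sign(y[i] - y[j])
    discordant = (sx * sy) < 0               # ordered oppositely
    tied_in_one = (sx == 0) ^ (sy == 0)      # tied in exactly one vector
    return (discordant.sum() + p * tied_in_one.sum()) / len(i)


def distance_matrix(counts, p=0.5):
    """Pairwise distances between the rows (samples) of a count matrix."""
    n = counts.shape[0]
    d = np.zeros((n, n))
    for a in range(n):
        for b in range(a + 1, n):
            d[a, b] = d[b, a] = penalized_kendall_distance(counts[a], counts[b], p)
    return d


def pairwise_euclidean(x):
    """Euclidean distances between all rows of a coordinate matrix."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def stress(d, x):
    """Raw stress: squared mismatch between target and embedded distances."""
    return ((d - pairwise_euclidean(x)) ** 2).sum() / 2


def metric_mds(d, dim=3, n_iter=300, seed=0):
    """Metric MDS fitted by SMACOF with unit weights."""
    rng = np.random.default_rng(seed)
    n = d.shape[0]
    x = rng.normal(size=(n, dim))            # random starting configuration
    for _ in range(n_iter):
        dist = pairwise_euclidean(x)
        with np.errstate(divide="ignore", invalid="ignore"):
            ratio = np.where(dist > 0, d / dist, 0.0)
        b = -ratio
        b[np.diag_indices(n)] = ratio.sum(axis=1)
        x = b @ x / n                        # Guttman transform
    return x


# Toy data: 4 samples (rows) by 5 taxa (columns), purely illustrative.
counts = np.array([
    [10, 0, 3, 7, 1],
    [12, 1, 2, 9, 0],
    [0, 8, 5, 1, 6],
    [1, 9, 4, 0, 7],
])
d = distance_matrix(counts)
coords_2d = metric_mds(d, dim=2)
coords_3d = metric_mds(d, dim=3)
```

Comparing `stress(d, coords_2d)` with `stress(d, coords_3d)` gives a rough indication of how much of the dissimilarity structure a 2D picture fails to capture. The resulting coordinates or the distance matrix itself could then be passed to a PAM-style clustering step, which is not sketched here.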
