Methods in Description and Validation of Local Metagenetic Microbial Communities

We propose MinHash (as implemented in Mash) and non-negative matrix factorization (NMF) as alternative methods for estimating similarity between metagenetic samples. We further characterize the resulting similarities with cluster analysis and with correlations against independent ecological metadata. Species and k-mer abundance information is used to compute similarities and form clusters, both to better understand how communities interact and to relate them to known environmental variables such as pH and soil conductivity. We use cluster silhouettes to assess several approaches to clustering metagenetic samples, and ANOVA to uncover links between metagenetic samples and the known environmental variables. We demonstrate the applicability of these methods by analyzing data from the Atacama Desert and determining the relationship between ecological factors and group membership.
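As an illustration of the MinHash idea behind the similarity estimates, the following is a minimal sketch, not the Mash implementation: it builds a bottom-s sketch of hashed k-mers for each sequence and estimates the Jaccard index from the merged sketches. The function names, the choice of BLAKE2b as the hash, and the toy sequences are assumptions made for this example only.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash64(kmer):
    """Deterministic 64-bit hash of a k-mer (hash choice is illustrative)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def sketch(seq, k, s):
    """Bottom-s sketch: the s smallest distinct k-mer hash values, sorted."""
    return sorted({hash64(x) for x in kmers(seq, k)})[:s]


def jaccard_estimate(sketch_a, sketch_b, s):
    """Estimate the Jaccard index from two bottom-s sketches.

    The s smallest hashes of the union are taken as a random sample of the
    union; the fraction of them present in both sketches is the estimate.
    """
    set_a, set_b = set(sketch_a), set(sketch_b)
    merged = sorted(set_a | set_b)[:s]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in set_a and h in set_b)
    return shared / len(merged)


# Toy sequences (hypothetical, for illustration only).
seq_a = "ACGTACGTGG"
seq_b = "ACGTACGTCC"
estimate = jaccard_estimate(sketch(seq_a, 4, 100), sketch(seq_b, 4, 100), 100)
print(estimate)
```

When the sketch size s is at least the number of distinct k-mers in the union, as in the toy call above, the estimate coincides with the exact Jaccard index; smaller s trades accuracy for speed and memory.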
