Latent Dirichlet Allocation reveals spatial and taxonomic structure in a DNA‐based census of soil biodiversity from a tropical forest

High‐throughput sequencing of amplicons from environmental DNA samples permits rapid, standardized and comprehensive biodiversity assessments. However, retrieving and interpreting the structure of such data sets requires efficient methods for dimensionality reduction. Latent Dirichlet Allocation (LDA) can be used to decompose environmental DNA samples into overlapping assemblages of co‐occurring taxa. It is a flexible model‐based method suited to uneven sample sizes and to large and sparse data sets. Here, we compare LDA performance on abundance and occurrence data, and we quantify the robustness of the LDA decomposition by measuring its stability with respect to the algorithm's initialization. We then apply LDA to a survey of 1,131 soil DNA samples that were collected in a 12‐ha plot of primary tropical forest and amplified using standard primers for bacteria, protists, fungi and metazoans. The analysis reveals that bacteria, protists and fungi exhibit a strong spatial structure that matches the topographical features of the plot, while metazoans do not, confirming that microbial diversity is primarily controlled by environmental variation at the studied scale. We conclude that LDA is a sensitive, robust and computationally efficient method to detect and interpret the structure of large DNA‐based biodiversity data sets. Finally, we discuss possible future applications of this approach for the study of biodiversity.
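
To make the idea of decomposing samples into overlapping assemblages concrete, the following is a minimal sketch and not the implementation used in the study. It fits LDA to a small synthetic sample-by-taxon count table with a plain collapsed Gibbs sampler, then refits from a different random initialization and compares the two solutions. Everything named here is an assumption made for illustration: the functions `simulate_counts`, `fit_lda` and `mismatch`, the synthetic data, the hyperparameter values `alpha` and `beta`, and the mismatch score, which is only a simple stand-in for a stability measure.

```python
import random
from itertools import permutations


def simulate_counts(n_samples=8, reads_per_sample=60, seed=1):
    """Hypothetical data: two assemblages mixed along a gradient across samples."""
    rng = random.Random(seed)
    profiles = [[0.4, 0.3, 0.3, 0.0, 0.0, 0.0],   # taxa typical of assemblage 0
                [0.0, 0.0, 0.0, 0.3, 0.3, 0.4]]   # taxa typical of assemblage 1
    counts = []
    for s in range(n_samples):
        share_1 = s / (n_samples - 1)             # share of assemblage 1, from 0 to 1
        row = [0] * 6
        for _ in range(reads_per_sample):
            k = 1 if rng.random() < share_1 else 0
            t = rng.choices(range(6), weights=profiles[k])[0]
            row[t] += 1
        counts.append(row)
    return counts


def fit_lda(counts, n_assemblages, alpha=0.1, beta=0.1, n_iter=200, seed=0):
    """Collapsed Gibbs sampler for LDA on a samples x taxa count matrix.

    Returns (theta, phi): theta[s][k] is the share of assemblage k in sample s,
    phi[k][t] is the relative abundance of taxon t in assemblage k.
    """
    rng = random.Random(seed)
    n_samples, n_taxa, K = len(counts), len(counts[0]), n_assemblages
    # Expand the table into individual reads, each a (sample, taxon) pair.
    reads = [(s, t) for s in range(n_samples) for t in range(n_taxa)
             for _ in range(counts[s][t])]
    z = [rng.randrange(K) for _ in reads]         # random initial assignments
    n_sk = [[0] * K for _ in range(n_samples)]    # reads per sample and assemblage
    n_kt = [[0] * n_taxa for _ in range(K)]       # reads per assemblage and taxon
    n_k = [0] * K                                 # reads per assemblage
    for (s, t), k in zip(reads, z):
        n_sk[s][k] += 1
        n_kt[k][t] += 1
        n_k[k] += 1
    for _ in range(n_iter):
        for i, (s, t) in enumerate(reads):
            k = z[i]                              # remove the read from its assemblage
            n_sk[s][k] -= 1
            n_kt[k][t] -= 1
            n_k[k] -= 1
            weights = [(n_sk[s][j] + alpha) * (n_kt[j][t] + beta)
                       / (n_k[j] + n_taxa * beta) for j in range(K)]
            k = rng.choices(range(K), weights=weights)[0]
            z[i] = k                              # reassign it and restore the counts
            n_sk[s][k] += 1
            n_kt[k][t] += 1
            n_k[k] += 1
    theta = [[(n_sk[s][k] + alpha) / (sum(n_sk[s]) + K * alpha) for k in range(K)]
             for s in range(n_samples)]
    phi = [[(n_kt[k][t] + beta) / (n_k[k] + n_taxa * beta) for t in range(n_taxa)]
           for k in range(K)]
    return theta, phi


def mismatch(theta_a, theta_b):
    """Mean absolute difference between two runs, minimised over label permutations."""
    K = len(theta_a[0])
    totals = []
    for perm in permutations(range(K)):
        total = sum(abs(row_a[k] - row_b[perm[k]])
                    for row_a, row_b in zip(theta_a, theta_b) for k in range(K))
        totals.append(total / (len(theta_a) * K))
    return min(totals)


counts = simulate_counts()
theta_a, phi_a = fit_lda(counts, n_assemblages=2, seed=10)
theta_b, phi_b = fit_lda(counts, n_assemblages=2, seed=20)
score = mismatch(theta_a, theta_b)                # 0 means identical decompositions
print("mismatch between two initializations:", round(score, 3))
```

Each row of `theta` gives the mixture of assemblages in one sample, which is what allows assemblages to overlap within a sample instead of assigning each sample to a single cluster. Because assemblage labels are arbitrary, two runs must be aligned by permutation before they are compared; the exhaustive search used here is only practical for a small number of assemblages. Replacing every nonzero count with 1 before fitting would give a crude occurrence-based variant of the same sketch.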
