A geometric view of Biodiversity: scaling to metagenomics

We have designed an efficient dimensionality reduction algorithm to investigate new ways of accurately characterizing biodiversity from a geometric point of view, one that scales to the large environmental data sets produced by NGS ($\sim 10^5$ sequences). The approach is based on Multidimensional Scaling (MDS), which maps a set of $n$ items to $n$ points in a low-dimensional Euclidean space, given the set of pairwise distances between the items. We compute all pairwise distances between the reads in a given sample, run MDS on the distance matrix, and analyze the projection on the first axes with visualization tools. We have circumvented the quadratic complexity of computing pairwise distances by implementing it on a massively parallel computer (Turing, a Blue Gene/Q), and the cubic complexity of the spectral decomposition by implementing a dense random-projection-based algorithm. We have applied this data analysis scheme to a set of $10^5$ reads, which are amplicons from a diatom environmental sample from Lake Geneva. Analyzing the shape of the point cloud paves the way for a geometric analysis of biodiversity, and for accurately building OTUs (Operational Taxonomic Units) when the data set is too large for unsupervised, hierarchical, high-dimensional clustering.
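
The abstract does not give the implementation, so the following is only a minimal sketch of how classical MDS can be combined with a dense Gaussian random projection to avoid a full eigendecomposition. It assumes the standard double-centering construction of the Gram matrix and a randomized range finder in the style of Halko, Martinsson and Tropp. The function name `mds_random_projection` and the parameters `oversample` and `n_power_iter` are illustrative choices, not taken from the paper.

```python
import numpy as np


def mds_random_projection(D, k=2, oversample=10, n_power_iter=2, seed=0):
    """Approximate classical MDS coordinates from a distance matrix D.

    Hypothetical helper: a sketch of MDS with a dense random projection,
    not the authors' implementation.
    """
    rng = np.random.default_rng(seed)
    n = D.shape[0]

    # Double centering: B = -1/2 * J * D^2 * J, with J = I - (1/n) 1 1^T.
    D2 = D ** 2
    row_mean = D2.mean(axis=1, keepdims=True)
    col_mean = D2.mean(axis=0, keepdims=True)
    B = -0.5 * (D2 - row_mean - col_mean + D2.mean())

    # Dense Gaussian random projection onto k + oversample directions.
    ell = min(n, k + oversample)
    omega = rng.standard_normal((n, ell))
    Q, _ = np.linalg.qr(B @ omega)

    # A few power iterations sharpen the captured subspace.
    for _ in range(n_power_iter):
        Q, _ = np.linalg.qr(B @ Q)

    # Eigendecomposition of the small (ell x ell) projected matrix.
    S = Q.T @ B @ Q
    S = 0.5 * (S + S.T)  # enforce symmetry against round-off
    w, V = np.linalg.eigh(S)

    # Keep the k largest eigenvalues and lift eigenvectors back to R^n.
    idx = np.argsort(w)[::-1][:k]
    w_top = np.clip(w[idx], 0.0, None)
    U = Q @ V[:, idx]
    return U * np.sqrt(w_top)


if __name__ == "__main__":
    # Toy check on synthetic 2-D points (not sequence data).
    rng = np.random.default_rng(1)
    X = rng.standard_normal((200, 2))
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    Y = mds_random_projection(D, k=2)
    D_hat = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    print("max distance error:", np.abs(D - D_hat).max())
```

In this sketch the only eigendecomposition is on an $\ell \times \ell$ matrix with $\ell = k + \text{oversample}$, so the dominant cost is the dense products with the $n \times n$ matrix, which are quadratic in $n$ and not cubic. The distance matrix here is Euclidean by construction; distances between reads derived from alignments need not be, in which case negative eigenvalues can appear and are clipped to zero above.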
