Robust and Scalable Learning of Complex Intrinsic Dataset Geometry via ElPiGraph

Large datasets represented as multidimensional point clouds often have non-trivial distributions with branching trajectories and excluded regions; recent single-cell transcriptomic studies of developing embryos are notable examples. Reducing the complexity of such data and producing compact, interpretable representations of it remains challenging. Most existing computational methods rely on exploring local neighbourhood relations between data points, a step that can perform poorly on multidimensional, noisy data. Here we present ElPiGraph, a scalable and robust method for approximating datasets with complex structures that does not require computing the complete data distance matrix or the data point neighbourhood graph. The method withstands high levels of noise and can approximate complex topologies via principal graph ensembles that can be combined into a consensus principal graph. ElPiGraph deals efficiently with large and complex datasets in various fields, from biology, where it can be used to infer gene dynamics from single-cell RNA-Seq data, to astronomy, where it can be used to explore complex structures in the distribution of galaxies.
