Robust and scalable learning of data manifolds with complex topologies via ElPiGraph

We present ElPiGraph, a method for approximating data distributions that have non-trivial topological features, such as excluded regions or branching structures. Unlike many existing methods, ElPiGraph is not based on the construction of a k-nearest neighbour graph, a procedure that can perform poorly on multidimensional and noisy data. Instead, ElPiGraph constructs elastic principal graphs in a more robust way by minimizing elastic energy, applying graph grammars and explicitly controlling topological complexity. Using a trimmed approximation error function makes ElPiGraph extremely robust to background noise without decreasing computational performance, and allows it to deal with complex cases of manifold learning (for example, ElPiGraph can learn disconnected intersecting manifolds). Thanks to the quasi-quadratic nature of the elastic function, ElPiGraph performs almost as fast as simple k-means clustering. It is therefore much more scalable than alternative methods and can work on large datasets containing millions of high-dimensional points on a personal computer. This performance makes it possible to apply resampling and to approximate complex data structures via principal graph ensembles, which can be used to construct consensus principal graphs. ElPiGraph is currently implemented in five programming languages and accompanied by a graphical user interface, which makes it a versatile tool for dealing with complex data in fields ranging from molecular biology, where it can be used to infer pseudo-time trajectories from single-cell RNA-Seq data, to astronomy, where it can be used to approximate complex structures in the distribution of galaxies.
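The role of the trimmed approximation error can be illustrated with a minimal sketch. It assumes one common form of trimming, in which each point's squared distance to its nearest graph node is capped at the square of a trimming radius, so that distant background points contribute only a bounded amount. The function name `trimmed_approximation_error` and the parameter `trimming_radius` are hypothetical and are not taken from any ElPiGraph implementation; the elastic (stretching and bending) terms are omitted.

```python
import numpy as np


def trimmed_approximation_error(points, nodes, trimming_radius):
    """Mean squared distance from each point to its nearest node,
    capped at trimming_radius**2 (illustrative assumption, not the ElPiGraph API)."""
    # Pairwise squared Euclidean distances, shape (n_points, n_nodes).
    diffs = points[:, None, :] - nodes[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    # Squared distance from each point to its closest node.
    nearest = sq_dists.min(axis=1)
    # Trimming: a far-away point contributes at most trimming_radius**2.
    return float(np.mean(np.minimum(nearest, trimming_radius ** 2)))


# Two points lying on the nodes plus one distant "background noise" point.
points = np.array([[0.0, 0.0], [1.0, 0.0], [100.0, 100.0]])
nodes = np.array([[0.0, 0.0], [1.0, 0.0]])

trimmed = trimmed_approximation_error(points, nodes, trimming_radius=2.0)
untrimmed = trimmed_approximation_error(points, nodes, trimming_radius=np.inf)
print(trimmed, untrimmed)
```

With trimming, the single outlier adds only the capped value to the error, whereas the untrimmed error is dominated by it. Under this assumption, that bounded contribution is what keeps node positions from being pulled towards background noise.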
