MarkovHC: Markov hierarchical clustering for the topological structure of high-dimensional single-cell omics data

Distinguishing cell types and cell states is one of the fundamental questions in single-cell studies. Exploring the lineage relations among cells and finding the paths and critical points of cell fate transitions are equally important. Existing unsupervised clustering and lineage trajectory reconstruction methods often struggle with clustering data of arbitrary shape, tracking precise trajectories, and identifying critical points. Adaptive landscape approaches¹⁻³, which construct a pseudo-energy landscape of the dynamical system, may be used to address these problems. We therefore propose the Markov hierarchical clustering algorithm (MarkovHC), which reconstructs a multi-scale pseudo-energy landscape by exploiting the underlying metastability structure of an exponentially perturbed Markov chain⁴. The Markov process describes the random walk of a hypothetical traveling cell over possible gene expression states in the corresponding pseudo-energy landscape. Technically, MarkovHC integrates cell classification, trajectory reconstruction, and critical point identification in a single theoretical framework consistent with topological data analysis (TDA)⁵. In addition to algorithm development and simulation tests, we applied MarkovHC to diverse types of real biological data: single-cell RNA-Seq data, cytometry data, and single-cell ATAC-Seq data. Remarkably, when applied to single-cell RNA-Seq data of human ESC-derived progenitor cells⁶, MarkovHC not only identified known cell types but also discovered new cell types and stages. When MarkovHC was used to analyze single-cell RNA-Seq data of human preimplantation embryos in early development⁷, the hierarchical structure of the lineage trajectories was faithfully reconstructed. Furthermore, MarkovHC identified critical points representing important stage transitions in early gastric cancer data⁸.
In summary, these results demonstrate that MarkovHC is a powerful tool, based on rigorous metastability theory, for exploring hierarchical structures in biological data, identifying cell sub-populations (basins) and critical points (stage transitions), and tracking lineage trajectories (differentiation paths).

Highlights

MarkovHC explores the topological hierarchy in high-dimensional data.

MarkovHC can find clusters (basins) and cores of clusters (attractors) at different scales.

Transition paths (the trajectories of state transitions) and critical points among clusters can be tracked.

MarkovHC can be applied to diverse types of single-cell omics data.
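The basin and attractor vocabulary used above can be illustrated with a small toy sketch. This is not the MarkovHC algorithm: it only shows, under simplifying assumptions, how a pseudo-energy landscape partitions points into basins that each drain to one attractor. The function name `toy_basins`, the parameters `k` and `bandwidth`, the Gaussian kernel density used as pseudo-energy, and the greedy descent rule (standing in for the zero-noise limit of a random walk) are all assumptions made for illustration.

```python
import math


def toy_basins(points, k=3, bandwidth=1.0):
    """Toy illustration only (not MarkovHC): label each point by the
    local energy minimum (attractor) reached by greedy descent."""
    n = len(points)
    # Pairwise Euclidean distances between all points.
    dist = [[math.dist(p, q) for q in points] for p in points]
    # Assumed pseudo-energy: negative log of a Gaussian kernel density estimate,
    # so dense regions have low energy.
    energy = [
        -math.log(sum(math.exp(-((d / bandwidth) ** 2)) for d in row))
        for row in dist
    ]
    # k nearest neighbours of each point, excluding the point itself.
    neighbours = [
        [j for j in sorted(range(n), key=lambda j: dist[i][j]) if j != i][:k]
        for i in range(n)
    ]

    def step(i):
        # Move to the lowest-energy neighbour only if it is strictly lower.
        j = min(neighbours[i], key=lambda m: energy[m])
        return j if energy[j] < energy[i] else i

    labels = []
    for i in range(n):
        # Energy strictly decreases at every move, so this loop terminates.
        while step(i) != i:
            i = step(i)
        labels.append(i)  # index of the attractor this point drains to
    return labels, energy


# Two well-separated groups of five points each.
group_a = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05)]
group_b = [(x + 5.0, y + 5.0) for x, y in group_a]
labels, energy = toy_basins(group_a + group_b)
print(labels)
```

In this toy, points that share a label form one basin, and the label is the index of that basin's attractor. A hierarchical method would additionally merge basins across scales, which this sketch does not attempt.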

[1]  Hannah A. Pliner,et al.  Reversed graph embedding resolves complex single-cell trajectories , 2017, Nature Methods.

[2]  Andrew J. Hill,et al.  The single cell transcriptional landscape of mammalian organogenesis , 2019, Nature.

[3]  Erik Sundström,et al.  RNA velocity of single cells , 2018, Nature.

[4]  Fabian J Theis,et al.  SCANPY: large-scale single-cell gene expression data analysis , 2018, Genome Biology.

[5]  Fabian J Theis,et al.  Generalizing RNA velocity to transient cell states through dynamical modeling , 2019, Nature Biotechnology.

[6]  Yong Wang,et al.  DC3 is a method for deconvolution and coupled clustering from bulk and single-cell genomics data , 2019, Nature Communications.

[7]  N. Hacohen,et al.  Single-cell RNA-seq reveals new types of human blood dendritic cells, monocytes, and progenitors , 2017, Science.

[8]  David van Dijk,et al.  Visualizing Structure and Transitions for Biological Data Exploration , 2017, bioRxiv.

[9]  A. Raftery,et al.  Model-based Gaussian and non-Gaussian clustering , 1993 .

[10]  Liang Ma,et al.  DensityPath: an algorithm to visualize and reconstruct cell state-transition path on density landscape for single-cell RNA sequencing data , 2018, Bioinform..

[11]  Sean C. Bendall,et al.  Data-Driven Phenotypic Dissection of AML Reveals Progenitor-like Cells that Correlate with Prognosis , 2015, Cell.

[12]  L. Hood,et al.  Cancer as robust intrinsic state of endogenous molecular-cellular network shaped by evolution. , 2008, Medical hypotheses.

[13]  Hans Clevers,et al.  OLFM4 is a robust marker for stem cells in human intestine and marks a subset of colorectal cancer cells. , 2009, Gastroenterology.

[14]  Steve Oudot,et al.  Two-Tier Mapper, an unbiased topology-based clustering method for enhanced global gene expression analysis , 2019, Bioinform..

[15]  Vipin Kumar,et al.  Finding Clusters of Different Sizes, Shapes, and Densities in Noisy, High Dimensional Data , 2003, SDM.

[16]  Alvaro Plaza Reyes,et al.  Single-Cell RNA-Seq Reveals Lineage and X Chromosome Dynamics in Human Preimplantation Embryos , 2016, Cell.

[17]  C-L Wang,et al.  Long non-coding RNA NEAT1 promotes viability and migration of gastric cancer cell lines through up-regulation of microRNA-17. , 2018, European review for medical and pharmacological sciences.

[18]  Edsger W. Dijkstra,et al.  A note on two problems in connexion with graphs , 1959, Numerische Mathematik.

[19]  Robert R. Sokal,et al.  A statistical method for evaluating systematic relationships , 1958 .

[20]  R. Prim Shortest connection networks and some generalizations , 1957 .

[21]  Gregory W. Schwartz,et al.  TooManyCells identifies and visualizes relationships of single-cell clades , 2020, Nature Methods.

[22]  S. Dongen Graph clustering by flow simulation , 2000 .

[23]  Stephen Fox,et al.  Role of p53 in the progression of gastric cancer , 2014, Oncotarget.

[24]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[25]  Cole Trapnell,et al.  The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells , 2014, Nature Biotechnology.

[26]  Sean C. Bendall,et al.  Extracting a Cellular Hierarchy from High-dimensional Cytometry Data with SPADE , 2011, Nature Biotechnology.

[27]  Shoji Natsugoe,et al.  Role of cyclin E and p53 expression in progression of early gastric cancer , 1998, Gastric Cancer.

[28]  Srinivasa R. S. Varadhan,et al.  Asymptotic probabilities and differential equations , 1966 .

[29]  F. Ginhoux,et al.  Mpath maps multi-branching single-cell trajectories revealing progenitor cell progression during development , 2016, Nature Communications.

[30]  Luke Zappia,et al.  Clustering trees: a visualization for evaluating clusterings at multiple resolutions , 2018, bioRxiv.

[31]  Gregory W. Schwartz,et al.  TooManyCells identifies and visualizes relationships of single-cell clades , 2019, Nature Methods.

[32]  Sean C. Bendall,et al.  Single-Cell Mass Cytometry of Differential Immune and Drug Responses Across a Human Hematopoietic Continuum , 2011, Science.

[33]  Guangchuang Yu,et al.  clusterProfiler: an R package for comparing biological themes among gene clusters. , 2012, Omics : a journal of integrative biology.

[34]  Garry P Nolan,et al.  Visualization and cellular hierarchy inference of single-cell data using SPADE , 2016, Nature Protocols.

[35]  C. Waddington,et al.  The strategy of the genes , 1957 .

[36]  D. Defays,et al.  An Efficient Algorithm for a Complete Link Method , 1977, Comput. J..

[37]  P. Ao,et al.  Laws in Darwinian evolutionary theory , 2005, q-bio/0605020.

[38]  A. Regev,et al.  Single-cell reconstruction of developmental trajectories during zebrafish embryogenesis , 2018, Science.

[39]  Evan Z. Macosko,et al.  Comprehensive Classification of Retinal Bipolar Neurons by Single-Cell Transcriptomics , 2016, Cell.

[40]  Jin Wang,et al.  Quantifying Cell Fate Decisions for Differentiation and Reprogramming of a Human Stem Cell Network: Landscape and Biological Paths , 2013, PLoS Comput. Biol..

[41]  Zhao Kang,et al.  Kernel-driven similarity learning , 2017, Neurocomputing.

[42]  Hannah H. Chang,et al.  Cell Fate Decision as High-Dimensional Critical State Transition , 2016, bioRxiv.

[43]  P. Rigollet,et al.  Optimal-Transport Analysis of Single-Cell Gene Expression Identifies Developmental Trajectories in Reprogramming , 2019, Cell.

[44]  Bo Wang,et al.  Visualization and analysis of single-cell RNA-seq data by kernel-based similarity learning , 2016, Nature Methods.

[45]  Yu Kang,et al.  Modeling stochastic phenotype switching and bet-hedging in bacteria: stochastic nonlinear dynamics and critical state identification , 2013, Quantitative Biology.

[46]  Kevin R. Moon,et al.  Recovering Gene Interactions from Single-Cell Data Using Data Diffusion , 2018, Cell.

[47]  Hans-Peter Kriegel,et al.  A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise , 1996, KDD.

[48]  Fabian J Theis,et al.  The Human Cell Atlas , 2017, bioRxiv.

[49]  A. M. Arias,et al.  Transition states and cell fate decisions in epigenetic landscapes , 2016, Nature Reviews Genetics.

[50]  Paul Hoffman,et al.  Integrating single-cell transcriptomic data across different conditions, technologies, and species , 2018, Nature Biotechnology.

[51]  Hongkai Ji,et al.  TSCAN: Pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis , 2016, Nucleic acids research.

[52]  P. Ao Global view of bionetwork dynamics: adaptive landscape. , 2009, Journal of genetics and genomics = Yi chuan xue bao.

[53]  Jean-Loup Guillaume,et al.  Fast unfolding of communities in large networks , 2008, 0803.0476.

[54]  Izhak Haviv,et al.  Distinctive patterns of gene expression in premalignant gastric mucosa and gastric cancer. , 2003, Cancer research.

[55]  Joshua W. K. Ho,et al.  CIDR: Ultrafast and accurate clustering through imputation for single-cell RNA-seq data , 2016, Genome Biology.

[56]  Jonathan S. Packer,et al.  A lineage-resolved molecular atlas of C. elegans embryogenesis at single-cell resolution , 2019, Science.

[57]  Hans-Peter Kriegel,et al.  OPTICS: ordering points to identify the clustering structure , 1999, SIGMOD '99.

[58]  Arthur Zimek,et al.  Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection , 2015, ACM Trans. Knowl. Discov. Data.

[59]  Sean C. Bendall,et al.  Single-Cell Trajectory Detection Uncovers Progression and Regulatory Coordination in Human B Cell Development , 2014, Cell.

[60]  Sean C. Bendall,et al.  Wishbone identifies bifurcating developmental trajectories from single-cell data , 2016, Nature Biotechnology.

[61]  Hui Wang,et al.  SINCERA: A Pipeline for Single-Cell RNA-Seq Profiling Analysis , 2015, PLoS Comput. Biol..

[62]  I. Pinchuk,et al.  Effect of Helicobacter pylori on gastric epithelial cells. , 2014, World journal of gastroenterology.

[63]  R. Satija,et al.  Single-cell RNA sequencing to explore immune cell heterogeneity , 2017, Nature Reviews Immunology.

[64]  Xiaohong Xu,et al.  MiR-596 down regulates SOX4 expression and is a potential novel biomarker for gastric cancer , 2020, Translational cancer research.

[66]  Chen Dayue,et al.  Metastability of exponentially perturbed Markov chains , 1996 .

[67]  Fabian J Theis,et al.  PAGA: graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells , 2019, Genome Biology.

[68]  Fabian J. Theis,et al.  Diffusion maps for high-dimensional single-cell analysis of differentiation data , 2015, Bioinform..

[69]  Mirjana Efremova,et al.  CellPhoneDB: inferring cell–cell communication from combined expression of multi-subunit ligand–receptor complexes , 2020, Nature Protocols.

[70]  A. Oshlack,et al.  Splatter: simulation of single-cell RNA sequencing data , 2017, Genome Biology.

[71]  M. Hemberg,et al.  Identifying cell populations with scRNASeq. , 2017, Molecular aspects of medicine.

[72]  M. Hemberg,et al.  Challenges in unsupervised clustering of single-cell RNA-seq data , 2019, Nature Reviews Genetics.

[73]  Cesar H. Comin,et al.  Clustering algorithms: A comparative approach , 2016, PloS one.

[74]  Robin Sibson,et al.  SLINK: An Optimally Efficient Algorithm for the Single-Link Cluster Method , 1973, Comput. J..

[75]  Peng Zhang,et al.  Dissecting the Single-Cell Transcriptome Network Underlying Gastric Premalignant Lesions and Early Gastric Cancer. , 2019, Cell reports.

[76]  Hong Qian,et al.  Processes on the emergent landscapes of biochemical reaction networks and heterogeneous cell population dynamics: differentiation in living matters , 2017, Journal of The Royal Society Interface.

[77]  Hongkai Ji,et al.  Pseudotime Reconstruction Using TSCAN. , 2019, Methods in molecular biology.

[78]  H. Qian Cycle kinetics, steady state thermodynamics and motors—a paradigm for living matter physics , 2005, Journal of physics. Condensed matter : an Institute of Physics journal.

[79]  L. Wasserman Topological Data Analysis , 2016, 1609.08227.

[80]  C. Waddington,et al.  Principles of development and differentiation , 1956 .

[81]  Yiguang Hong,et al.  Unsupervised topological alignment for single-cell multi-omics integration , 2020, bioRxiv.

[82]  Fabian J Theis,et al.  Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells , 2015, Nature Biotechnology.

[83]  Fabian J Theis,et al.  Diffusion pseudotime robustly reconstructs lineage branching , 2016, Nature Methods.

[84]  Carlos Alcocer-Cuarón,et al.  Hierarchical structure of biological systems , 2014, Bioengineered.

[85]  R. Stewart,et al.  Single-cell RNA-seq reveals novel regulators of human embryonic stem cell differentiation to definitive endoderm , 2016, Genome Biology.

[86]  Christopher Yau,et al.  pcaReduce: hierarchical clustering of single cell transcriptional profiles , 2015, BMC Bioinformatics.

[87]  M. Schaub,et al.  SC3 - consensus clustering of single-cell RNA-Seq data , 2016, Nature Methods.

[88]  Yiguang Hong,et al.  Unsupervised topological alignment for single-cell multi-omics integration , 2020, Bioinformatics.

[89]  Jian Cheng,et al.  Role of cyclooxygenase-2 in gastric cancer development and progression. , 2013, World journal of gastroenterology.

[90]  Gabriel S. Eichler,et al.  Cell fates as high-dimensional attractor states of a complex gene regulatory network. , 2005, Physical review letters.

[92]  Enrico Guarnera,et al.  Exploring chromatin hierarchical organization via Markov State Modelling , 2018, PLoS Comput. Biol..

[93]  Xiangkai Li,et al.  Advances in Understanding How Heavy Metal Pollution Triggers Gastric Cancer , 2016, BioMed research international.

[94]  Jianfang Li,et al.  CEACAM6 Promotes Gastric Cancer Invasion and Metastasis by Inducing Epithelial-Mesenchymal Transition via PI3K/AKT Signaling Pathway , 2014, PloS one.