densityCut: an efficient and versatile topological approach for automatic clustering of biological data

Motivation: Many biological data processing problems can be formalized as clustering problems to partition data points into sensible and biologically interpretable groups. Results: This article introduces densityCut, a novel density-based clustering algorithm, which is both time- and space-efficient and proceeds as follows: densityCut first roughly estimates the densities of data points from a K-nearest neighbour graph and then refines the densities via a random walk. A cluster consists of points falling into the basin of attraction of an estimated mode of the underlining density function. A post-processing step merges clusters and generates a hierarchical cluster tree. The number of clusters is selected from the most stable clustering in the hierarchical cluster tree. Experimental results on ten synthetic benchmark datasets and two microarray gene expression datasets demonstrate that densityCut performs better than state-of-the-art algorithms for clustering biological datasets. For applications, we focus on the recent cancer mutation clustering and single cell data analyses, namely to cluster variant allele frequencies of somatic mutations to reveal clonal architectures of individual tumours, to cluster single-cell gene expression data to uncover cell population compositions, and to cluster single-cell mass cytometry data to detect communities of cells of the same functional states or types. densityCut performs better than competing algorithms and is scalable to large datasets. Availability and Implementation: Data and the densityCut R package is available from https://bitbucket.org/jerry00/densitycut_dev. Contact: condon@cs.ubc.ca or sshah@bccrc.ca or jiaruid@cs.ubc.ca Supplementary information: Supplementary data are available at Bioinformatics online.

[1]  Anil K. Jain Data Clustering: User's Dilemma , 2007, MLDM.

[2]  William W. Cohen,et al.  Power Iteration Clustering , 2010, ICML.

[3]  Aristides Gionis,et al.  Clustering aggregation , 2005, 21st International Conference on Data Engineering (ICDE'05).

[4]  Xiaogang Wang,et al.  A roadmap of clustering algorithms: finding a match for a biomedical application , 2008, Briefings Bioinform..

[5]  Alexander Schliep,et al.  Clustering cancer gene expression data: a comparative study , 2008, BMC Bioinformatics.

[6]  Sean C. Bendall,et al.  Data-Driven Phenotypic Dissection of AML Reveals Progenitor-like Cells that Correlate with Prognosis , 2015, Cell.

[7]  Enric Llorens-Bobadilla,et al.  Single-Cell Transcriptomics Reveals a Population of Dormant Neural Stem Cells that Become Activated upon Brain Injury. , 2015, Cell stem cell.

[8]  Ulrike von Luxburg,et al.  A tutorial on spectral clustering , 2007, Stat. Comput..

[9]  A. Bouchard-Côté,et al.  PyClone: statistical inference of clonal population structure in cancer , 2014, Nature Methods.

[10]  Joshua F. McMichael,et al.  Clonal evolution in relapsed acute myeloid leukemia revealed by whole genome sequencing , 2011, Nature.

[11]  Sean Hughes,et al.  Clustering by Fast Search and Find of Density Peaks , 2016 .

[12]  Sanjoy Dasgupta,et al.  Rates of convergence for the cluster tree , 2010, NIPS.

[13]  Charles T. Zahn,et al.  Graph-Theoretical Methods for Detecting and Describing Gestalt Clusters , 1971, IEEE Transactions on Computers.

[14]  Liu Xianming,et al.  A Time Petri Net Extended with Price Information , 2007 .

[15]  Dorin Comaniciu,et al.  Mean Shift: A Robust Approach Toward Feature Space Analysis , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[16]  Pasi Fränti,et al.  Iterative shrinking method for clustering problems , 2006, Pattern Recognit..

[17]  John A. Hartigan,et al.  Clustering Algorithms , 1975 .

[18]  Sean C. Bendall,et al.  Single-Cell Mass Cytometry of Differential Immune and Drug Responses Across a Human Hematopoietic Continuum , 2011, Science.

[19]  Geoffrey E. Hinton,et al.  Visualizing Data using t-SNE , 2008 .

[20]  Joshua F. McMichael,et al.  Optimizing cancer genome sequencing and analysis. , 2015, Cell systems.

[21]  Keinosuke Fukunaga,et al.  A Graph-Theoretic Approach to Nonparametric Cluster Analysis , 1976, IEEE Transactions on Computers.

[22]  Ming Zhang,et al.  Comparing sequences without using alignments: application to HIV/SIV subtyping , 2007, BMC Bioinformatics.

[23]  Li Ding,et al.  Clonal Architectures and Driver Mutations in Metastatic Melanomas , 2014, PloS one.

[24]  Cor J. Veenman,et al.  A Maximum Variance Cluster Algorithm , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[25]  Yizong Cheng,et al.  Mean Shift, Mode Seeking, and Clustering , 1995, IEEE Trans. Pattern Anal. Mach. Intell..

[26]  Dit-Yan Yeung,et al.  Robust path-based spectral clustering , 2008, Pattern Recognit..

[27]  Geoffrey J. McLachlan,et al.  Mixtures of common t-factor analyzers for clustering high-dimensional microarray data , 2011, Bioinform..

[28]  Kurt Hornik,et al.  kernlab - An S4 Package for Kernel Methods in R , 2004 .

[29]  Silke Wagner,et al.  Comparing Clusterings - An Overview , 2007 .

[30]  Hans-Peter Kriegel,et al.  A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise , 1996, KDD.

[31]  Ana L. N. Fred,et al.  Robust data clustering , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..

[32]  Michael I. Jordan,et al.  On Spectral Clustering: Analysis and an algorithm , 2001, NIPS.

[33]  D. Mount ANN Programming Manual , 1998 .

[34]  Limin Fu,et al.  FLAME, a novel fuzzy clustering method for the analysis of DNA microarray data , 2007, BMC Bioinformatics.

[35]  Rajeev Motwani,et al.  The PageRank Citation Ranking : Bringing Order to the Web , 1999, WWW 1999.

[36]  Martin Ester,et al.  Density‐based clustering , 2019, WIREs Data Mining Knowl. Discov..

[37]  Jitendra Malik,et al.  Normalized cuts and image segmentation , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[38]  Hans-Peter Kriegel,et al.  OPTICS: ordering points to identify the clustering structure , 1999, SIGMOD '99.

[39]  Ulrike von Luxburg,et al.  Pruning nearest neighbor cluster trees , 2011, ICML.

[40]  Adrian E. Raftery,et al.  Model-based Methods of Classification: Using the mclust Software in Chemometrics , 2007 .

[41]  Jan Baumbach,et al.  Comparing the performance of biomedical clustering methods , 2015, Nature Methods.

[42]  Harald Binder,et al.  Allowing for mandatory covariates in boosting estimation of sparse high-dimensional survival models , 2008, BMC Bioinformatics.

[43]  Christopher A. Miller,et al.  Clonal Evolution Revealed by Whole Genome Sequencing in a Case of Primary Myelofibrosis Transformed to Secondary Acute Myeloid Leukemia , 2012, Leukemia.

[44]  Sanjoy Dasgupta,et al.  Optimal rates for k-NN density and mode estimation , 2014, NIPS.

[45]  W. Stuetzle,et al.  A Generalized Single Linkage Method for Estimating the Cluster Tree of a Density , 2010 .

[46]  M. Meilă Comparing clusterings---an information based distance , 2007 .

[47]  Chen Xu,et al.  Identification of cell types from single-cell transcriptomes using a novel clustering method , 2015, Bioinform..

[48]  Yi Zhang,et al.  Prognostic gene expression signatures can be measured in tissues collected in RNAlater preservative. , 2006, The Journal of molecular diagnostics : JMD.

[49]  Stefano Soatto,et al.  Quick Shift and Kernel Methods for Mode Seeking , 2008, ECCV.

[50]  Alex A. Pollen,et al.  Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex , 2014, Nature Biotechnology.

[51]  Obi L. Griffith,et al.  SciClone: Inferring Clonal Architecture and Tracking the Spatial and Temporal Patterns of Tumor Evolution , 2014, PLoS Comput. Biol..

[52]  Giovanna Menardi,et al.  An advancement in clustering via nonparametric density estimation , 2014, Stat. Comput..

[53]  Larry D. Hostetler,et al.  The estimation of the gradient of a density function, with applications in pattern recognition , 1975, IEEE Trans. Inf. Theory.

[54]  Dorothea Emig,et al.  Partitioning biological data with transitivity clustering , 2010, Nature Methods.

[55]  M. Cugmas,et al.  On comparing partitions , 2015 .