Fast and memory-efficient scRNA-seq k-means clustering with various distances

Single-cell RNA-sequencing (scRNA-seq) analyses typically begin by clustering a gene-by-cell expression matrix to empirically define groups of cells with similar expression profiles. We describe new methods and a new open source library, minicore, for efficient k-means++ center finding and k-means clustering of scRNA-seq data. Minicore works with sparse count data, as produced by typical scRNA-seq experiments, as well as with dense data obtained after dimensionality reduction. Minicore's novel vectorized weighted reservoir sampling algorithm allows it to find initial k-means++ centers for a 4-million-cell dataset in 1.5 minutes using 20 threads. Minicore can cluster using Euclidean distance, and it also supports a wider class of measures, such as the Jensen-Shannon divergence, the Kullback-Leibler divergence, and the Bhattacharyya distance, which can be applied directly to count data and probability distributions. Further, minicore produces lower-cost centerings more efficiently than scikit-learn for scRNA-seq datasets with millions of cells. With careful handling of priors, minicore implements these distance measures with only minor (<2-fold) speed differences among them. We show that a minicore pipeline consisting of k-means++, localsearch++ and mini-batch k-means can cluster a 4-million-cell dataset in minutes, using less than 10 GiB of RAM. This memory efficiency enables atlas-scale clustering on laptops and other commodity hardware. Finally, we report findings on which distance measures give clusterings that are most consistent with known cell type labels. Availability: The open source library is at https://github.com/dnbaker/minicore. Code used for experiments is at https://github.com/dnbaker/minicore-experiments.
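To make the ideas in the abstract concrete, the following is a minimal pure-Python sketch of k-means++ seeding driven by a single-pass weighted sampler, with a pluggable dissimilarity. It is an illustration only, not minicore's API or its vectorized implementation. All function names here (`normalize`, `kl_divergence`, `jsd`, `weighted_pick`, `kmeanspp`) are hypothetical. The pseudocount `prior` is an assumed stand-in for the prior handling the abstract mentions, and the sampler uses the standard Efraimidis-Spirakis key scheme rather than the paper's own algorithm.

```python
import math
import random


def normalize(counts, prior=1.0):
    """Convert a count vector to probabilities, adding a pseudocount to every
    entry so that no probability is zero (an assumed, simple form of prior)."""
    total = sum(counts) + prior * len(counts)
    return [(c + prior) / total for c in counts]


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for strictly positive vectors."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def jsd(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def weighted_pick(weights, rng):
    """Pick one index with probability proportional to its weight in a single
    pass (a size-1 weighted reservoir). Each item draws the key log(u) / w
    and the largest key wins. Returns None if no weight is positive."""
    best_idx, best_key = None, -math.inf
    for i, w in enumerate(weights):
        if w <= 0:
            continue
        u = 1.0 - rng.random()  # in (0, 1], so log(u) is always defined
        key = math.log(u) / w
        if key > best_key:
            best_idx, best_key = i, key
    return best_idx


def kmeanspp(points, k, dist, seed=0):
    """k-means++ style seeding: the first center is uniform at random, and each
    later center is sampled proportionally to its cost under `dist`.
    Returns (indices of chosen centers, total cost of the seeding)."""
    rng = random.Random(seed)
    centers = [rng.randrange(len(points))]
    costs = [dist(p, points[centers[0]]) for p in points]
    while len(centers) < k:
        idx = weighted_pick(costs, rng)
        if idx is None:  # every point already coincides with a center
            break
        centers.append(idx)
        costs = [min(c, dist(p, points[idx])) for c, p in zip(costs, points)]
    return centers, sum(costs)


# Toy usage: four "cells" over three "genes", seeded with k = 2 under JSD.
cells = [normalize(c) for c in ([9, 1, 0], [8, 2, 0], [0, 1, 9], [0, 2, 8])]
seed_indices, seeding_cost = kmeanspp(cells, k=2, dist=jsd)
```

Because the dissimilarity is passed in as a function, swapping `jsd` for `kl_divergence` or a squared Euclidean distance changes only the `dist` argument. The single-pass sampler is the part a production implementation would vectorize and parallelize.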
