Paper information - Fast and memory-efficient scRNA-seq k-means clustering with various distances

Single-cell RNA-sequencing (scRNA-seq) analyses typically begin by clustering a gene-by-cell expression matrix to empirically define groups of cells with similar expression profiles. We describe new methods and a new open source library, minicore, for efficient k-means++ center finding and k-means clustering of scRNA-seq data. Minicore works with sparse count data, as produced by typical scRNA-seq experiments, as well as with dense data produced by dimensionality reduction. Minicore's novel vectorized weighted reservoir sampling algorithm allows it to find initial k-means++ centers for a 4-million-cell dataset in 1.5 minutes using 20 threads. Minicore can cluster using Euclidean distance, and it also supports a wider class of measures, such as Jensen-Shannon divergence, Kullback-Leibler divergence, and the Bhattacharyya distance, which can be applied directly to count data and probability distributions. Further, minicore produces lower-cost centerings more efficiently than scikit-learn for scRNA-seq datasets with millions of cells. With careful handling of priors, minicore implements these distance measures with only minor (<2-fold) speed differences among them. We show that a minicore pipeline consisting of k-means++, localsearch++, and mini-batch k-means can cluster a 4-million-cell dataset in minutes, using less than 10 GiB of RAM. This memory efficiency enables atlas-scale clustering on laptops and other commodity hardware. Finally, we report findings on which distance measures give clusterings that are most consistent with known cell type labels. Availability: The open source library is at https://github.com/dnbaker/minicore. Code used for experiments is at https://github.com/dnbaker/minicore-experiments.
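To make the seeding step described in the abstract concrete, the following is a minimal NumPy sketch of k-means++ center finding in which each new center is drawn by a one-pass weighted sample. It is not minicore's API or its vectorized implementation. The function names (`sq_euclidean`, `jsd`, `weighted_pick`, `kmeanspp_seed`) and the pseudocount value `prior=1e-2` are assumptions made for illustration. The sampler uses the exponential-key form of weighted reservoir sampling for a single item: each point gets the key Exp(1)/w, and the point with the smallest key is selected with probability proportional to its weight w.

```python
import numpy as np

def sq_euclidean(X, c):
    """Squared Euclidean distance from each row of X to the center c."""
    diff = X - c
    return np.einsum("ij,ij->i", diff, diff)

def jsd(X, c, prior=1e-2):
    """Jensen-Shannon divergence between each row of X and c.

    A small pseudocount (prior, an assumed value) is added before
    normalizing to probabilities, so every logarithm is finite.
    """
    P = X + prior
    P = P / P.sum(axis=1, keepdims=True)
    q = c + prior
    q = q / q.sum()
    M = 0.5 * (P + q)
    kl_pm = np.sum(P * np.log(P / M), axis=1)
    kl_qm = np.sum(q * np.log(q / M), axis=1)
    # Clip tiny negative values caused by floating-point rounding.
    return np.maximum(0.5 * (kl_pm + kl_qm), 0.0)

def weighted_pick(weights, rng):
    """Draw one index with probability proportional to weights.

    Each item receives the key Exp(1) / w; the minimum key wins.
    Zero-weight items receive an infinite key and are never chosen
    while any positive weight exists.
    """
    keys = rng.exponential(size=weights.shape[0])
    keys = np.divide(keys, weights,
                     out=np.full_like(keys, np.inf),
                     where=weights > 0)
    return int(np.argmin(keys))

def kmeanspp_seed(X, k, dist=sq_euclidean, seed=0):
    """Return k center indices and each point's cost to its nearest center."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centers = [int(rng.integers(n))]      # first center: uniform at random
    cost = dist(X, X[centers[0]])         # cost to the nearest chosen center
    for _ in range(1, k):
        idx = weighted_pick(cost, rng)    # sample proportionally to cost
        centers.append(idx)
        cost = np.minimum(cost, dist(X, X[idx]))
    return np.array(centers), cost
```

A call such as `kmeanspp_seed(counts, 10, dist=jsd)` on a dense cells-by-genes array would return ten row indices to use as initial centers. Passing the distance as a function is a design choice of this sketch, made so that the same seeding loop can be tried with more than one of the measures named above.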
