Shrinkage Clustering: a fast and size-constrained clustering algorithm for biomedical applications

Motivation: Many common clustering algorithms require a two-step process that limits their efficiency. They must be run repeatedly and paired with a model selection criterion in order to determine both the number of clusters in the data and the corresponding cluster memberships. As biomedical datasets grow in size and prevalence, there is a growing need for methods that are easier to implement and more computationally efficient. In addition, it is often essential to obtain clusters of sufficient sample size for the clustering result to be meaningful and interpretable in subsequent analysis.

Results: We introduce Shrinkage Clustering, a novel clustering algorithm based on matrix factorization that finds the optimal number of clusters while partitioning the data. We report its performance across multiple simulated and real datasets, and demonstrate its accuracy and speed in subtyping cancer and brain tissues. In addition, the algorithm offers a straightforward solution to clustering with cluster size constraints. Given its ease of implementation, computational efficiency and extensible structure, we believe Shrinkage Clustering can be applied broadly to biomedical clustering tasks, especially those involving large datasets.
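To make the two-step process described under Motivation concrete, the following is a minimal Python sketch of that conventional baseline, not of Shrinkage Clustering itself. It assumes k-means as the clustering algorithm and the mean silhouette width as the model selection criterion; both are illustrative choices, and the helper names (`farthest_first`, `kmeans`, `mean_silhouette`, `select_k`) and the toy data are hypothetical. Random restarts, which would add further repetition in practice, are replaced by a deterministic initialization to keep the sketch reproducible.

```python
import math


def farthest_first(points, k):
    """Deterministic init: greedily pick points far from the chosen centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    return centers


def kmeans(points, k, iters=100):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centers = farthest_first(points, k)
    labels = None
    for _ in range(iters):
        # Assignment step: nearest center by Euclidean distance.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


def mean_silhouette(points, labels):
    """Mean silhouette width; singleton clusters contribute 0."""
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        return -1.0
    total = 0.0
    for i, p in enumerate(points):
        mean_dist = {}
        for c in clusters:
            others = [q for j, q in enumerate(points) if labels[j] == c and j != i]
            if others:
                mean_dist[c] = sum(math.dist(p, q) for q in others) / len(others)
        if labels[i] not in mean_dist:
            continue  # singleton cluster
        a = mean_dist[labels[i]]
        b = min(d for c, d in mean_dist.items() if c != labels[i])
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def select_k(points, candidates):
    """Step 1: cluster once per candidate k. Step 2: keep the best-scoring run."""
    best = None
    for k in candidates:
        labels = kmeans(points, k)
        score = mean_silhouette(points, labels)
        if best is None or score > best[1]:
            best = (len(set(labels)), score, labels)
    return best


# Toy data (assumed for illustration): three tight groups of five points each.
offsets = [(-0.5, -0.5), (-0.5, 0.5), (0.5, -0.5), (0.5, 0.5), (0.0, 0.0)]
group_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + dx, cy + dy) for cx, cy in group_centers for dx, dy in offsets]

best_k, best_score, labels = select_k(points, range(2, 6))
print(best_k, round(best_score, 3))
```

Every candidate number of clusters costs one full clustering run plus one scoring pass, and the silhouette computation alone is quadratic in the number of samples. That repeated work is the overhead a method that chooses the number of clusters during partitioning aims to avoid.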
