progenyClust: an R package for Progeny Clustering

Identifying the optimal number of clusters is a common problem for data scientists in both research and industry. Although many clustering evaluation techniques have been developed to address it, the recently developed Progeny Clustering algorithm is a much faster alternative and one that is relevant to biomedical applications. In this paper, we introduce progenyClust, an R package that implements and extends the original Progeny Clustering algorithm for evaluating clustering stability and identifying the optimal cluster number. We illustrate its applicability with two examples: a simulated test dataset as a proof of concept, and a cell imaging dataset that demonstrates its potential in biomedical research. The progenyClust package is versatile, offering considerable flexibility in the choice of methods and the tuning of parameters. In addition, its default parameter settings, together with its plot and summary methods, make applying Progeny Clustering straightforward and coherent.
