An Evaluative Measure of Clustering Methods Incorporating Hyperparameter Sensitivity

Clustering algorithms are often evaluated using metrics that compare their output against ground-truth cluster assignments, such as the Rand index and normalized mutual information (NMI). Algorithm performance can vary widely across hyperparameter settings, however, so model selection based on the optimal value of these metrics is at odds with how the algorithms are applied in practice, where labels are unavailable and tuning is often more art than science. It is therefore desirable to compare clustering algorithms not only on their optimally tuned performance, but also on some notion of how realistic it would be to obtain that performance in practice. We propose an evaluation of clustering methods that captures this ease of tuning by modeling the expected best clustering score under a given computation budget. To encourage adoption of the proposed metric alongside classic clustering evaluations, we provide an extensible benchmarking framework. We perform an extensive empirical evaluation of the proposed metric on popular clustering algorithms over a large collection of datasets from different domains, and find that it leads to several noteworthy observations.
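The idea of an "expected best clustering score under a given computation budget" can be illustrated with a minimal sketch. It is not the paper's actual estimator. It assumes the budget is a number of hyperparameter configurations drawn independently and uniformly (with replacement) from a pool of already-scored configurations, and it uses the standard empirical-CDF formula for the expected maximum of such draws. The function name `expected_best_score` and the example scores are hypothetical.

```python
def expected_best_score(scores, budget):
    """Expected maximum of `budget` i.i.d. draws (with replacement) from `scores`.

    Assumption (not from the paper): the budget is counted in trials, and each
    trial samples one previously scored hyperparameter configuration uniformly.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    if budget < 1:
        raise ValueError("budget must be at least 1")
    ordered = sorted(float(s) for s in scores)
    n = len(ordered)
    total = 0.0
    for i, score in enumerate(ordered, start=1):
        # P(max of `budget` draws is the i-th smallest score)
        # = P(all draws <= i-th) - P(all draws <= (i-1)-th)
        weight = (i / n) ** budget - ((i - 1) / n) ** budget
        total += score * weight
    return total


if __name__ == "__main__":
    # Hypothetical scores (e.g. adjusted Rand index) of four configurations.
    pool = [0.2, 0.4, 0.6, 0.8]
    for k in (1, 2, 8):
        print(k, round(expected_best_score(pool, k), 4))
```

Under these assumptions, a budget of 1 gives the mean score and the value approaches the best score in the pool as the budget grows, so an algorithm that scores well only for a few configurations needs a larger budget to approach its optimally tuned performance.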
