On comparing clusterings: an element-centric framework unifies overlaps and hierarchy

Clustering is one of the most universal approaches for understanding complex data. A pivotal aspect of clustering analysis is quantitatively comparing clusterings; clustering comparison is the basis for tasks such as clustering evaluation, consensus clustering, and tracking the temporal evolution of clusters. For example, the extrinsic evaluation of clustering methods requires comparing the uncovered clusterings to planted clusterings or known metadata. Yet, as we demonstrate, existing clustering comparison measures have critical biases that undermine their usefulness, and no measure accommodates both overlapping and hierarchical clusterings. Here we unify the comparison of disjoint, overlapping, and hierarchically structured clusterings by proposing a new element-centric framework: elements are compared based on the relationships induced by the cluster structure, as opposed to the traditional cluster-centric philosophy. We demonstrate that, in contrast to standard clustering similarity measures, our framework does not suffer from critical biases and naturally provides unique insights into how the clusterings differ. We illustrate the strengths of our framework by revealing new insights into the organization of clusters in two applications: the improved classification of schizophrenia based on the overlapping and hierarchical community structure of fMRI brain networks, and the disentanglement of various social homophily factors in Facebook social networks. The universality of clustering suggests a far-reaching impact of our framework throughout all areas of science.
