How Many Clusters? An Entropic Approach to Hierarchical Cluster Analysis

Clustering large, heterogeneous sets of social-media user profiles is difficult because choosing the optimal number of clusters matters more than it does for smaller, homogeneous data sets. We propose a new approach, based on the deformed Rényi entropy, for determining the optimal number of clusters in hierarchical clustering of user-profile data. Our results show that the approach makes it possible to estimate the Rényi entropy at each level of a hierarchical model and to locate the entropy minimum (the information maximum). They also show that the solutions with the fewest and the most clusters correspond to entropy maxima (information minima).
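To make the level-by-level scan concrete, here is a minimal sketch. It assumes the standard Rényi entropy of order q, S_q = ln(sum_i p_i^q) / (1 - q), applied to the share of items in each cluster. The abstract does not define the deformed variant, so this is not the authors' estimator and should not be expected to reproduce the entropy maxima at both extremes. The function names, the choice q = 2, and the cluster sizes are hypothetical and serve only as illustration.

```python
import math

def renyi_entropy(probs, q):
    """Standard Renyi entropy of order q (natural log) of a discrete distribution."""
    probs = [p for p in probs if p > 0]  # zero-probability terms contribute nothing
    if abs(q - 1.0) < 1e-12:
        # The q -> 1 limit of the Renyi entropy is the Shannon entropy.
        return -sum(p * math.log(p) for p in probs)
    return math.log(sum(p ** q for p in probs)) / (1.0 - q)

def entropy_by_level(levels, q=2.0):
    """Map each number of clusters to the entropy of its cluster-size shares."""
    result = {}
    for k, sizes in levels.items():
        total = sum(sizes)
        result[k] = renyi_entropy([s / total for s in sizes], q)
    return result

# Hypothetical cluster sizes at three cuts of a dendrogram (assumed data).
levels = {2: [50, 50], 3: [40, 30, 30], 4: [25, 25, 25, 25]}
entropies = entropy_by_level(levels)

# Pick the level with the lowest entropy, mirroring the "entropy minimum" criterion.
best_k = min(entropies, key=entropies.get)
```

In practice, the cluster sizes per level would come from cutting a fitted dendrogram at successive heights, and the plain entropy above would be replaced by the deformed entropy the paper defines.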
