A Hierarchical Clustering Algorithm for Characterizing Social Media Users

In this paper, we propose a method to characterize user behavior from engagement with enterprise social media. Content analysis is often hampered by noise, so we instead study behavior through temporal activity, i.e., the number of posts per month represented as a time series. User posting volume on social media is long-tailed, which causes time series clustering algorithms to produce unbalanced clusters containing either very few users or almost all users. We therefore propose a hierarchical time series clustering algorithm that groups users by behavioral homogeneity and provides interpretable characterizations of the resulting clusters. Users in distinct clusters differ significantly in their topics of interest, while users within a cluster are homophilic (near-identical or similar-minded). We evaluate the goodness of the clustering against existing clustering techniques on Enterprise Social Media (ESM), Stack Exchange, and Linux Kernel Mailing List (LKML) datasets.
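The abstract does not spell out the algorithm, so the following is only an illustrative sketch of one way a hierarchical scheme can counter unbalanced clusters on long-tailed posting data. It assumes a recursive bisecting 2-means on log-scaled monthly counts, re-splitting any cluster that still holds more than a chosen share of all users. The function names, the `max_share` and `min_size` parameters, the Euclidean distance, and the log scaling are all assumptions made for illustration, not the method evaluated in the paper.

```python
from math import dist, log1p


def two_means(series, iters=20):
    """Assign each series to one of two groups with a plain 2-means loop.

    Seeding is deterministic (an assumption of this sketch): the lowest- and
    highest-volume series serve as the initial centroids.
    """
    centroids = [list(min(series, key=sum)), list(max(series, key=sum))]
    labels = [0] * len(series)
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            0 if dist(s, centroids[0]) <= dist(s, centroids[1]) else 1
            for s in series
        ]
        # Update step: each centroid becomes the mean of its members.
        for k in (0, 1):
            members = [s for s, lab in zip(series, labels) if lab == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def hierarchical_split(users, max_share=0.5, min_size=2):
    """Recursively bisect clusters that hold too large a share of users.

    users: dict mapping a user id to a list of monthly post counts
           (all lists must have the same length).
    Returns a list of clusters, each a list of user ids.
    """
    limit = max(min_size, max_share * len(users))
    # log1p damps the long tail so heavy posters do not dominate distances.
    feats = {u: [log1p(c) for c in counts] for u, counts in users.items()}
    pending, done = [list(users)], []
    while pending:
        cluster = pending.pop()
        if len(cluster) <= limit:
            done.append(cluster)
            continue
        labels = two_means([feats[u] for u in cluster])
        left = [u for u, lab in zip(cluster, labels) if lab == 0]
        right = [u for u, lab in zip(cluster, labels) if lab == 1]
        if not left or not right:
            # The cluster cannot be separated further; keep it as is.
            done.append(cluster)
            continue
        pending.extend([left, right])
    return done
```

In this sketch a single heavy poster is typically isolated by the first split, and the remaining large group of light posters is then split again by the shape of its activity rather than being left as one oversized cluster.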
