Clustering memes in social media streams

Abstract

The problem of clustering content in social media has pervasive applications, including the identification of discussion topics, event detection, and content recommendation. Here, we describe a streaming framework for online detection and clustering of memes in social media, specifically Twitter. A pre-clustering procedure, namely protomeme detection, first isolates atomic tokens of information carried by the tweets. Protomemes are then aggregated, based on multiple similarity measures, to obtain memes as cohesive groups of tweets reflecting actual concepts or topics of discussion. The clustering algorithm takes into account various dimensions of the data and metadata, including natural language, the social network, and the patterns of information diffusion. As a result, our system can build clusters of semantically, structurally, and topically related tweets. The clustering process is based on a variant of Online K-means that incorporates a memory mechanism, used to "forget" old memes and replace them over time with new ones. The evaluation of our framework is carried out using a dataset of Twitter trending topics. Over a 1-week period, we systematically determined whether our algorithm was able to recover the trending hashtags. We show that the proposed method outperforms baseline algorithms that only use content features, as well as a state-of-the-art event detection method that assumes full knowledge of the underlying follower network. Finally, we show that our online learning framework is flexible, because it is independent of the adopted clustering algorithm, and well suited to a streaming scenario.
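The abstract gives no implementation details for the clustering step, so the following is only a minimal sketch of the general idea of an online k-means with a forgetting mechanism, not the authors' procedure. It assumes that each item arrives as a plain numeric feature vector, that similarity is cosine similarity, and that "forgetting" means evicting the least recently updated cluster once the cluster budget is full. The class name `ForgetfulOnlineKMeans`, the `min_similarity` threshold, and the eviction rule are all hypothetical choices made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


class ForgetfulOnlineKMeans:
    """Toy online k-means that evicts the stalest cluster to admit a new one.

    Assumption: this is an illustrative stand-in, not the method of the paper.
    """

    def __init__(self, k, min_similarity=0.5):
        self.k = k                            # maximum number of live clusters
        self.min_similarity = min_similarity  # hypothetical outlier threshold
        self.centroids = []                   # running mean of each cluster
        self.counts = []                      # items assigned to each cluster
        self.last_seen = []                   # step of each cluster's last update
        self.step = 0

    def _start_cluster(self, index, x):
        """Create or overwrite the cluster at `index` with the single item x."""
        if index == len(self.centroids):
            self.centroids.append(list(x))
            self.counts.append(1)
            self.last_seen.append(self.step)
        else:
            self.centroids[index] = list(x)
            self.counts[index] = 1
            self.last_seen[index] = self.step
        return index

    def partial_fit(self, x):
        """Process one vector from the stream and return its cluster index."""
        self.step += 1
        sims = [cosine(x, c) for c in self.centroids]
        best = max(range(len(sims)), key=sims.__getitem__) if sims else -1

        if best >= 0 and sims[best] >= self.min_similarity:
            # Close enough to an existing cluster: incremental mean update.
            self.counts[best] += 1
            n = self.counts[best]
            self.centroids[best] = [
                c + (xi - c) / n for c, xi in zip(self.centroids[best], x)
            ]
            self.last_seen[best] = self.step
            return best

        if len(self.centroids) < self.k:
            # Room left in the budget: open a new cluster.
            return self._start_cluster(len(self.centroids), x)

        # Budget full: "forget" the least recently updated cluster.
        stale = min(range(self.k), key=self.last_seen.__getitem__)
        return self._start_cluster(stale, x)


model = ForgetfulOnlineKMeans(k=2)
labels = [model.partial_fit(v) for v in ([1, 0], [0, 1], [0.9, 0.1], [-1, -1])]
```

In this toy run the third vector joins the first cluster, and the fourth vector, which resembles neither centroid, replaces the second cluster because it was updated least recently. A real system would combine several similarity measures over content, network, and diffusion features, as the abstract describes.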

[58]  Jacob Ratkiewicz,et al.  Detecting and Tracking the Spread of Astroturf Memes in Microblog Streams , 2010, ArXiv.