Clustering memes in social media

The increasing pervasiveness of social media creates new opportunities to study human social behavior, while challenging our capability to analyze their massive data streams. One of the emerging tasks is to distinguish between different kinds of activities, for example engineered misinformation campaigns versus spontaneous communication. Such detection problems require a formal definition of meme, or unit of information that can spread from person to person through the social network. Once a meme is identified, supervised learning methods can be applied to classify different types of communication. The appropriate granularity of a meme, however, is hardly captured from existing entities such as tags and keywords. Here we present a framework for the novel task of detecting memes by clustering messages from large streams of social data. We evaluate various similarity measures that leverage content, metadata, network features, and their combinations. We also explore the idea of pre-clustering on the basis of existing entities. A systematic evaluation is carried out using a manually curated dataset as ground truth. Our analysis shows that pre-clustering and a combination of heterogeneous features yield the best trade-off between number of clusters and their quality, demonstrating that a simple combination based on pairwise maximization of similarity is as effective as a non-trivial optimization of parameters. Our approach is fully automatic, unsupervised, and scalable for real-time detection of memes in streaming data.

[1]  Krithi Ramamritham,et al.  Real Time Discovery of Dense Clusters in Highly Dynamic Graphs: Identifying Real World Events in Highly Dynamic Environments , 2012, Proc. VLDB Endow..

[2]  Michael S. Bernstein,et al.  Twitinfo: aggregating and visualizing microblogs for event exploration , 2011, CHI.

[3]  Jure Leskovec,et al.  Meme-tracking and the dynamics of the news cycle , 2009, KDD.

[4]  Raymond J. Mooney,et al.  Multi-Prototype Vector-Space Models of Word Meaning , 2010, NAACL.

[5]  Peter Mika Ontologies Are Us: A Unified Model of Social Networks and Semantics , 2005, International Semantic Web Conference.

[6]  Brian D. Davison,et al.  Empirical study of topic modeling in Twitter , 2010, SOMA '10.

[7]  Ciro Cattuto,et al.  Dynamical classes of collective attention in twitter , 2011, WWW.

[8]  Wen-tau Yih,et al.  Measuring Word Relatedness Using Heterogeneous Vector Space Models , 2012, HLT-NAACL.

[9]  G. Eysenbach,et al.  Pandemics in the Age of Twitter: Content Analysis of Tweets during the 2009 H1N1 Outbreak , 2010, PloS one.

[10]  M. Meilă Comparing clusterings---an information based distance , 2007 .

[11]  Duncan J. Watts,et al.  Everyone's an influencer: quantifying influence on twitter , 2011, WSDM '11.

[12]  Filippo Menczer,et al.  The Digital Evolution of Occupy Wall Street , 2013, PloS one.

[13]  Mario Cataldi,et al.  Emerging topic detection on Twitter based on temporal and social terms evaluation , 2010, MDMKDD '10.

[14]  Bernardo A. Huberman,et al.  Usage patterns of collaborative tagging systems , 2006, J. Inf. Sci..

[15]  Hosung Park,et al.  What is Twitter, a social network or a news media? , 2010, WWW '10.

[16]  Jacob Ratkiewicz,et al.  Detecting and Tracking the Spread of Astroturf Memes in Microblog Streams , 2010, ArXiv.

[17]  Leon Danon,et al.  Comparing community structure identification , 2005, cond-mat/0505245.

[18]  Jacob Ratkiewicz,et al.  Detecting and Tracking Political Abuse in Social Media , 2011, ICWSM.

[19]  Filippo Menczer,et al.  The Geospatial Characteristics of a Social Movement Communication Network , 2013, PloS one.

[20]  Nick Koudas,et al.  TwitterMonitor: trend detection over the twitter stream , 2010, SIGMOD Conference.

[21]  Duncan J. Watts,et al.  Who says what to whom on twitter , 2011, WWW.

[22]  Andrea Lancichinetti,et al.  Detecting the overlapping and hierarchical community structure in complex networks , 2008, 0802.1218.

[23]  Matthew Hurst,et al.  Event Detection and Tracking in Social Streams , 2009, ICWSM.

[24]  Eni Mustafaraj,et al.  From Obscurity to Prominence in Minutes: Political Speech and Real-Time Search , 2010 .

[25]  Martin F. Porter,et al.  An algorithm for suffix stripping , 1997, Program.

[26]  Rizal Setya Perdana What is Twitter , 2013 .

[27]  Jacob Ratkiewicz,et al.  Truthy: mapping the spread of astroturf in microblog streams , 2010, WWW.

[28]  John R. Kender,et al.  Visual memes in social media: tracking real-world news in YouTube videos , 2011, ACM Multimedia.

[29]  Lada A. Adamic,et al.  Memes Online: Extracted, Subtracted, Injected, and Recollected , 2011, ICWSM.

[30]  Hila Becker,et al.  Learning similarity metrics for event identification in social media , 2010, WSDM '10.

[31]  Charu C. Aggarwal,et al.  Event Detection in Social Streams , 2012, SDM.

[32]  Christos Faloutsos,et al.  Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining , 2013, ASONAM 2013.

[33]  Hila Becker,et al.  Beyond Trending Topics: Real-World Event Identification on Twitter , 2011, ICWSM.