Cluster-discovery of Twitter messages for event detection and trending

Abstract Social media data carry abundant hidden occurrences of real-time events. This paper proposes a novel methodology for detecting and trending events from tweet clusters discovered using the locality-sensitive hashing (LSH) technique. Key challenges include: (1) constructing a dictionary using incremental term frequency–inverse document frequency (TF–IDF) in high-dimensional data to create tweet feature vectors, (2) leveraging LSH to find truly interesting events, (3) trending the behavior of an event based on time, geo-location, and cluster size, and (4) speeding up the cluster-discovery process while retaining cluster quality. Experiments are conducted for a specific event, and the clusters discovered using LSH and K-means are compared with those produced by the group average agglomerative clustering technique.
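As a rough illustration of the first two steps named in the abstract, the following minimal sketch builds a dictionary incrementally, weights tweet tokens with TF–IDF, and groups the resulting sparse vectors into buckets with a random-hyperplane (cosine) LSH. It is not the authors' implementation: the class names `IncrementalTfidf` and `HyperplaneLSH`, the smoothed IDF formula, the choice of random-hyperplane hashing, and the parameters `n_bits` and `seed` are all assumptions made for the example.

```python
import math
import random
from collections import Counter, defaultdict


class IncrementalTfidf:
    """Hypothetical helper: a dictionary and document counts that grow per tweet."""

    def __init__(self):
        self.vocab = {}            # term -> feature index
        self.doc_freq = Counter()  # term -> number of tweets containing it
        self.n_docs = 0

    def add(self, tokens):
        """Register one tweet, extending the dictionary with any new terms."""
        self.n_docs += 1
        for term in set(tokens):
            self.vocab.setdefault(term, len(self.vocab))
            self.doc_freq[term] += 1

    def vector(self, tokens):
        """Return a sparse {feature index: weight} TF-IDF vector for known terms."""
        vec = {}
        for term, count in Counter(tokens).items():
            if term in self.vocab:
                # Smoothed IDF (an assumption; the paper's exact formula is not given here).
                idf = math.log((1 + self.n_docs) / (1 + self.doc_freq[term])) + 1.0
                vec[self.vocab[term]] = count * idf
        return vec


class HyperplaneLSH:
    """Hypothetical helper: sign-of-random-projection hashing into buckets."""

    def __init__(self, n_bits=8, seed=0):
        self.n_bits = n_bits
        self.rng = random.Random(seed)
        self.weights = {}                 # feature index -> n_bits Gaussian weights
        self.buckets = defaultdict(list)  # signature -> list of tweet ids

    def _weights_for(self, dim):
        # Draw hyperplane components lazily so the dictionary can keep growing.
        if dim not in self.weights:
            self.weights[dim] = [self.rng.gauss(0.0, 1.0) for _ in range(self.n_bits)]
        return self.weights[dim]

    def signature(self, vec):
        """One bit per hyperplane: 1 if the projection is non-negative, else 0."""
        sums = [0.0] * self.n_bits
        for dim, value in vec.items():
            w = self._weights_for(dim)
            for i in range(self.n_bits):
                sums[i] += value * w[i]
        return tuple(int(s >= 0) for s in sums)

    def insert(self, doc_id, vec):
        """Place a tweet in the bucket named by its signature."""
        sig = self.signature(vec)
        self.buckets[sig].append(doc_id)
        return sig


tweets = [
    ["earthquake", "tokyo", "now"],
    ["earthquake", "tokyo", "now"],
    ["concert", "tonight"],
]
model = IncrementalTfidf()
for tokens in tweets:
    model.add(tokens)

lsh = HyperplaneLSH(n_bits=8, seed=1)
signatures = [lsh.insert(i, model.vector(t)) for i, t in enumerate(tweets)]
```

In this sketch, tweets with identical weighted vectors always share a bucket, and vectors at a small cosine angle are likely, though not guaranteed, to collide. Each non-empty bucket can then be treated as a candidate cluster whose size, timestamps, and geo-locations are tracked over time.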
