Semi-Supervised Event-related Tweet Identification with Dynamic Keyword Generation

Twitter provides a convenient channel for accessing up-to-the-minute information about major events. However, it is challenging to acquire a clean and complete set of event-related data because of the characteristics of tweets, e.g., they are short and noisy. In this paper, we propose a semi-supervised method to obtain event-related tweets from the Twitter stream with high precision and recall. Specifically, candidate event-related tweets are selected based on a set of keywords. We propose to generate and update these keywords dynamically as the event develops. To be included in the keyword set, words are evaluated on three criteria: single-word properties, properties based on co-occurring words, and changes in word importance over time. Our solution is capable of capturing keywords for emerging aspects, or for aspects of increasing importance, as the event evolves. By leveraging keyword importance information and a few labeled tweets, we propose a semi-supervised expectation maximization process to identify event-related tweets. This process significantly reduces the human effort needed to acquire high-quality tweets. Experiments on three real-world datasets show that our solution outperforms state-of-the-art approaches by up to 10% in F1 measure.
