Paper Information - Extracting a Topic Specific Dataset from a Twitter Archive

Extracting a Topic Specific Dataset from a Twitter Archive

Datasets extracted from the microblogging service Twitter are often generated using specific query terms or hashtags. We describe how a dataset produced using the query term ‘syria’ can be expanded to include tweets on the topic of Syria that do not contain that query term. We compare three methods for this task: using the top hashtags from the set as search terms, using a hand-selected set of hashtags as search terms, and using LDA topic modelling to cluster tweets and then selecting appropriate clusters. We describe an evaluation method for assessing the relevance and accuracy of the tweets returned.
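As an illustration of the first method (top hashtags as search terms), the following is a minimal sketch rather than the authors' implementation. The function names, the sample tweets, the case-insensitive substring match on the query term, and the cut-off `k` are all assumptions made for the example; the abstract does not specify any of them.

```python
import re
from collections import Counter

# Assumption: a hashtag is '#' followed by word characters.
HASHTAG = re.compile(r"#(\w+)")


def hashtags(text):
    """Return the lowercased hashtags in a tweet, without the leading '#'."""
    return [tag.lower() for tag in HASHTAG.findall(text)]


def top_hashtags(seed_tweets, k, exclude=()):
    """Return the k most frequent hashtags in the seed set, skipping `exclude`."""
    counts = Counter(tag for tweet in seed_tweets for tag in hashtags(tweet))
    for tag in exclude:
        counts.pop(tag, None)
    return [tag for tag, _ in counts.most_common(k)]


def expand(archive, seed_term, k):
    """Split an archive into the seed set and the tweets recovered via hashtags.

    The seed set holds tweets containing `seed_term` (hypothetical matching
    rule: case-insensitive substring). The extra set holds tweets that lack
    the term but carry at least one of the seed set's top-k hashtags.
    """
    seed = [t for t in archive if seed_term in t.lower()]
    tags = set(top_hashtags(seed, k, exclude=(seed_term,)))
    extra = [
        t for t in archive
        if seed_term not in t.lower() and tags & set(hashtags(t))
    ]
    return seed, extra, tags


# Invented sample data, for illustration only.
archive = [
    "Fighting continues in Syria #aleppo #assad",
    "News from syria today #aleppo",
    "Shelling reported overnight #aleppo",
    "Great coffee this morning #monday",
    "#Syria talks resume #assad",
]

seed, extra, tags = expand(archive, "syria", k=2)
print(sorted(tags))
print(extra)
```

In this sketch the third tweet is recovered because it shares a frequent hashtag with the seed set even though it never mentions the query term. Frequent hashtags can also be generic ones unrelated to the topic, which is one plausible motivation for comparing against a hand-selected hashtag list.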

[1] Michael I. Jordan, et al. Latent Dirichlet Allocation, 2001, J. Mach. Learn. Res.

[2] Craig MacDonald, et al. Evaluating Real-Time Search over Tweets, 2012, ICWSM.

[3] Iadh Ounis, et al. Real-Time Detection, Tracking, and Monitoring of Automatically Discovered Events in Social Media, 2014, ACL.

[4] Jure Leskovec, et al. Patterns of temporal variation in online media, 2011, WSDM '11.