Activity-based sampling of Twitter users for temporal prediction models

A growing number of applications rely on crowd-sourced data from social media. Some of these applications work with real-time data streams, while others acquire temporal footprints from users' historical timelines. In either case, determining the subset of "credible" users is crucial. Most sampling approaches focus on users' static networks and do not consider how user activity changes over time, which may leave activity gaps in the collected data. Models built on noisy or incomplete data can degrade significantly in performance. In this study, we demonstrate how to sample Twitter users in order to produce more credible data for temporal prediction models. We present an activity-based sampling approach in which users are selected based on their historical activity on Twitter. We compare the predictability of content collected through activity-based sampling and through random sampling in a user-centric temporal model. The results indicate the importance of an activity-oriented sampling method for acquiring more credible content for temporal models.
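To make the contrast between the two sampling strategies concrete, the following is a minimal sketch and not the procedure used in the study. It assumes that "activity" is measured by the longest run of days without a post inside an observation window; the function names, the `max_gap` threshold, and the toy timelines are all hypothetical choices made for illustration.

```python
import random
from datetime import date, timedelta


def max_gap_days(post_dates, start, end):
    """Longest run of consecutive days without a post inside [start, end]."""
    days = sorted(d for d in set(post_dates) if start <= d <= end)
    if not days:
        # No posts at all: the whole window is one gap.
        return (end - start).days + 1
    # Silent days before the first post and after the last post.
    gaps = [(days[0] - start).days, (end - days[-1]).days]
    # Silent days between consecutive posting days.
    gaps += [(b - a).days - 1 for a, b in zip(days, days[1:])]
    return max(gaps)


def activity_based_sample(timelines, start, end, max_gap, k, seed=0):
    """Sample up to k users whose longest silent period is at most max_gap days.

    The gap criterion is an assumption of this sketch, not the paper's rule.
    """
    eligible = sorted(
        user for user, dates in timelines.items()
        if max_gap_days(dates, start, end) <= max_gap
    )
    rng = random.Random(seed)
    return rng.sample(eligible, min(k, len(eligible)))


def random_sample(timelines, k, seed=0):
    """Baseline: sample up to k users uniformly, ignoring their activity."""
    rng = random.Random(seed)
    return rng.sample(sorted(timelines), min(k, len(timelines)))


# Toy timelines (hypothetical data): one steady user, one sparse, one silent.
start, end = date(2015, 1, 1), date(2015, 1, 10)
timelines = {
    "steady": [start + timedelta(days=i) for i in range(10)],
    "sparse": [start, end],
    "silent": [],
}
selected = activity_based_sample(timelines, start, end, max_gap=2, k=5)
baseline = random_sample(timelines, k=2)
```

Under these assumptions only the steady user passes the activity filter, whereas the random baseline may return users whose timelines contain long gaps.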
