Feature selection using Principal Component Analysis for massive retweet detection

Social networks have become a major actor in massive information propagation. The popularity of the Twitter platform is due in part to its capability of relaying messages (i.e., tweets) posted by users. This mechanism, called retweet, allows users to massively share tweets they consider potentially interesting for others. In this paper, we study the behavior of tweets that are massively retweeted within a short period of time. We first analyze specific tweet features through a Principal Component Analysis (PCA) to better understand the behavior of highly forwarded tweets as opposed to those retweeted only a few times. We then propose to automatically detect massively retweeted messages. The qualitative study is used to select the features that allow the best classification performance. We show that selecting only the most correlated features leads to the best classification accuracy (F-measure of 65.7%), a gain of about 2.4 points compared with using the complete set of features.
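To make the idea of PCA-driven feature selection concrete, the following is a minimal sketch under stated assumptions, not the procedure used in the paper. The abstract does not specify the tweet features, the selection rule, or the classifier, so everything here is hypothetical: the function name `pca_feature_ranking`, the synthetic data, and the choice to score each feature by its largest absolute correlation (loading) with the leading principal components of the correlation matrix.

```python
import numpy as np


def pca_feature_ranking(X, n_components=1):
    """Rank features by their strongest absolute correlation with the
    leading principal components (hypothetical selection rule).

    Assumes every column of X has non-zero variance.
    """
    # PCA on the correlation matrix, i.e. on standardized features.
    R = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)  # eigenvalues in ascending order

    # Reorder so the component with the largest variance comes first.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Loadings: correlation between each feature and each retained component.
    loadings = eigvecs[:, :n_components] * np.sqrt(eigvals[:n_components])

    # Score each feature by its largest absolute loading, then sort descending.
    scores = np.abs(loadings).max(axis=1)
    ranking = np.argsort(scores)[::-1]
    return ranking, scores


# Synthetic illustration (not data from the paper): features 0 and 1 share
# a latent factor, while features 2 and 3 are independent noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=500)
X_demo = np.column_stack([
    latent + 0.1 * rng.normal(size=500),
    latent + 0.1 * rng.normal(size=500),
    rng.normal(size=500),
    rng.normal(size=500),
])
ranking_demo, scores_demo = pca_feature_ranking(X_demo, n_components=1)
top_features = ranking_demo[:2]  # candidate subset to pass to a classifier
```

In a pipeline of the kind the abstract describes, the retained subset would then be given to a classifier and its F-measure compared with that obtained from the full feature set. The number of components and the number of features to keep are tuning choices left open here.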
