Predicting susceptibility to social bots on Twitter

The popularity of the Twitter social networking site has made it a target for social bots, which use increasingly complex algorithms to engage users and pass as human. While much research has studied how to identify such bots as part of spam detection, little has examined the other side of the question: detecting which users are likely to be fooled by bots. In this paper, we examine a dataset of 610 users who were messaged by Twitter bots, and determine which features describing these users were most helpful in predicting whether they would interact with the bots (by replying to or following the bot). We then use six classifiers to build models that predict whether a given user will interact with a bot, both with the selected features and with all features. We find that a user's Klout score, friends count, and followers count are most predictive of whether the user will interact with a bot, and that the Random Forest algorithm produces the best classifier when used in conjunction with one of the better feature ranking algorithms (although poor feature ranking can actually make performance worse than using no feature ranking at all). Overall, these results show promise for helping to understand which users are most vulnerable to social bots.
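As a rough illustration of the feature ranking step described above, the following minimal sketch scores each user feature against a binary "interacted with the bot" label and sorts the features by that score. Everything in it is an assumption made for illustration: the data are synthetic, the scoring function (absolute Pearson correlation with the label) is a simple stand-in and not one of the rankers evaluated in the paper, and the names `klout_score`, `friends_count`, `followers_count`, `noise_a`, and `noise_b` are hypothetical column names. The classification step (for example, a Random Forest trained on the top-ranked features) is not shown.

```python
import random
from statistics import mean, pstdev


def abs_correlation(xs, ys):
    """Absolute Pearson correlation between a numeric feature and a 0/1 label."""
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no ranking information
    mx, my = mean(xs), mean(ys)
    cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return abs(cov / (sx * sy))


def rank_features(rows, labels, names):
    """Return feature names sorted from most to least label-correlated."""
    scores = {}
    for j, name in enumerate(names):
        column = [row[j] for row in rows]
        scores[name] = abs_correlation(column, labels)
    return sorted(names, key=lambda n: scores[n], reverse=True)


def make_toy_data(n=610, seed=0):
    """Synthetic stand-in data (NOT the paper's dataset).

    The first three features are shifted for users labelled 1, purely so
    that the ranking has something to find; the last two are pure noise.
    """
    rng = random.Random(seed)
    names = ["klout_score", "friends_count", "followers_count",
             "noise_a", "noise_b"]
    shifts = [2.0, 1.5, 1.0, 0.0, 0.0]  # assumed effect sizes
    rows, labels = [], []
    for _ in range(n):
        label = 1 if rng.random() < 0.5 else 0
        rows.append([rng.gauss(label * s, 1.0) for s in shifts])
        labels.append(label)
    return rows, labels, names


if __name__ == "__main__":
    rows, labels, names = make_toy_data()
    ranked = rank_features(rows, labels, names)
    top_k = ranked[:3]  # features that would be passed on to a classifier
    print("ranking:", ranked)
    print("selected:", top_k)
```

In a fuller pipeline, the selected columns would be used to train each classifier, and the result compared against training on all columns; that comparison is what lets a poor ranking show up as worse performance than no ranking at all.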
