Detecting Non-personal and Spam Users on Geo-tagged Twitter Network

With the rapid growth and popularity of mobile devices and location-aware technologies, online social networks such as Twitter have become an important data source for geo-social network research. Non-personal accounts, spam users, and junk tweets, however, severely hamper the extraction of meaningful information and the validation of research findings based on tweets or Twitter users. Detecting such users is therefore a critical and fundamental step for Twitter-related geographic research. In this study, we develop a methodological framework to: (1) extract user characteristics based on geographic, graph-based, and content-based features of tweets; (2) construct a training dataset by manually inspecting and labeling a large sample of Twitter users; and (3) derive reliable rules and knowledge for detecting non-personal users with supervised classification methods. The extracted geographic characteristics of a user include maximum speed, mean speed, the number of different counties that the user has visited, and others. Content-based characteristics of a user include the number of tweets per month, the percentage of tweets with URLs or hashtags, and the percentage of tweets with emotions, detected with sentiment analysis. The extracted rules are theoretically interesting and practically useful. Specifically, the results show that geographic features, such as the average speed and the frequency of county changes, can serve as important indicators of non-personal users. Among non-spatial characteristics, the percentage of tweets with a high human factor index, the percentage of tweets with URLs, and the percentage of tweets that mention or reply to other users are the top three features for detecting non-personal users.
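The geographic characteristics named above (maximum speed, mean speed, number of distinct counties, frequency of county changes) could be computed roughly as in the following sketch. This is an illustration only, not the study's implementation: the `GeoTweet` record, its field names, the haversine great-circle distance, and the choice to measure speed between consecutive tweets in km/h are all assumptions made for this example.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)


@dataclass
class GeoTweet:
    """Hypothetical record for one geo-tagged tweet."""
    timestamp: float  # seconds since epoch
    lat: float        # degrees
    lon: float        # degrees
    county: str       # county identifier, assumed to be pre-assigned


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def geographic_features(tweets):
    """Per-user geographic features from a list of GeoTweet records."""
    ordered = sorted(tweets, key=lambda t: t.timestamp)
    pairs = list(zip(ordered, ordered[1:]))

    speeds = []
    for prev, cur in pairs:
        hours = (cur.timestamp - prev.timestamp) / 3600.0
        if hours <= 0:
            continue  # skip simultaneous tweets to avoid division by zero
        dist = haversine_km(prev.lat, prev.lon, cur.lat, cur.lon)
        speeds.append(dist / hours)

    return {
        "max_speed_kmh": max(speeds, default=0.0),
        "mean_speed_kmh": sum(speeds) / len(speeds) if speeds else 0.0,
        "n_counties": len({t.county for t in ordered}),
        "county_changes": sum(1 for a, b in pairs if a.county != b.county),
    }


if __name__ == "__main__":
    sample = [
        GeoTweet(0.0, 0.0, 0.0, "A"),
        GeoTweet(3600.0, 0.0, 1.0, "B"),
        GeoTweet(7200.0, 0.0, 1.0, "B"),
    ]
    print(geographic_features(sample))
```

A feature dictionary of this kind, one per user, is the sort of input a supervised classifier would take alongside the manually assigned labels. Implausibly high speeds or very frequent county changes are the patterns the abstract identifies as indicators of non-personal users.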
