An analysis of 14 million tweets on hashtag‐oriented spamming

Over the years, Twitter has become a popular platform for information dissemination and information gathering. However, its popularity has attracted not only legitimate users but also spammers who exploit social graphs, popular keywords, and hashtags for malicious purposes. In this paper, we present a detailed analysis of the HSpam14 dataset, which contains 14 million tweets with spam and ham (i.e., nonspam) labels, to understand spamming activities on Twitter. The primary focus of this paper is to analyze spam on Twitter in terms of hashtags, tweet contents, and user profiles, which are useful for both tweet‐level and user‐level spam detection. First, we compare the usage of hashtags in spam and ham tweets based on frequency, position, orthography, and co‐occurrence. Second, for content‐based analysis, we analyze the variations in word usage, metadata, and near‐duplicate tweets. Third, for user‐based analysis, we investigate user profile information. Our study confirms that spammers use popular hashtags to promote their tweets. We also observe differences in word usage between spam and ham tweets: spam tweets are more likely to use exclamation points and capitalized words for emphasis. Furthermore, we observe that spammers use multiple accounts to post near‐duplicate tweets to promote their services and products. Unlike spammers, legitimate users tend to provide more information in their profiles, such as their locations and personal descriptions. In summary, this study presents a comprehensive analysis of hashtags, tweet contents, and user profiles in Twitter spamming.
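As a rough illustration of the tweet‐level signals named in the abstract (hashtag frequency, position, and orthography, plus exclamation points and capitalized words), the following minimal sketch computes a few such counts from raw tweet text. The function name `tweet_features`, the feature names, and the tokenization rules are assumptions made for this example; they are not the feature definitions used in the paper or in the HSpam14 dataset.

```python
import re

# Assumption: a hashtag is "#" followed by word characters.
HASHTAG_RE = re.compile(r"#\w+")


def tweet_features(text: str) -> dict:
    """Compute simple hashtag and emphasis counts for one tweet (illustrative only)."""
    tokens = text.split()
    hashtags = HASHTAG_RE.findall(text)

    # Plain words: skip hashtags, mentions, and links, then trim punctuation.
    words = [
        tok.strip(".,!?:;\"'")
        for tok in tokens
        if not tok.startswith(("#", "@", "http"))
    ]

    return {
        # Frequency: how many hashtags the tweet carries.
        "hashtag_count": len(hashtags),
        # Position: whether the tweet opens or closes with a hashtag.
        "starts_with_hashtag": bool(tokens) and tokens[0].startswith("#"),
        "ends_with_hashtag": bool(tokens) and tokens[-1].startswith("#"),
        # Orthography: hashtags written entirely in upper case.
        "uppercase_hashtags": sum(1 for h in hashtags if h[1:].isupper()),
        # Emphasis: exclamation points and fully capitalized words.
        "exclamation_count": text.count("!"),
        "capitalized_words": sum(1 for w in words if len(w) > 1 and w.isupper()),
    }


if __name__ == "__main__":
    print(tweet_features("#WIN a FREE phone now!!! Click here #giveaway #FREE"))
    print(tweet_features("Had a great lunch with friends today #foodie"))
```

Counts like these could serve as inputs to a tweet‐level classifier; which of them actually separate spam from ham is an empirical question that the analysis itself addresses.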
