HSpam14: A Collection of 14 Million Tweets for Hashtag-Oriented Spam Research

Hashtags facilitate information diffusion on Twitter by creating dynamic, virtual communities that aggregate information from all Twitter users. Because hashtags serve as additional channels through which a user's tweets can reach users other than the author's own followers, hashtags are targeted for spamming (e.g., hashtag hijacking), particularly popular and trending hashtags. Although much effort has been devoted to fighting email and web spam, few studies address hashtag-oriented spam in tweets. In this paper, we collected 14 million tweets that matched a set of trending hashtags over a two-month period and then systematically annotated the tweets as spam or ham (i.e., non-spam). We name the annotated dataset HSpam14. Our annotation process consists of four major steps: (i) heuristic-based selection, to search for tweets that are more likely to be spam; (ii) near-duplicate cluster-based annotation, to first group similar tweets into clusters and then label the clusters; (iii) reliable ham tweet detection, to label tweets that are non-spam; and (iv) Expectation-Maximization (EM)-based label prediction, to predict the labels of the remaining unlabeled tweets. One major contribution of this work is the creation of the HSpam14 dataset, which can be used for research on hashtag-oriented spam in tweets. A second contribution is the set of observations drawn from a preliminary analysis of the HSpam14 dataset.
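To make step (ii) concrete, the following is a minimal sketch of near-duplicate clustering, not the procedure used to build HSpam14. It assumes word-bigram shingles, Jaccard similarity, and a similarity threshold of 0.6; the function names (`shingles`, `jaccard`, `cluster_near_duplicates`), the normalization rules, and the threshold are all illustrative choices that do not come from the paper.

```python
import re
from itertools import combinations


def shingles(text, k=2):
    """Return the set of word k-grams of a tweet after light normalization."""
    # Assumption: lowercase the text and replace URLs and @mentions with
    # placeholders, so tweets that differ only in a link or a mentioned
    # user produce the same shingles.
    text = re.sub(r"https?://\S+", "<url>", text.lower())
    text = re.sub(r"@\w+", "<user>", text)
    tokens = text.split()
    if len(tokens) < k:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_near_duplicates(tweets, threshold=0.6):
    """Group tweet indices whose shingle sets are linked by similarity >= threshold."""
    parent = list(range(len(tweets)))  # union-find forest over tweet indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    sets = [shingles(t) for t in tweets]
    # Exhaustive pairwise comparison: fine for a toy example only.
    for i, j in combinations(range(len(tweets)), 2):
        if jaccard(sets[i], sets[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(tweets)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


tweets = [
    "Win a free iPhone now http://a.co/1 #WorldCup",
    "Win a free iPhone now http://b.co/2 #WorldCup",
    "What a goal by Germany tonight #WorldCup",
]
print(cluster_near_duplicates(tweets))
```

In this toy input the first two tweets differ only in their URL and fall into one cluster, which an annotator could then label once for all members. The pairwise comparison is quadratic in the number of tweets, so a collection of millions of tweets would need an approximate technique such as MinHash with locality-sensitive hashing instead.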
