Spammer Behavior Analysis and Detection in User Generated Content on Social Networks

Spam is surging alongside the explosive growth of user generated content (UGC) on the Internet. Spammers often insert popular keywords, or simply copy and paste recent articles from the Web and add spam links, in an attempt to evade content-based detection. To detect spam in UGC effectively, we first conduct a comprehensive analysis of spamming activities on a large commercial UGC site over 325 days, covering more than 6 million posts and nearly 400 thousand users. Our analysis shows that UGC spammers exhibit distinctive non-textual patterns in their posting activities, advertised spam link metrics, and spam hosting behaviors. Using these non-textual features, we show with several classification methods that a high detection rate can be achieved offline. These results further motivate us to develop a runtime scheme, BARS, that detects spam posts based on these spamming patterns. Experimental results demonstrate the effectiveness and robustness of BARS.
