NEIGHBORWATCHER: A Content-Agnostic Comment Spam Inference System

Comment spam has become a popular means for spammers to attract direct visits to target websites or to manipulate the search ranks of those websites. By posting a small number of spam messages on each victim website (e.g., normal websites such as forums, wikis, guestbooks, and blogs, which we term spam harbors in this paper) while spamming across a large variety of harbors, spammers can not only directly inherit some of the reputation of these harbors but also evade the content-based detection systems deployed on them. To find suitable harbors, spammers typically rely on their own preferred methods, chosen according to their available resources and the cost involved (e.g., the ease of automatic posting and the chance of content sanitization on the website). As a result, each spammer builds a relatively stable set of harbors that have proved easy and friendly to post spam on, which we refer to as the spammer's spamming infrastructure. Our measurement also shows that different spammers typically use different spamming infrastructures, although these sometimes overlap. This paper presents NEIGHBORWATCHER, a comment spam inference system that exploits information about spammers’ spamming infrastructures to infer comment spam. At its core, NEIGHBORWATCHER runs a graph-based algorithm to characterize the spamming neighbor relationship between harbors, and it reports a link as spam when the same link also appears on the harbor’s clique neighbors. Starting from a small seed set of known spam links, our system inferred roughly 91,636 instances of comment spam and 16,694 spam harbors that are frequently used by comment spammers. Furthermore, our evaluation on real-world data shows that NEIGHBORWATCHER can keep inferring new comment spam and finding new spam harbors every day.
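The neighbor-based inference described above can be illustrated with a minimal sketch. It is not the NEIGHBORWATCHER algorithm itself: it assumes that two harbors are neighbors when they share at least one known seed spam link, and it simplifies "clique neighbors" to direct neighbors in that graph. The function names `build_neighbor_graph` and `infer_spam`, the thresholds `min_shared` and `min_neighbors`, and all sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_neighbor_graph(seed_postings, min_shared=1):
    """Link two harbors when they share at least `min_shared` seed spam links.

    seed_postings: dict mapping a known spam link to the set of harbors
    on which it was observed (assumed input format).
    """
    shared = defaultdict(int)
    for harbors in seed_postings.values():
        for a, b in combinations(sorted(harbors), 2):
            shared[(a, b)] += 1

    graph = defaultdict(set)
    for (a, b), count in shared.items():
        if count >= min_shared:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def infer_spam(graph, postings, min_neighbors=2):
    """Flag a link seen on a harbor if it also appears on enough neighbors.

    postings: dict mapping a harbor to the set of links posted on it.
    Simplification: direct neighbors stand in for clique neighbors.
    """
    inferred = set()
    for harbor, links in postings.items():
        neighbors = graph.get(harbor, ())
        for link in links:
            hits = sum(1 for n in neighbors if link in postings.get(n, ()))
            if hits >= min_neighbors:
                inferred.add(link)
    return inferred


# Hypothetical seed data: known spam links and the harbors they hit.
seed = {
    "spam1.example": {"h1", "h2", "h3"},
    "spam2.example": {"h1", "h2"},
}
# Hypothetical new observations: links currently posted on each harbor.
observed = {
    "h1": {"new.example", "benign.example"},
    "h2": {"new.example"},
    "h3": {"new.example"},
    "h4": {"benign.example"},
}

graph = build_neighbor_graph(seed)
flagged = infer_spam(graph, observed, min_neighbors=2)
```

In this toy data, `new.example` appears on `h1` and on both of its neighbors, so it is flagged without any inspection of its content, while `benign.example` appears on no neighbor of the harbors that carry it and is left alone. A closer approximation would restrict the check to neighbors that are mutually connected, which is left out here for brevity.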
