A Self-Supervised Approach to Comment Spam Detection Based on Content Analysis

This paper studies the problems and threats posed by blog comment spam, a type of spam in the blogosphere. It explores the challenges that comment spam introduces and generalizes much of the analysis to other kinds of short-text spam. The authors analyze several high-level features of spam and legitimate comments based on the content of blog postings. They then cluster the data separately for each feature using the K-Means clustering algorithm, and apply self-supervised learning to classify spam and legitimate comments automatically. Compared with existing solutions, the approach is more flexible and adapts more readily to its environment because it requires minimal human intervention. A preliminary evaluation of the proposed spam detection system shows promising results.
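
The idea of clustering each feature separately with K-Means and then labeling comments without human annotation can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the two features (similarity between comment and post, number of links), the sample values, the `HIGH_IS_SPAM` direction flags, and the majority vote used to combine the per-feature clusters into pseudo-labels.

```python
from statistics import mean


def kmeans_1d(values, iters=20):
    """Two-cluster K-Means on a single feature.

    Returns a 0/1 label per value; 1 marks the cluster with the higher centroid.
    """
    lo, hi = min(values), max(values)  # deterministic initial centroids
    labels = [0] * len(values)
    for _ in range(iters):
        # Assignment step: each value joins its nearest centroid.
        labels = [1 if abs(v - hi) < abs(v - lo) else 0 for v in values]
        # Update step: recompute each centroid as the mean of its members.
        low = [v for v, lab in zip(values, labels) if lab == 0]
        high = [v for v, lab in zip(values, labels) if lab == 1]
        new_lo = mean(low) if low else lo
        new_hi = mean(high) if high else hi
        if (new_lo, new_hi) == (lo, hi):
            break  # converged
        lo, hi = new_lo, new_hi
    return labels


# Hypothetical comments: (similarity to the blog post, number of links).
comments = [
    (0.82, 0), (0.75, 1), (0.68, 0), (0.71, 0),  # resemble legitimate comments
    (0.05, 4), (0.10, 6), (0.02, 5),             # resemble spam
]

# Assumption: for each feature, is the HIGH-value cluster the spam-like one?
HIGH_IS_SPAM = [False, True]


def pseudo_labels(rows, high_is_spam):
    """Cluster every feature on its own, then mark a comment as spam when
    more than half of the features place it in their spam-like cluster."""
    votes = [0] * len(rows)
    for j, spam_when_high in enumerate(high_is_spam):
        column = [row[j] for row in rows]
        for i, label in enumerate(kmeans_1d(column)):
            if (label == 1) == spam_when_high:
                votes[i] += 1
    return [2 * v > len(high_is_spam) for v in votes]


labels = pseudo_labels(comments, HIGH_IS_SPAM)
```

In a self-supervised setup of this kind, pseudo-labels such as `labels` could then serve as training targets for a conventional classifier; which classifier the authors use, and how they combine the per-feature clusters, is not stated in the abstract.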
