Detecting splogs via temporal dynamics using self-similarity analysis

This article addresses the problem of spam blog (splog) detection using the temporal and structural regularity of content, post times, and links. Splogs are undesirable blogs meant to attract search engine traffic, used solely for promoting affiliate sites. Blogs are a popular online medium, and splogs not only degrade the quality of search engine results but also waste network resources. Splog detection is difficult because stable content descriptors are lacking. We have developed a new technique for detecting splogs, based on the observation that a blog is a dynamic, growing sequence of entries (or posts) rather than a collection of individual pages. In our approach, splogs are recognized by their temporal characteristics and content. There are three key ideas in our splog detection framework. (a) We represent blog temporal dynamics using self-similarity matrices defined on the histogram intersection similarity measure of the time, content, and link attributes of posts, to investigate the temporal changes of the post sequence. (b) We study blog temporal characteristics using a visual representation derived from the self-similarity measures. The visual signature reveals correlations between attributes and posts that depend on the type of blog (normal blog or splog). (c) We propose two types of novel temporal features to capture splog temporal characteristics. In our splog detector, these novel features are combined with content-based features. We extract a content-based feature vector from blog home pages as well as from different parts of the blog. The dimensionality of the feature vector is reduced by Fisher linear discriminant analysis. We have tested an SVM-based splog detector using the proposed features on real-world datasets, with appreciable results (90% accuracy).
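As an illustration of idea (a), a self-similarity matrix over a post sequence might be computed roughly as follows. This is a minimal sketch, not the authors' implementation: it assumes each post is reduced to a bag of word tokens and that histograms are normalized to sum to 1 before the intersection is taken. The function names and the toy posts are hypothetical, and the abstract does not specify how the time and link attributes are binned.

```python
from collections import Counter


def normalized_histogram(tokens):
    """Turn a list of tokens into a histogram whose values sum to 1."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {token: count / total for token, count in counts.items()}


def histogram_intersection(h1, h2):
    """Sum of bin-wise minima; close to 1 for identical normalized histograms."""
    return sum(min(value, h2.get(key, 0.0)) for key, value in h1.items())


def self_similarity_matrix(posts):
    """Return S where S[i][j] is the histogram intersection of posts i and j."""
    histograms = [normalized_histogram(post) for post in posts]
    n = len(histograms)
    return [
        [histogram_intersection(histograms[i], histograms[j]) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical toy sequence: two near-duplicate posts followed by an unrelated one.
posts = [
    "cheap pills buy now".split(),
    "cheap pills buy now".split(),
    "hiking trip photos from sunday".split(),
]
S = self_similarity_matrix(posts)
for row in S:
    print([round(value, 2) for value in row])
```

Under these assumptions, repeated content shows up as high off-diagonal values in the matrix, which is the kind of regularity the visual signature in (b) is meant to expose. The same construction could be applied to histograms built from post times or outgoing links instead of words.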
