Introduction to special section on adversarial issues in Web search
Over the past decade, Web search engines have become the predominant tool for locating information on the Web, and they have grown into cornerstones of the Web economy by driving traffic to commercial Web sites and by creating Web advertising platforms. The economic potential of search engines has given rise to adversaries who try to profit from them, either by influencing search results or by redirecting advertising revenue streams. With hundreds of millions of Web searches issued every day, many content providers have a strong incentive to do whatever is necessary to rank highly in search engine results, while search engine providers want to return the most accurate results possible. These goals are in conflict, and the use of techniques that inflate a page's ranking beyond what it merits is commonly called search engine spam. Such techniques are typically textual, link-based, or a combination of the two.

This issue of TWEB contains three articles devoted to adversarial issues in Web search.

The first article, entitled "Link Analysis for Web Spam Detection," describes how machine learning techniques can uncover statistical anomalies in the Web graph caused by link spam. The authors use a variety of statistical features as inputs to Web spam classifiers, and validate the effectiveness of the classifiers on the WebSpam-UK2002 and WebSpam-UK2006 data sets, two large reference collections of real Web pages that were manually assessed as to whether or not they are spam.

The second article, entitled "Tracking Web Spam with HTML Style Similarities," treats stylistic similarity between the HTML markup of different Web pages as a predictive feature of Web spam. The basic intuition is that many spam pages are generated automatically, with varying content but similar layout. The article proposes a spam detection technique that strips out the textual content, reducing each document to the HTML markup that governs its layout; groups documents into clusters with highly similar layout; and feeds the resulting similarity and cluster information into a Web spam classifier. The authors validate the effectiveness of their technique on the WebSpam-UK2006 data set, the same reference collection used in the first article.

The third article, entitled "Detecting Splogs via Temporal Dynamics Using Self-Similarity Analysis," focuses on identifying Web logs, or "blogs," that consist entirely of spam. A blog contains a sequence of posts, each carrying a time-stamp indicating when it was added to the blog. The technique proposed in the article uses the time of posting in combination with the content of each post as identifying features of Web spam. The basic intuition is that most spam blogs are generated by machine, so their posting times and content exhibit a regularity and self-similarity that human-authored blogs lack.
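As a rough illustration of the temporal half of this intuition, consider the following minimal Python sketch. It is not the self-similarity analysis developed in the article itself, and all names and example timestamps are hypothetical; it merely shows how suspiciously regular inter-post intervals could be scored.

    from statistics import mean, stdev

    def posting_regularity(timestamps):
        """Coefficient of variation of inter-post gaps.

        timestamps: post times in seconds, sorted ascending.
        Machine-generated blogs tend to post at near-constant
        intervals, yielding a value close to 0; human posting
        is burstier, yielding a larger value.
        """
        gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
        if len(gaps) < 2 or mean(gaps) == 0:
            return None  # too few posts to judge
        return stdev(gaps) / mean(gaps)

    # Hypothetical splog posting exactly once an hour
    splog_times = [i * 3600 for i in range(24)]
    # Hypothetical human blog with irregular gaps (seconds)
    human_times = [0, 5000, 26000, 90000, 98000, 200000]

    print(posting_regularity(splog_times))   # ~0.0 -> suspicious
    print(posting_regularity(human_times))   # substantially larger

A score near zero flags the hourly splog as likely machine-generated, while the irregular human blog scores much higher; the article combines temporal cues of this general kind with self-similarity in post content.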