Spam, Damn Spam, and Statistics

The increasing importance of search engines to commercial web sites has given rise to a phenomenon we call “web spam”, that is, web pages that exist only to mislead search engines into (mis)leading users to certain web sites. Web spam is a nuisance to users as well as search engines: users have a harder time finding the information they need, and search engines have to cope with an inflated corpus, which in turn causes their cost per query to increase. Therefore, search engines have a strong incentive to weed out spam web pages from their index. We propose that some spam web pages can be identified through statistical analysis: Certain classes of spam pages, in particular those that are machine-generated, diverge in some of their properties from the properties of web pages at large. We have examined a variety of such properties, including linkage structure, page content, and page evolution, and have found that outliers in the statistical distribution of these properties are highly likely to be caused by web spam. This paper describes the properties we have examined, gives the statistical distributions we have observed, and shows which kinds of outliers are highly correlated with web spam.

[1]  Brian D. Davison Recognizing Nepotistic Links on the Web , 2000 .

[2]  Hector Garcia-Molina,et al.  The Evolution of the Web and Implications for an Incremental Crawler , 2000, VLDB.

[3]  Andrei Z. Broder,et al.  Graph structure in the Web , 2000, Comput. Networks.

[4]  Rajeev Motwani,et al.  The PageRank Citation Ranking : Bringing Order to the Web , 1999, WWW 1999.

[5]  Andrei Z. Broder,et al.  Efficient URL caching for world wide web crawling , 2003, WWW '03.

[6]  David Carmel,et al.  The connectivity sonar: detecting site functionality by structural patterns , 2003, HYPERTEXT '03.

[7]  Marc Najork,et al.  On the evolution of clusters of near-duplicate Web pages , 2003, Proceedings of the IEEE/LEOS 3rd International Conference on Numerical Simulation of Semiconductor Optoelectronic Devices (IEEE Cat. No.03EX726).

[8]  Marc Najork,et al.  A large‐scale study of the evolution of Web pages , 2004, Softw. Pract. Exp..

[9]  Krishna Bharat,et al.  Who links to whom: mining linkage between Web sites , 2001, Proceedings 2001 IEEE International Conference on Data Mining.

[10]  Rajeev Motwani,et al.  Stratified Planning , 2009, IJCAI.

[11]  Geoffrey Zweig,et al.  Syntactic Clustering of the Web , 1997, Comput. Networks.