Web spam filtering in internet archives

Web spam targets the high commercial value of top-ranked search-engine results, but Web archives suffer quality deterioration and wasted resources as a side effect. Web archivists rarely use Web spam filtering technologies so far, although a survey with responses from more than 20 institutions worldwide indicates that they plan to do so. These archives typically operate on modest budgets that rule out running standalone Web spam filtering, but collaborative efforts could give them a high-quality solution. In this paper we illustrate the spam filtering needs, opportunities and blockers for Internet archives by analyzing several crawl snapshots, and we show the difficulty of migrating filter models across crawls using the example of the 13 .uk snapshots performed by UbiCrawler, which include WEBSPAM-UK2006 and WEBSPAM-UK2007.
