Web spam: a survey with vision for the archivist

Web spam, a side effect of the high commercial value of top-ranked search-engine results, endangers the quality of Web archives, yet Web archivists have so far rarely used Web spam filtering technologies. In this paper we make the first attempt to disseminate existing methodology and envision a solution that lets Web archives share knowledge and unite their efforts in Web spam hunting. We survey the state of the art in Web spam filtering, illustrated by the recent Web spam challenge data sets and techniques, and describe the filtering solution for archives envisioned in the LiWA (Living Web Archives) project.
