Paper Information - Adversarial Web Search

Adversarial Web Search

Web search engines have become indispensable tools for finding content. As the popularity of the Web has increased, the efforts to exploit the Web for commercial, social, or political advantage have grown, making it harder for search engines to discriminate between truthful signals of content quality and deceptive attempts to game search engines' rankings. This problem is further complicated by the open nature of the Web, which allows anyone to write and publish anything, and by the fact that search engines must analyze ever-growing numbers of Web pages. Moreover, increasing expectations of users, who over time rely on Web search for information needs related to more aspects of their lives, further deepen the need for search engines to develop effective counter-measures against deception. In this monograph, we consider the effects of the adversarial relationship between search systems and those who wish to manipulate them, a field known as "Adversarial Information Retrieval". We show that search engine spammers create false content and misleading links to lure unsuspecting visitors to pages filled with advertisements or malware. We also examine work over the past decade or so that aims to discover such spamming activities to get spam pages removed or their effect on the quality of the results reduced. Research in Adversarial Information Retrieval has been evolving over time, and currently continues both in traditional areas (e.g., link spam) and newer areas, such as click fraud and spam in social media, demonstrating that this conflict is far from over.

Brian D. Davison | Carlos Castillo
