Accelerated focused crawling through online relevance feedback

The organization of HTML into a tag tree structure, which is rendered by browsers as roughly rectangular regions with embedded text and HREF links, greatly helps surfers locate and click on links that best satisfy their information need. Can an automatic program emulate this human behavior and thereby learn to predict the relevance of an unseen HREF target page w.r.t. an information need, based on information limited to the HREF source page? Such a capability would be of great interest in focused crawling and resource discovery, because it can fine-tune the priority of unvisited URLs in the crawl frontier, and reduce the number of irrelevant pages which are fetched and discarded.

[1]  Hector Garcia-Molina,et al.  Efficient Crawling Through URL Ordering , 1998, Comput. Networks.

[2]  Martin van den Berg,et al.  Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery , 1999, Comput. Networks.

[3]  Philip S. Yu,et al.  Intelligent crawling on the World Wide Web with arbitrary predicates , 2001, WWW '01.

[4]  Avrim Blum,et al.  The Bottleneck , 2021, Monopsony Capitalism.

[5]  Eli Upfal,et al.  Stochastic models for the Web graph , 2000, Proceedings 41st Annual Symposium on Foundations of Computer Science.

[6]  Reinier Post,et al.  Information Retrieval in the World-Wide Web: Making Client-Based Searching Feasible , 1994, Comput. Networks ISDN Syst..

[7]  Andrew McCallum,et al.  A comparison of event models for naive bayes text classification , 1998, AAAI 1998.

[8]  Marco Gori,et al.  Focused Crawling Using Context Graphs , 2000, VLDB.

[9]  Piotr Indyk,et al.  Enhanced hypertext categorization using hyperlinks , 1998, SIGMOD '98.

[10]  Ravi Kumar,et al.  Topic Distillation and Spectral Filtering , 1999, Artificial Intelligence Review.

[11]  Andrew McCallum,et al.  Using Reinforcement Learning to Spider the Web Efficiently , 1999, ICML.

[12]  Soumen Chakrabarti,et al.  Integrating the document object model with hyperlinks for enhanced topic distillation and information extraction , 2001, WWW '01.

[13]  Michael McGill,et al.  Introduction to Modern Information Retrieval , 1983 .

[14]  Dr P M E De Bra Searching for Arbitrary Information in the WWW : the Fish − Search for Mosaic , 1994 .

[15]  T. Joachims WebWatcher : A Tour Guide for the World Wide Web , 1997 .

[16]  Filippo Menczer Links tell us about lexical and semantic Web content , 2001, ArXiv.

[17]  Henry Lieberman,et al.  Exploring the Web with reconnaissance agents , 2001, Commun. ACM.

[18]  Brian D. Davison Topical locality in the Web , 2000, SIGIR '00.

[19]  Filippo Menczer,et al.  Evaluating topic-driven web crawlers , 2001, SIGIR '01.

[20]  Jon M. Kleinberg,et al.  Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text , 1998, Comput. Networks.

[21]  Henry Lieberman,et al.  Letizia: An Agent That Assists Web Browsing , 1995, IJCAI.

[22]  Luís Torgo,et al.  Regression by Classification , 1996, SBIA.

[23]  Yoelle Maarek,et al.  The Shark-Search Algorithm. An Application: Tailored Web Site Mapping , 1998, Comput. Networks.

[24]  Sougata Mukherjea,et al.  WTMS: a system for collecting and analyzing topic-specific Web information , 2000, Comput. Networks.

[25]  Filippo Menczer,et al.  Adaptive Retrieval Agents: Internalizing Local Context and Scaling up to the Web , 2000, Machine Learning.

[26]  Sholom M. Weiss,et al.  Automated learning of decision rules for text categorization , 1994, TOIS.

[27]  David Yarowsky,et al.  A method for disambiguating word senses in a large corpus , 1992, Comput. Humanit..

[28]  Thomas G. Dietterich What is machine learning? , 2020, Archives of Disease in Childhood.

[29]  Prabhakar Raghavan,et al.  Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies , 1998, The VLDB Journal.