Search Engine-Crawler Symbiosis

Web crawlers have been used for nearly a decade as a search engine component to create and update large collections of documents. Typically, the crawler and the rest of the search engine are not closely integrated. If the purpose of a search engine is to build as large a collection as possible to serve the general Web community, close integration may not be necessary. However, a search engine that caters to a specific community with shared, focused interests can take advantage of such integration. In this paper we investigate a tightly coupled system in which the crawler and the search engine engage in a symbiotic relationship: the crawler feeds the search engine, and the search engine in turn helps the crawler improve its performance. We show that the symbiosis can help the system learn about a community’s interests and serve that community with better focus. The search engine-crawler symbiosis is a first step toward a more general model in which we envision truly distributed, collaborative search among Web peers.
