Using Metadata to Enhance a Web Information Gathering System

With the web at close to a billion pages and growing at an exponential rate, we are faced with the issue of rating pages in terms of quality and trust. In this situation, what other pages say about a web page can be as important as what the page says about itself. The cumulative knowledge of these types of recommendations (or the lack thereof) can be objective enough to help a user or robot program to decide whether or not to pursue a web document. In addition, these annotations or metadata can be used by a web robot program to derive summary information about web documents that are written in a language that the robot does not understand. We use this idea to drive a web information gathering system that forms the core of a topic-specific search engine. In this paper, we describe how our system uses annotations about the hyperlinks contained in web pages to guide itself to crawl the web. It sifts through useful information related to a particular topic to eliminate the traversal of links that may not be of interest. Thus, the guided crawling system stays focused on the target topic. It builds a rich repository of link information that includes annotations. This repository is used to build quality metadata, which ultimately serves a search engine.
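The guided crawl described above can be illustrated with a minimal sketch. It is not the system from the paper: the in-memory `TOY_WEB` graph, the `relevance` word-overlap score, the `threshold` parameter, and the function names are all assumptions made for illustration. The sketch treats the anchor text of each hyperlink as its annotation, follows only links whose annotation overlaps the topic terms, and records every annotation it sees in a repository keyed by the target page.

```python
import heapq
from collections import defaultdict

# Hypothetical in-memory "web": URL -> list of (target URL, anchor text) pairs.
# A real crawler would fetch and parse pages instead.
TOY_WEB = {
    "seed": [("a", "XML parser tutorial"), ("b", "holiday photos")],
    "a": [("c", "XML schema tools"), ("b", "my vacation")],
    "b": [("d", "more photos")],
    "c": [],
    "d": [],
}


def relevance(annotation, topic_terms):
    """Fraction of topic terms that appear in the annotation text (assumed score)."""
    words = set(annotation.lower().split())
    return len(words & topic_terms) / len(topic_terms)


def guided_crawl(web, seed, topic_terms, threshold=0.0):
    """Visit pages in order of annotation relevance and prune off-topic links."""
    topic_terms = {t.lower() for t in topic_terms}
    repository = defaultdict(list)  # target URL -> annotations seen for it
    frontier = [(-1.0, seed)]       # heapq is a min-heap, so scores are negated
    seen = {seed}
    visited = []
    while frontier:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for target, anchor in web.get(url, []):
            # Record the annotation even if the link is later pruned.
            repository[target].append(anchor)
            score = relevance(anchor, topic_terms)
            if score > threshold and target not in seen:
                seen.add(target)
                heapq.heappush(frontier, (-score, target))
    return visited, dict(repository)


visited, repository = guided_crawl(TOY_WEB, "seed", {"xml", "parser", "schema"})
print(visited)
print(repository)
```

In this toy graph, page `b` is never traversed because neither annotation pointing to it mentions a topic term, so page `d` is never discovered. The annotations for `b` are still kept in the repository, which loosely mirrors the idea of building metadata about pages from what other pages say about them.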
