Topic distillation on hierarchically categorized Web documents

As an alternative to keyword search, many search engines provide directory servers containing categorized Web documents for users to navigate and browse. Given a large collection of categorized Web documents, we investigate three issues in portal site construction: (1) distillation of the important topics for each category of documents; (2) distillation of the important documents and sites for these topics; and (3) automation of these two tasks. We have developed an automated technique for topic and Web site distillation that integrates Web document content analysis with link structure analysis. The technique considers both the local importance of keywords and their global distribution statistics over a given Web document category hierarchy.
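The idea of combining local keyword importance with global distribution statistics can be illustrated with a minimal sketch. This is not the technique developed in this work: it assumes a TF-IDF-style weighting computed over a flat set of categories, ignores the hierarchy and the link structure analysis entirely, and uses hypothetical names (`distill_topics`, `category_docs`, `top_k`) and toy data.

```python
import math
from collections import Counter


def distill_topics(category_docs, top_k=3):
    """Rank keywords per category (illustrative assumption, not the paper's formula).

    category_docs maps a category name to a list of tokenized documents.
    """
    # Local importance: how often each term occurs within a category.
    local = {
        cat: Counter(tok for doc in docs for tok in doc)
        for cat, docs in category_docs.items()
    }

    # Global distribution: in how many categories each term appears.
    cat_freq = Counter()
    for counts in local.values():
        cat_freq.update(counts.keys())

    n_categories = len(category_docs)
    topics = {}
    for cat, counts in local.items():
        total = sum(counts.values()) or 1
        scores = {
            term: (count / total) * math.log(n_categories / cat_freq[term])
            for term, count in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        topics[cat] = ranked[:top_k]
    return topics


# Toy data, invented for illustration only.
categories = {
    "Sports": [["football", "score", "news"], ["football", "league", "news"]],
    "Finance": [["stock", "market", "news"], ["stock", "price", "news"]],
}
print(distill_topics(categories, top_k=1))
```

In this toy input, "news" is frequent in both categories, so its global weight is zero and it is never selected as a topic, while terms concentrated in a single category rank highest.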
