Enhanced topic distillation using text, markup tags, and hyperlinks

Topic distillation is the analysis of hyperlink graph structure to identify mutually reinforcing authorities (popular pages) and hubs (comprehensive lists of links to authorities). Topic distillation is becoming common in Web search engines, but the best-known algorithms model the Web graph at a coarse grain, with whole pages as single nodes. Such models may lose vital details in the markup tag structure of the pages, and can thus let a tightly linked irrelevant subgraph win over a relatively sparse relevant subgraph, a phenomenon called topic drift or contamination. The problem becomes especially severe in the face of increasingly complex pages with navigation panels and advertisement links. We present an enhanced topic distillation algorithm that analyzes text, the markup tag trees that constitute HTML pages, and hyperlinks between pages. It thereby identifies subtrees that have high text- and hyperlink-based coherence with respect to the query. These subtrees get preferential treatment in the mutual reinforcement process. Using over 50 queries, 28 of them from earlier topic distillation work, we analyzed over 700,000 pages and obtained quantitative and anecdotal evidence that the new algorithm reduces topic drift.
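For orientation, the coarse-grained mutual reinforcement that the abstract refers to can be sketched as a hub/authority iteration over whole-page nodes. This is a minimal illustration of that baseline model only, not the enhanced algorithm presented here: it ignores text and markup tag trees entirely. The function names `hits` and `normalize`, the edge-list representation, and the toy page names are assumptions made for the example.

```python
import math


def normalize(scores):
    """Scale a score dictionary to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {node: v / norm for node, v in scores.items()}


def hits(edges, iterations=50):
    """Hub/authority iteration with whole pages as single nodes.

    edges: list of (source_page, target_page) hyperlinks (hypothetical input format).
    Returns two dictionaries: hub scores and authority scores.
    """
    nodes = sorted({page for edge in edges for page in edge})
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # A page's authority score is the sum of the hub scores of pages linking to it.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # A page's hub score is the sum of the authority scores of pages it links to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return hub, auth


# Toy graph: three hub-like pages pointing at two authority-like pages.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("h3", "a2")]
hub, auth = hits(edges)
```

Because every link on a page contributes equally in this model, a block of navigation or advertisement links counts as much as the links that are relevant to the query, which is the kind of detail the abstract says a page-level model can lose.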
