Paper information - Using the Structure of HTML Documents to Improve Retrieval

The World Wide Web (WWW) is a gigantic information resource that grows daily. As more data are added to the WWW, it is becoming increasingly difficult to locate useful information in this environment. In this paper, we propose a method that uses the structure and hyperlinks of HTML documents to improve the effectiveness of retrieving HTML documents. Our study assigns the occurrences of terms in a document collection to six classes according to the tags in which a particular term appears (such as Title, H1-H6, and Anchor). Based on this assignment, we extend the weighting schemes of traditional information retrieval by applying different importance factors to terms in different classes. The rationale is that terms appearing in different places in a document may differ in how well they identify the document. For this research we built a Web-based search tool, Webor, created a testbed, and conducted extensive experiments to determine an optimal combination of class importance factors. Our study indicates that this technique can substantially improve retrieval effectiveness.
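The class-based weighting idea described in the abstract can be illustrated with a minimal sketch. It is not the Webor implementation: only three of the six classes (Title, H1-H6, Anchor) are named in the abstract, so the sketch adds a catch-all `plain` class as an assumption. The importance factors in `CLASS_WEIGHTS`, the class names, and the function `weighted_term_frequencies` are hypothetical choices made for illustration. The paper determines its own factor combination experimentally.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical importance factors per class (assumed values, not the paper's).
CLASS_WEIGHTS = {"title": 4.0, "header": 3.0, "anchor": 2.0, "plain": 1.0}

# Map HTML tags to term classes; untracked tags fall into the "plain" class.
TAG_TO_CLASS = {"title": "title", "a": "anchor"}
TAG_TO_CLASS.update({f"h{i}": "header" for i in range(1, 7)})


class ClassTermCounter(HTMLParser):
    """Counts term occurrences per class, based on the innermost tracked tag."""

    def __init__(self):
        super().__init__()
        self.stack = []  # classes of the tracked tags that are currently open
        self.counts = {cls: Counter() for cls in CLASS_WEIGHTS}

    def handle_starttag(self, tag, attrs):
        if tag in TAG_TO_CLASS:
            self.stack.append(TAG_TO_CLASS[tag])

    def handle_endtag(self, tag):
        cls = TAG_TO_CLASS.get(tag)
        if cls is not None and cls in self.stack:
            # Close the most recently opened tag of this class.
            idx = len(self.stack) - 1 - self.stack[::-1].index(cls)
            del self.stack[idx]

    def handle_data(self, data):
        cls = self.stack[-1] if self.stack else "plain"
        for term in re.findall(r"[a-z0-9]+", data.lower()):
            self.counts[cls][term] += 1


def weighted_term_frequencies(html):
    """Return term -> sum over classes of (importance factor * occurrences)."""
    parser = ClassTermCounter()
    parser.feed(html)
    parser.close()  # flush any buffered text
    scores = Counter()
    for cls, counter in parser.counts.items():
        for term, tf in counter.items():
            scores[term] += CLASS_WEIGHTS[cls] * tf
    return dict(scores)


if __name__ == "__main__":
    doc = (
        "<html><head><title>Web retrieval</title></head>"
        "<body><h1>Retrieval</h1>"
        "<p>web search <a href='x'>retrieval</a></p></body></html>"
    )
    print(weighted_term_frequencies(doc))
```

In this sketch a term's class-weighted frequency replaces the plain term frequency. A full weighting scheme would typically combine it with a collection-level factor such as inverse document frequency, which is omitted here.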
