Using Virtual Document for NTCIR-4 Web Information Retrieval Task

The Web is a large collection of heterogeneous pages, and Web documents are not always descriptive or accurate in their content. In addition, a significant difference between Web search and traditional text search is the availability of hyperlinks between pages: a page on the Web may cite, or be cited by, other pages. When evaluating a page, its neighborhood can therefore be treated as part of the input. In this paper, in addition to the explicit information unit (page content), we introduce a new information unit in our systems, the virtual document, which is built mainly from the anchor text of inbound links to a page together with the page's title. We analyzed the utility of virtual documents for Web searching. Three search functions based on virtual documents are proposed in our study:
† We propose a way to weight query terms through term entropy in the virtual document collection space.