WIDIT in TREC-2003 Web Track

The Web IR experiment of TREC, otherwise known as the Web track, initially investigated strategies for the same ad-hoc retrieval task that had previously been performed with plain text documents. Although many TREC participants explored methods of leveraging non-textual sources of information such as hyperlinks and document structure, the general consensus among the early Web track participants was that link analysis and other non-textual methods did not perform as well as the content-based retrieval methods fine-tuned over the years (Hawking et al., 1999; Hawking et al., 2000; Gurrin & Smeaton, 2001; Savoy & Rasolofo, 2001). There has been much speculation as to why link analysis, which showed much promise in previous research and has been so readily embraced by commercial Web search engines, did not prove useful in Web track experiments. Most such speculation points to potential problems with the Web track’s earlier test collections: the inadequate link structure of truncated Web data (Savoy & Picard, 1998; Singhal & Kaszkiel, 2001); relevance judgments that penalize link analysis by not counting hub pages as relevant (Voorhees & Harman, 2000) and boost content analysis by counting multiple relevant pages from the same site as relevant (Singhal & Kaszkiel, 2001); and unrealistic queries that are too detailed and specific to be representative of real-world Web searches (Singhal & Kaszkiel, 2001). In an effort to address the criticism and problems associated with the early Web track experiments, TREC abandoned the ad-hoc Web retrieval task in 2002 in favor of the topic distillation and named page finding tasks, and replaced its earlier Web test collection of randomly selected Web pages with a larger and potentially higher quality domain-specific collection.
The topic distillation task in TREC-2002 is described as finding a short, comprehensive list of pages that are good information resources, and the named page finding task is described as finding a specific page whose name is described by the query (Hawking & Craswell, 2002; Craswell & Hawking, 2003). Adjustment of the Web track environment brought forth renewed interest in retrieval approaches that leverage Web-specific sources of evidence such as link structure and document structure. For the home page finding task, where the objective is to find the entry page of a specific site described by the query, a Web page’s URL characteristics, such as its type and length, as well as the anchor text of its inlinks, proved to be useful sources of information (Hawking & Craswell, 2002). In the named page finding task, which is similar to the home page finding task except that the target page described by the query is not necessarily the entry point of a Web site but may be any specific page on the Web, the use of anchor text still proved to be an effective strategy, but the use of URL characteristics did not work as well as it did in the home page finding task (Craswell & Hawking, 2003). In the topic distillation task, anchor text still seemed to be a useful resource, especially as a means to boost the performance of content-based methods via fusion (i.e. result merging), although its usefulness fell well below the level achieved in the named page finding task (Hawking & Craswell, 2002; Craswell & Hawking, 2003). Various site compression strategies, which attempt to select the “best” pages of a given site, were another common theme in the topic distillation task, once again demonstrating the importance of fine-tuning the retrieval system according to the task at hand (Amitay et al., 2003; Zhang et al., 2003). It is interesting to note that link analysis (e.g. PageRank, HITS variations) has not yet proven itself to be an effective strategy, and content-based retrieval still seems to be the dominant factor in the Web track. In fact, the two best results in the TREC-2002 topic distillation task were achieved by baseline systems that used only content-based methods (MacFarlane, 2003; Zhang et al., 2003).
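As an illustration of the fusion (result merging) idea mentioned above, the following is a minimal sketch of how a content-based run and an anchor-text run might be combined with a weighted sum of min-max normalized scores. It is not taken from any Web track system: the function names, the `anchor_weight` parameter, its default of 0.3, and the sample scores are all hypothetical choices made for this example.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one run's scores to [0, 1] so that runs are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores are equal: treat every document as equally strong.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def weighted_sum_fusion(
    content: Dict[str, float],
    anchor: Dict[str, float],
    anchor_weight: float = 0.3,  # hypothetical weight, not from the source
) -> List[Tuple[str, float]]:
    """Merge two runs; a document missing from a run contributes 0 for it."""
    content_n = min_max_normalize(content)
    anchor_n = min_max_normalize(anchor)
    fused = {}
    for doc in set(content_n) | set(anchor_n):
        fused[doc] = (
            (1.0 - anchor_weight) * content_n.get(doc, 0.0)
            + anchor_weight * anchor_n.get(doc, 0.0)
        )
    # Highest fused score first; ties are broken by document id.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical retrieval scores keyed by document id.
content_run = {"d1": 2.0, "d2": 1.0, "d3": 0.0}
anchor_run = {"d2": 5.0, "d4": 1.0}

for doc, score in weighted_sum_fusion(content_run, anchor_run):
    print(doc, round(score, 3))
```

Raising `anchor_weight` shifts the ranking toward pages with strong anchor-text evidence, which is one simple way to tune such a combination per task; how much weight is appropriate would have to be determined empirically.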

Dan E. Albertson | Kiduk Yang

[1] Amit Singhal, et al. A case study in web search using TREC algorithms, 2001, WWW '01.

[2] Chris Buckley, et al. Using Query Zoning and Correlation Within SMART: TREC 5, 1996, TREC.

[3] Garrison W. Cottrell, et al. Automatic combination of multiple ranked retrieval systems, 1994, SIGIR '94.

[4] James Allan, et al. Automatic Query Expansion Using SMART: TREC 3, 1994, TREC.

[5] Jong-Hak Lee, et al. Analyses of multiple evidence combination, 1997, SIGIR '97.

[6] Yiqun Liu, et al. THU TREC2002 Web Track Experiments, 2002.

[7] Kiduk Yang. Combining Text- and Link-based Retrieval Methods for Web IR, 2001, TREC.

[8] Chris Buckley, et al. Pivoted Document Length Normalization, 1996, SIGIR Forum.

[9] Peter Bailey, et al. Overview of the TREC-8 Web Track, 2000, TREC.

[10] Ricardo Baeza-Yates, et al. Information Retrieval: Data Structures and Algorithms, 1992.

[11] David Carmel, et al. Topic Distillation with Knowledge Agents, 2002, TREC.