Server selection methods in hybrid portal search

The TREC .GOV collection makes a valuable web testbed for distributed information retrieval methods because it is naturally partitioned and includes 725 web-oriented queries with judged answers. It can usefully model aspects of government and large corporate portals. Analysis of the .GOV data shows that a purely distributed approach would not be feasible for providing search on a .gov portal, because of the large number of web sites (17,000+) and the high proportion of them that do not provide a search interface. An alternative hybrid approach, combining distributed and centralized techniques, is proposed, and server selection methods are evaluated within this framework using web-oriented evaluation methodology. A number of well-known algorithms are compared against two representatives, highest anchor ranked page (HARP) and anchor weighted sum (AWSUM), of a family of new selection methods. These methods use link anchortext extracted from an auxiliary crawl to provide descriptions of sites which are not themselves crawled. Of the previously published methods, ReDDE substantially outperformed three variants of CORI, and also outperformed an extended method based on Kullback-Leibler divergence except on topic distillation. HARP and AWSUM performed best overall but were outperformed on the topic distillation task by the extended KL divergence method.
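The abstract names HARP and AWSUM without defining them, so the following Python sketch is only one plausible reading of the two names, not the authors' implementation. It assumes a query has already been run against an anchortext index and returned a list of (URL, score) pairs in descending score order; the function names, the grouping of pages by host name, and the sample data are all hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def host_of(url):
    """Treat the host name of a URL as its server identifier (an assumption)."""
    return urlparse(url).netloc.lower()


def harp(ranked_results):
    """Order servers by the rank of their highest-ranked page.

    ranked_results must already be sorted by descending anchortext score.
    """
    best_rank = {}
    for rank, (url, _score) in enumerate(ranked_results):
        # setdefault keeps the first (best) rank seen for each server.
        best_rank.setdefault(host_of(url), rank)
    return sorted(best_rank, key=best_rank.get)


def awsum(ranked_results):
    """Order servers by the sum of the anchortext scores of their pages."""
    totals = defaultdict(float)
    for url, score in ranked_results:
        totals[host_of(url)] += score
    # Highest total first; ties broken alphabetically for determinism.
    return sorted(totals, key=lambda h: (-totals[h], h))


# Hypothetical anchortext-index results for one query.
results = [
    ("http://www.nasa.gov/missions", 9.0),
    ("http://www.epa.gov/air", 6.0),
    ("http://www.epa.gov/water", 5.0),
    ("http://www.usda.gov/", 2.0),
]

print(harp(results))
print(awsum(results))
```

On this sample the two orderings differ: the first favours the server holding the single best page, while the second favours the server whose pages accumulate the most anchortext weight.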
