Distributed web search efficiency by truncating results

The TREC GOV2 collection, a large set of Web documents, is drawn from many separate Internet hosts, such as www.nih.gov and travel.state.gov. The number of Web pages (i.e., documents) per host varies considerably. In this paper, we present and evaluate a method for setting a maximum number of "hits" that may be presented for each Web host. Federated search environments are increasingly common components of digital libraries. In these environments, such a maximum can reduce the number of possibly relevant documents that each subcollection presents, without hurting early precision measures such as P@20. The maximum is proportional to the subcollection size but is not sensitive to the search topic; we derive it from an analysis of patterns of relevance judgments across approximately 17,000 Web hosts in GOV2.
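As a rough illustration of the idea, the sketch below caps the hits kept from each host at a value proportional to that host's page count. It is not the paper's actual derivation: the abstract does not state the proportionality constant, so `docs_per_hit`, `floor`, `default_cap`, and both function names are hypothetical choices made only for this example.

```python
from collections import defaultdict


def host_caps(host_sizes, docs_per_hit=1000, floor=1):
    """Per-host cap proportional to host size (assumed rule, not from the paper).

    One hit is allowed per `docs_per_hit` pages, and never fewer than `floor`.
    """
    return {host: max(floor, size // docs_per_hit)
            for host, size in host_sizes.items()}


def truncate_by_host(ranked_hits, caps, default_cap=1):
    """Walk the ranked list in order and drop hits beyond each host's cap."""
    kept = []
    seen = defaultdict(int)  # hits already kept, per host
    for host, doc_id in ranked_hits:
        if seen[host] < caps.get(host, default_cap):
            kept.append((host, doc_id))
            seen[host] += 1
    return kept


# Hypothetical host sizes and a hypothetical ranked result list.
sizes = {"www.nih.gov": 3000, "travel.state.gov": 500}
caps = host_caps(sizes)
hits = [
    ("www.nih.gov", "d1"), ("travel.state.gov", "d2"), ("www.nih.gov", "d3"),
    ("www.nih.gov", "d4"), ("travel.state.gov", "d5"), ("www.nih.gov", "d6"),
    ("www.nih.gov", "d7"),
]
truncated = truncate_by_host(hits, caps)
```

Because the cap depends only on host size, it can be computed once per subcollection and reused for every query, which matches the abstract's point that the maximum is not sensitive to the search topic.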
