Collection Selection Based on Historical Performance for Efficient Processing

A Grid Information Retrieval (GIR) simulation was used to process the TREC Million Query Track queries. The GOV2 collection was partitioned by hostname, and the hosts were ranked by quality using each host's aggregate performance, measured by qrel counts from past TREC Terabyte Tracks. Only the highest-quality hosts were included in the GIR simulation; the selected hosts hold less than 20% of all GOV2 documents. The retrieval performance of the GIR simulation, measured by topic-averaged AP, b-pref, and Rel@10 over the TREC Terabyte Track topics, is within one standard deviation of the respective topic-averaged TREC Million Query participant median scores. The estimated AP of the Million Query topic results is comparable to the topic-averaged AP of the Terabyte Track topic results.
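The host-selection step described above can be illustrated with a minimal sketch. It rests on assumptions the abstract does not confirm: "qrel counts" is taken to mean the number of judged-relevant documents per host, and hosts are added greedily in rank order while the selection stays under a 20% document budget. The function names (`host_of`, `rank_hosts`, `select_hosts`) and the input shapes are hypothetical and chosen only for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased hostname of a URL ('' if none)."""
    return (urlsplit(url).hostname or "").lower()


def rank_hosts(qrels, doc_urls):
    """Rank hosts by their count of judged-relevant documents.

    qrels: iterable of (topic_id, doc_id, relevance) tuples.
    doc_urls: dict mapping doc_id -> URL.
    Assumption: a host's quality is its number of relevant judgments.
    """
    relevant = Counter()
    for _topic, doc_id, rel in qrels:
        if rel > 0 and doc_id in doc_urls:
            relevant[host_of(doc_urls[doc_id])] += 1
    # Highest count first; hostname breaks ties deterministically.
    return sorted(relevant.items(), key=lambda kv: (-kv[1], kv[0]))


def select_hosts(ranked, host_sizes, total_docs, budget=0.20):
    """Greedily keep ranked hosts while staying under the budget.

    Assumption: a host that would push the selection to or past
    budget * total_docs is skipped, and smaller hosts may still fit.
    """
    selected, used = [], 0
    for host, _count in ranked:
        size = host_sizes.get(host, 0)
        if used + size >= budget * total_docs:
            continue
        selected.append(host)
        used += size
    return selected, used
```

With toy inputs, `rank_hosts` orders hosts by relevant-judgment count and `select_hosts` returns the hosts that fit within the budget together with the number of documents they hold. The actual ranking measure and cutoff used in the study may differ from this sketch.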
