The robust retrieval track is a new track in TREC 2003. The goal of the track is to improve the consistency of retrieval technology by focusing on poorly performing topics. In addition, the track brings back a classic ad hoc retrieval task to TREC, providing a natural home for new participants.

An important component of effectiveness for commercial retrieval systems is the ability of the system to return reasonable results for every topic. Users remember abject failures: a relatively few such failures cause the user to mistrust the system and discontinue use. Yet the standard retrieval evaluation paradigm, based on averages over sets of topics, does not significantly penalize systems for failed topics. The robust retrieval track looks to improve the consistency of retrieval technology by focusing on poorly performing topics.

The task within the track was a traditional ad hoc task. An ad hoc task in TREC investigates the performance of systems that search a static set of documents using previously unseen topics. For each topic, participants create a query and submit a ranking of the top 1000 documents for that topic. In addition to the standard evaluation by trec_eval, each run was also evaluated using two new effectiveness measures that focus on the effectiveness of the least-well-performing topics.

This paper presents an overview of the results of the track. The first section provides more details regarding the task and defines the new evaluation measures. The following section presents the systems' retrieval results, while Section 3 examines the new evaluation measures. Systems compare differently when evaluated on the new measures than when evaluated on standard measures such as MAP, suggesting that the new measures capture a different aspect of retrieval behavior. However, the new measures are less stable than the traditional measures, and the margin of error associated with them is large relative to the differences in scores observed in the track.

1 The Robust Retrieval Task

As noted above, the task within the robust retrieval track was a traditional ad hoc task. The topic set consisted of a total of 100 topics: 50 old topics taken from TREC topics 301–450 (TRECs 6–8) and 50 new topics. The document collection was the set of documents on TREC disks 4 and 5, minus the Congressional Record, since that is what was used for TRECs 6–8. This document set contains approximately 528,000 documents and 1,904 MB of text.

Since the focus of the track is on poorly performing topics, we wanted to ensure that the test set contained topics that are generally difficult for systems to answer. We could not (purposely) construct a difficult topic set using only new topics since it is notoriously hard to predict a priori whether or not a topic will be difficult [5]. Instead, we used the effectiveness of the retrieval runs in TRECs 6–8 to construct a topic set of known-to-be-difficult topics. For each of topics 301–450, NIST created a box plot of the average precision scores for all runs (both automatic and manual) submitted to the ad hoc task in that topic's TREC. NIST then selected topics with low median average precision scores but with at least one (there was usually more than one) high outlier. The requirement that at least one system do well on the topic was designed to eliminate flawed topics from the topic set. The set of old topics selected for the robust track is given in Figure 1.
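This selection heuristic can be sketched roughly as follows. The sketch is an illustrative reconstruction rather than NIST's actual procedure: the per-topic score table `ap_scores`, the `median_cutoff` value, and the use of the standard box-plot upper whisker (Q3 + 1.5 IQR) as the outlier test are all assumptions.

```python
import numpy as np

def select_difficult_topics(ap_scores, median_cutoff=0.1):
    """Pick topics with a low median average precision across runs
    but at least one run scoring well above the pack (a box-plot outlier).

    ap_scores: dict mapping topic id -> list of average precision scores,
               one per ad hoc run submitted in that topic's TREC.
    median_cutoff: threshold below which a topic counts as "difficult"
                   (illustrative value; the exact criterion NIST used is not stated).
    """
    selected = []
    for topic, scores in ap_scores.items():
        scores = np.asarray(scores)
        median = np.median(scores)
        # Upper whisker of a standard box plot: Q3 + 1.5 * IQR.
        q1, q3 = np.percentile(scores, [25, 75])
        upper_whisker = q3 + 1.5 * (q3 - q1)
        has_high_outlier = scores.max() > upper_whisker
        if median < median_cutoff and has_high_outlier:
            selected.append(topic)
    return selected
```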
Figure 1: The set of old topics used in the robust track.
303 322 344 353 363 378 394 408 426 439
307 325 345 354 367 379 397 409 427 442
310 330 346 355 372 383 399 414 433 443
314 336 347 356 374 389 401 416 435 445
320 341 350 362 375 393 404 419 436 448

While using old topics allowed NIST to construct a test set with certain properties, it also meant that full relevance data for these topics was available to the participants, and that systems were likely developed using those topics. NIST therefore created 50 new topics, numbered 601–650, using the standard topic creation process as a type of control group. Since we could not control how the old topics had been used in the past, the assumption was that the old topics had been fully exploited in any way desired in the construction of a participant's retrieval system. In other words, participants were allowed to explicitly train on the 50 old topics in the test set if they desired. The only restriction placed on the use of relevance data for the 50 old topics was that the relevance judgments could not be used during the processing of the submitted runs. This precluded such things as true (rather than pseudo) relevance feedback and computing weights based on the known relevant set.

The existing relevance judgments were used for the old topics; no new judgments of any kind were made for these topics. The new topics were judged by creating pools from all runs submitted to the track, using the top 125 documents per run. An average of 959 documents was judged for each new topic. The assessors made three-way judgments of not relevant, relevant, or highly relevant for the new topics. Seven of the 50 new topics had no highly relevant documents, and another 14 topics had fewer than 5 highly relevant documents. All the evaluation results reported for the track consider both relevant and highly relevant documents as the relevant set, since there are no highly relevant judgments for the old set. The number of relevant documents per topic for the old topic set ranged from a low of 5 to a high of 361, with an average of 88. For the new topic set, the minimum number of relevant documents was 4, the maximum was 115, and the average was 33.

While no new judgments were made for the old topics, we did form pools for those topics (using the top 100 retrieved per run) to examine the coverage of the original judgment set. Across the set of 50 old topics, an average of 61.4% (minimum 43.2%, maximum 79.7%) of the documents in the pools created using robust track runs were judged. A relatively low percentage of judged documents is to be expected since the old topics were chosen because they were difficult, and there is known to be less overlap among the retrieved sets for difficult topics than for easier topics. Across the 78 runs that were submitted to the track, there was an average of 0.4 unjudged documents in the top 10 documents retrieved and 11.6 unjudged documents in the top 100 retrieved. These averages are inflated by a set of five runs that had very poor effectiveness (a cursory examination confirmed that the poor effectiveness was caused by retrieving documents that were indeed not relevant). Without these five runs, there was an average of 0.2 unjudged documents in the top 10 documents retrieved and 8.7 unjudged documents in the top 100 retrieved.
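The pool-coverage and unjudged-document statistics above are straightforward to compute once the runs and relevance judgments are loaded. A rough sketch, using hypothetical data structures (per-run ranked lists of document ids for a topic, and a set of judged document ids for that topic):

```python
def pool_coverage(ranked_runs, judged, depth=100):
    """Fraction of the depth-`depth` pool for one topic that was judged.

    ranked_runs: list of ranked document-id lists, one per run, for the topic.
    judged: set of document ids judged (relevant or not) for the topic.
    """
    pool = set()
    for ranking in ranked_runs:
        pool.update(ranking[:depth])
    return len(pool & judged) / len(pool)

def unjudged_at(ranking, judged, cutoff=10):
    """Number of unjudged documents among the top `cutoff` retrieved."""
    return sum(1 for doc in ranking[:cutoff] if doc not in judged)
```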
There is still a tendency for poorer runs to have larger numbers of unjudged documents in the retrieved set, but such a bias is expected and is caused by poorer runs retrieving different, genuinely non-relevant documents.

Runs were evaluated using trec_eval, with average scores computed over the set of 50 old topics, the set of 50 new topics, and the combined set of 100 topics. Two additional measures were computed over the same three topic sets.

The first measure was the percentage of topics that retrieved no relevant documents in the top ten retrieved. If one accepts "no relevant documents in the top ten retrieved" as an adequate definition of a poorly performing topic, then this is a direct measure of the behavior of interest and is therefore a very intuitive and easily understood measure. It has the drawback of being a very coarse measure: there are relatively few discrete values the measure can assume in theory, and the actual range of values seen in practice is much smaller than the theoretical range.

The second measure was suggested by Chris Buckley. One of the initial proposals for a measure for the track was to compute the mean of the average precision scores over the system's X worst topics (as measured by average precision), MAP(X), rather than over the entire set of topics as trec_eval does. In an attempt to pick a suitable X (big enough to make the measure stable but small enough to emphasize the poorly performing topics), MAP(X) was plotted as a function of X for several runs. Chris suggested that instead of picking a single point on the curve to use as the measure, the area underneath the MAP(X) vs. X curve be used as the measure. Just as MAP (the area underneath the recall-precision curve) emphasizes high precision but has a recall component, the area under the MAP(X) vs. X curve emphasizes the worst-performing topics but also gives a general measure of quality. The measure as implemented for the track computes the area under the MAP(X) vs. X curve, but limits X to the worst quarter of the topics; that is, X ranges from 1 up to a quarter of the topic set size for both the 50-topic sets and the combined 100-topic set. This measure is not exactly intuitive (it doesn't even have a better name than "area underneath the MAP(X) vs. X curve" yet), but it incorporates much more information than the percentage of topics with no relevant documents in the top 10 retrieved. Note that since the measure is computed over an individual system's worst topics, different systems' scores are in general computed over different sets of topics.

Table 1: Groups participating in the robust track.
Chinese Academy of Sciences (CAS-NLPR)    Tsinghua University (Ma)
Fondazione Ugo Bordoni                    University of Amsterdam
Hummingbird                               University of Glasgow
Johns Hopkins University/APL              University of Illinois at Chicago
Océ Technologies                          University of Illinois at Urbana-Champaign
Queens College, CUNY                      University of Melbourne
Rutgers University (Neu)                  University of Waterloo (MultiText)
Sabir Research, Inc.                      Virginia Tech
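A minimal sketch of the two track measures follows, assuming per-topic average precision scores and counts of relevant documents in the top ten are already available. The function and parameter names are mine, and approximating the "area" by summing unit-width rectangles is one plausible reading of the definition above.

```python
def pct_no_relevant_in_top10(rel_in_top10):
    """Percentage of topics with no relevant document in the top ten retrieved.

    rel_in_top10: dict mapping topic id -> count of relevant docs in top 10.
    """
    topics = list(rel_in_top10)
    failures = sum(1 for t in topics if rel_in_top10[t] == 0)
    return 100.0 * failures / len(topics)

def area_under_map_x(ap_by_topic, max_x=None):
    """Area under the MAP(X) vs. X curve, where MAP(X) is the mean average
    precision over the run's X worst topics (by average precision).

    By default X is limited to the worst quarter of the topics, as in the track.
    """
    aps = sorted(ap_by_topic.values())   # ascending, so worst topics come first
    if max_x is None:
        max_x = len(aps) // 4            # worst quarter of the topic set
    # MAP(X) for X = 1 .. max_x; the "area" is approximated by summing
    # these values (one unit-width rectangle per value of X).
    map_x = [sum(aps[:x]) / x for x in range(1, max_x + 1)]
    return sum(map_x)
```

Because `area_under_map_x` sorts topics by the run's own average precision scores, two runs are in general scored over different subsets of topics, as noted above.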
References

[1] Donna K. Harman, et al. The NRRC reliable information access (RIA) workshop. SIGIR '04, 2004.
[2] Kui-Lam Kwok, et al. TREC 2004 Robust Track Experiments Using PIRCS. TREC, 2004.
[3] Ellen M. Voorhees, et al. Overview of the TREC 2002 Question Answering Track. TREC, 2003.
[4] Stephen E. Robertson, et al. On Collection Size and Retrieval Effectiveness. Information Retrieval, 2004.
[5] Jin Xu, et al. NLPR at TREC 2004: Robust Experiments. TREC, 2004.
[6] Clement T. Yu, et al. UIC at TREC 2004: Robust Track. TREC, 2004.
[7] Ning Yu, et al. WIDIT in TREC 2004 Genomics, Hard, Robust and Web Tracks. TREC, 2004.
[8] Donna K. Harman, et al. Overview of the Sixth Text REtrieval Conference (TREC-6). Inf. Process. Manag., 1997.
[9] Ellen M. Voorhees, et al. The effect of topic set size on retrieval experiment error. SIGIR '02, 2002.
[10] Christine D. Piatko, et al. JHU/APL at TREC 2004: Robust and Terabyte Tracks. TREC, 2004.
[11] Justin Zobel, et al. How reliable are the results of large-scale information retrieval experiments? SIGIR '98, 1998.
[12] Elad Yom-Tov, et al. Juru at TREC 2004: Experiments with Prediction of Query Difficulty. TREC, 2004.
[13] W. Bruce Croft, et al. Predicting query performance. SIGIR '02, 2002.
[14] Chris Buckley. Looking at Limits and Tradeoffs: Sabir Research at TREC 2005. TREC, 2005.
[15] Chris Buckley. Why current IR engines fail. SIGIR '04, 2004.
[16] Iadh Ounis, et al. University of Glasgow at TREC 2004: Experiments in Web, Robust, and Terabyte Tracks with Terrier. TREC, 2004.
[17] Ellen M. Voorhees, et al. Evaluating evaluation measure stability. SIGIR '00, 2000.
[18] Claudio Carpineto, et al. Fondazione Ugo Bordoni at TREC 2004. TREC, 2004.
[19] Elad Yom-Tov, et al. Juru at TREC 2005: Query Prediction in the Terabyte and the Robust Tracks. TREC, 2005.