As part of the TREC 2005 Terabyte track, we conducted a range of experiments investigating the effects of larger collections. Our main findings can be summarized as follows. First, we tested whether our retrieval system scales to terabyte-sized collections. We found that it can handle 25 million documents, although in terms of indexing time we are approaching the limits of a non-distributed retrieval system. Second, we hoped to find out whether results from earlier Web Tracks carry over to this task. For known-item search we found that, on the one hand, indegree and URL priors did not improve retrieval effectiveness, but that, on the other hand, combining different document representations did. Third, we investigated the role of smoothing for collections of this size. We found that larger collections require far less smoothing; for the ad hoc task in particular, using very little smoothing leads to substantial gains in retrieval effectiveness.
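To make the smoothing and combination findings concrete, the following is a minimal sketch, in Python, of query-likelihood scoring with Jelinek-Mercer smoothing over a linear mixture of document representations. It is an illustration only, not the implementation used in our runs; the parameter lam, the mixture weights, and the toy representation names ("title", "body") are assumptions made for the example. With lam close to 1, almost all weight sits on the document model, which corresponds to very little smoothing.

import math

def smoothed_term_prob(t, tf, dlen, coll_tf, coll_len, lam):
    # Jelinek-Mercer smoothing: interpolate the document language model
    # with the collection language model. A lam close to 1 puts almost
    # all weight on the document model, i.e. very little smoothing.
    p_doc = tf.get(t, 0) / dlen if dlen else 0.0
    p_coll = coll_tf.get(t, 0) / coll_len
    return lam * p_doc + (1.0 - lam) * p_coll

def score(query, reps, coll_tf, coll_len, weights, lam=0.9):
    # reps maps a representation name (e.g. "title", "body") to a
    # (term-frequency dict, representation length) pair; weights gives
    # the mixture weight of each representation. Term probabilities are
    # interpolated across representations before taking logs.
    total = 0.0
    for t in query:
        p = sum(w * smoothed_term_prob(t, *reps[r], coll_tf, coll_len, lam)
                for r, w in weights.items())
        if p > 0.0:
            total += math.log(p)
    return total

# Toy usage: one document with two representations.
reps = {
    "title": ({"terabyte": 1}, 2),
    "body": ({"terabyte": 3, "track": 2}, 100),
}
coll_tf = {"terabyte": 50, "track": 400}
print(score(["terabyte", "track"], reps, coll_tf, coll_len=1_000_000,
            weights={"title": 0.3, "body": 0.7}))

Sweeping lam in such a setup is one way to observe the effect reported above: on large collections, effectiveness for ad hoc topics tends to peak at high lam, i.e. with very little mass given to the collection model.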