Abstract: The IRRA (IR-Ra) group participated in the 2012 Web track with a system implementing a non-parametric term weighting method based on measuring the divergence from independence (DFI). This is the group's third year of participation, following the TREC 2009 and 2010 Web tracks. This year, the aim is to evaluate a new DFI-based term weighting model developed on the basis of Shannon's information theory (Shannon, 1949), along with a heuristic approach that is expected to improve early precision when used together with DFI term weighting. The TERRIER retrieval platform, version 3.0 (Ounis et al., 2007), is used to index and search the ClueWeb09-T09B1 data set (the Category B data set), a subset of about 50 million Web pages in English. During indexing and searching, terms are stemmed (Porter's stemmer as implemented in TERRIER), but stopwords are not removed. The result sets are filtered using the fusion of two spam-page lists provided by Cormack et al. (2010) for the ClueWeb09 document collection.
[1] C. J. van Rijsbergen, et al. Probabilistic models of information retrieval based on measuring the divergence from randomness. TOIS, 2002.
[2] Yoichiro Takada. On the mathematical theory of communication. 1954.
[3] Charles L. A. Clarke, et al. Efficient and effective spam filtering and re-ranking for large web datasets. Information Retrieval, 2010.
[4] Tao Tao, et al. A formal study of information retrieval heuristics. SIGIR '04, 2004.
[5] Sang Joon Kim, et al. A mathematical theory of communication. 2006.
[6] Éric Gaussier, et al. Information-based models for ad hoc IR. SIGIR '10, 2010.