Our system used an empirical method for estimating term weights directly from relevance judgements, avoiding various standard but potentially troublesome assumptions. It is common to assume, for example, that weights vary with term frequency (tf) and inverse document frequency (idf) in a particular way, e.g., tf × idf, but the fact that there are so many variants of this formula in the literature suggests that there remains considerable uncertainty about these assumptions. Our method is a kind of regression method in which labeled relevance judgements are fit as a linear combination of (transforms of) tf, idf, etc. Training methods not only improve performance, but also extend naturally to include an additional factor: burstiness. The proposed histogram-based training method provides a simple way to model complicated interactions among factors such as tf and idf.
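The histogram-based idea above can be sketched as follows: bin labeled term observations by transforms of tf and df, then estimate each bin's weight as the smoothed log-odds of relevance within that bin, rather than assuming a closed-form tf × idf formula. This is a minimal illustrative sketch, not the paper's exact procedure; the binning scheme (rounded log counts) and the smoothing constant are assumptions chosen for simplicity.

```python
import math
from collections import defaultdict

def empirical_weights(observations, smoothing=0.5):
    """Estimate term weights empirically from relevance judgements.

    observations: iterable of (tf, df, relevant) triples, where tf is
    the term's frequency in the document, df its document frequency,
    and relevant a boolean relevance judgement.

    Returns a dict mapping a (tf-bin, df-bin) pair to the smoothed
    log-odds of relevance observed in that bin.  Binning by rounded
    log counts is a hypothetical choice, not the paper's.
    """
    rel = defaultdict(int)   # relevant counts per bin
    non = defaultdict(int)   # non-relevant counts per bin
    for tf, df, relevant in observations:
        bin_key = (round(math.log(tf + 1)), round(math.log(df + 1)))
        if relevant:
            rel[bin_key] += 1
        else:
            non[bin_key] += 1
    weights = {}
    for b in set(rel) | set(non):
        # Smoothed empirical log-odds of relevance for this bin.
        weights[b] = math.log((rel[b] + smoothing) / (non[b] + smoothing))
    return weights

# Toy usage: rare terms (low df) relevant more often than common ones.
obs = ([(2, 5, True)] * 8 + [(2, 5, False)] * 2 +
       [(2, 500, True)] * 2 + [(2, 500, False)] * 8)
weights = empirical_weights(obs)
```

Because the weight for each bin is read directly off the training histogram, interactions between tf and df are captured without committing to any particular functional form.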