The proximity of query terms within a document is an important relevance criterion in information retrieval (IR). However, it has not been investigated which term sequences are most useful for measuring proximity. In this study, we test the effectiveness of using the proximity of partial query term sequences (n-grams) for Web search. We observe that the proximity of sequences of 3 to 5 terms is most effective for long queries, while shorter or longer sequences appear less useful. This suggests that combinations of 3 to 5 terms best capture the intent behind user queries. We also experiment with weighting query sub-sequences by their frequency in query logs. Preliminary tests show promising empirical results.
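To make the idea of n-gram proximity concrete, the following is a minimal sketch, not the method evaluated in the study. It enumerates contiguous query sub-sequences of a given length and scores a document by how tightly each sub-sequence's terms cluster. The function names, the minimum-covering-window measure, the score (distinct terms divided by window length), and the optional `weights` dictionary standing in for query log frequencies are all assumptions made for illustration.

```python
def query_ngrams(terms, n):
    """Return all contiguous sub-sequences of n query terms."""
    return [tuple(terms[i:i + n]) for i in range(len(terms) - n + 1)]


def min_cover_span(doc_tokens, ngram):
    """Length in tokens of the shortest document window that contains
    every term of ngram, or None if some term never occurs."""
    needed = set(ngram)
    counts = {}
    covered = 0
    left = 0
    best = None
    for right, tok in enumerate(doc_tokens):
        if tok in needed:
            counts[tok] = counts.get(tok, 0) + 1
            if counts[tok] == 1:
                covered += 1
        # Shrink the window from the left while it still covers all terms.
        while covered == len(needed):
            span = right - left + 1
            if best is None or span < best:
                best = span
            ltok = doc_tokens[left]
            if ltok in needed:
                counts[ltok] -= 1
                if counts[ltok] == 0:
                    covered -= 1
            left += 1
    return best


def proximity_score(doc_tokens, query_terms, n, weights=None):
    """Sum over query n-grams of weight * (distinct terms / window length).

    `weights` is a hypothetical mapping from n-gram tuples to importance
    values (for example, derived from query log counts). When it is
    omitted, every n-gram counts equally.
    """
    score = 0.0
    for gram in query_ngrams(query_terms, n):
        span = min_cover_span(doc_tokens, gram)
        if span is None:
            continue  # the n-gram's terms do not all appear in the document
        w = 1.0 if weights is None else weights.get(gram, 0.0)
        score += w * len(set(gram)) / span
    return score
```

Calling `proximity_score(doc_tokens, query_terms, n)` for `n` from 3 to 5 gives one illustrative feature per sequence length, which could then be compared or combined in a ranking function.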