Paper information - Statistical phrases for vector-space information retrieval (poster abstract)

When employing a vector-space model to evaluate a query against a document collection, several choices must be made. A fundamental design decision is the definition of the terms that form the dimensions of the space. Should the terms be single words, pairs of words, linguistic phrases, entire sentences, or some other combination of textual units? It seems intuitive that when calculating a measure of similarity between a natural language query text and natural language documents, some respect should be paid to word ordering. Complex terms such as phrases should, therefore, increase the precision of retrieval results. Recent work has, however, shown that this is not the case [8, 4]. In this abstract we describe experiments that further confirm that observation. Note that we are solely concerned with statistical phrases; that is, phrases derived using techniques other than NLP.
Exploration of phrases as terms in a vector-space based retrieval system has received detailed attention over at least 25 years. Salton et al. [6] show that including statistical phrases as terms in vector-space based retrieval increases precision averaged over 10 recall points by 17% to 39%. These experiments were updated by Fagan in 1989, who used larger document collections [2] (but, at about 10 MB, still small by today’s standards). Fagan reports that average precision improvements range from -11% up to 20%. The downward trend in the impact of statistical phrases on average precision continued in 1997, with Mitra et al. [4] replicating Fagan’s experiments on a 655 MB collection and reporting a 1% precision improvement if phrases are used as terms. This surprising result is also supported in a separate study by Smeaton and Kelledy [8].
Our findings independently confirm these previous results, and add further evidence to the case against the use of phrases as precision-enhancing devices, a result that we still find somewhat surprising, since documents and queries are surely more than just bags of words.
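The design decision discussed in the abstract, namely which textual units form the dimensions of the vector space, can be illustrated with a minimal sketch. It is not the authors' experimental system. It assumes that a "statistical phrase" can be approximated by any adjacent word pair, that terms are weighted by raw frequency, and that similarity is the cosine measure; the function names `tokenize`, `extract_terms`, and `cosine` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def extract_terms(text, use_phrases=False):
    """Return a bag of terms: single words, plus adjacent word pairs if requested."""
    words = tokenize(text)
    terms = Counter(words)
    if use_phrases:
        # Assumption: every adjacent word pair counts as a statistical phrase.
        # Real systems typically filter pairs by frequency and stopwords.
        terms.update(" ".join(pair) for pair in zip(words, words[1:]))
    return terms


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors (Counters)."""
    dot = sum(weight * b[term] for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


query = "information retrieval"
doc_a = "retrieval of information from text"   # same words, different order
doc_b = "information retrieval systems"        # same words, same order

for use_phrases in (False, True):
    q = extract_terms(query, use_phrases)
    score_a = cosine(q, extract_terms(doc_a, use_phrases))
    score_b = cosine(q, extract_terms(doc_b, use_phrases))
    print(f"phrases={use_phrases}: doc_a={score_a:.3f} doc_b={score_b:.3f}")
```

In this toy example, adding word pairs as terms widens the gap between the document that preserves the query's word order and the one that does not. That is the intuition the abstract describes. It says nothing about retrieval effectiveness on a real collection, which is what the cited experiments measure.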

Alistair Moffat | Andrew Turpin

[1] Donna Harman, et al. Overview of the First Text REtrieval Conference, 1993, SIGIR 1993.

[2] Joel L. Fagan. The effectiveness of a nonsyntactic approach to automatic phrase indexing for document retrieval, 1989.

[3] Joel L. Fagan, et al. The effectiveness of a nonsyntactic approach to automatic phrase indexing for document retrieval, 1989, JASIS.

[4] Chris Buckley, et al. Pivoted Document Length Normalization, 1996, SIGIR Forum.

[5] Alistair Moffat, et al. Effective document presentation with a locality-based similarity heuristic, 1999, SIGIR '99.

[6] Claire Cardie, et al. An Analysis of Statistical and Syntactic Phrases, 1997, RIAO.

[7] Alistair Moffat, et al. Exploring the similarity space, 1998, SIGF.

[8] Alan F. Smeaton, et al. User-Chosen Phrases in Interactive Query Formulation for Information Retrieval, 1998, BCS-IRSG Annual Colloquium on IR Research.

[9] Clement T. Yu, et al. A theory of term importance in automatic text analysis, 1974, J. Am. Soc. Inf. Sci.