Snippet Search: a Single Phrase Approach to Text Access

Due to the ever-increasing affordability and accessibility of very large online text collections, processing natural language texts for search and retrieval has recently been the focus of heightened attention, although researchers have been active in the field since the early sixties. Numerous approaches have been attempted, but they all suffer from the obvious difficulty that search and retrieval is quintessentially a cognitive task; the degree of automatic language understanding required for a completely automatic solution is clearly beyond the bounds of current technology. Instead, heuristic search techniques attempt to match an admittedly incomplete query description with an admittedly incomplete set of features extracted from the texts of interest. The challenge therefore lies in the development of procedures that more effectively bridge the gap between an individual's partially stated desires and a universe of text, which typically appears, computationally, as a sequence of uninterpreted words.

Many of these procedures are statistical in nature; they take advantage of repeated occurrences of the same word to infer relations between documents, and between queries and documents.¹ For example, similarity search induces a "relevance" ordering on the text collection by scoring each document with a normalized sum of importance weights assigned to each word in common between it and the query, where the importance weights depend upon document and collection, or corpus, frequencies [28]. A more formal approach scores documents with their estimated probability of relevance to the query by adopting a text model which assumes word occurrences are sequentially uncorrelated and training on a set of known relevant documents [3, 30].
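The similarity search described above can be sketched in a few lines. This is a minimal illustration, not the scheme of [28]: the weighting (term frequency times a smoothed inverse document frequency), the cosine normalization, the whitespace tokenizer, and all function names are assumptions chosen for brevity.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace only.
    # A real system would also strip punctuation and probably stem.
    return text.lower().split()


def idf_weights(docs):
    # docs is a list of token lists. The corpus-level weight of a word
    # shrinks as the number of documents containing it grows.
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return {word: math.log(1.0 + n / count) for word, count in df.items()}


def weight_vector(tokens, idf):
    # Importance weight = within-document frequency * corpus weight.
    # Words unseen in the corpus carry no weight and are dropped.
    tf = Counter(tokens)
    return {word: tf[word] * idf[word] for word in tf if word in idf}


def cosine(u, v):
    # Normalized sum of weight products over the words in common.
    dot = sum(u[word] * v[word] for word in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, corpus):
    # Returns (score, document index) pairs, best match first;
    # ties are broken by document index.
    docs = [tokenize(text) for text in corpus]
    idf = idf_weights(docs)
    q = weight_vector(tokenize(query), idf)
    scores = [(cosine(q, weight_vector(d, idf)), i) for i, d in enumerate(docs)]
    return sorted(scores, key=lambda pair: (-pair[0], pair[1]))


corpus = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "stock markets fell sharply today",
]
for score, index in rank("cat on a mat", corpus):
    print(f"{score:.3f}  {corpus[index]}")
```

Note that the second document receives a score of zero even though it mentions "cats": with words treated as uninterpreted strings, "cat" and "cats" share nothing, which is one face of the gap between query and text described above.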
In contrast, polysemy (one word having multiple senses) and word correlation are directly addressed by Latent Semantic Indexing, which attempts to extract characteristic linear combinations through a singular value decomposition of a word co-occurrence matrix [12]. The availability of interdocument similarity measures suggests clustering, which has been pursued both as an accelerator for conventional search and as a query broadening tool [31]. Finally, linear discriminant analysis has been deployed to

¹ Here "document" need not correspond to any particular organization. It might be a chapter within a book, a section within a chapter, or an individual paragraph. In the following we will assume that the set of documents forming a corpus is an exhaustive and disjoint partition of that corpus.

Jan O. Pedersen | John W. Tukey | Doug Cutting