Due to the ever-increasing affordability and accessibility of very large online text collections, processing natural language texts for search and retrieval has recently been the focus of heightened attention, although researchers have been active in the field since the early sixties. Numerous approaches have been attempted, but they all suffer from the obvious difficulty that search and retrieval is quintessentially a cognitive task; the degree of automatic language understanding required for a completely automatic solution is clearly beyond the bounds of current technology. Instead, heuristic search techniques attempt to match an admittedly incomplete query description with an admittedly incomplete set of features extracted from the texts of interest. The challenge therefore lies in the development of procedures that more effectively bridge the gap between an individual's partially stated desires and a universe of text, which typically appears, computationally, as a sequence of uninterpreted words. Many of these procedures are statistical in nature; they take advantage of repeated occurrences of the same word to infer relations between documents, and between queries and documents.¹ For example, similarity search induces a "relevance" ordering on the text collection by scoring each document with a normalized sum of importance weights assigned to each word in common between it and the query, where the importance weights depend upon document and collection, or corpus, frequencies [28]. A more formal approach scores documents with their estimated probability of relevance to the query by adopting a text model which assumes word occurrences are sequentially uncorrelated and training on a set of known relevant documents [3, 30].
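To make the similarity-search scoring described above concrete, the following is a minimal sketch rather than the method of any cited system. It assumes one common weighting, term frequency times the log of inverse document frequency, with cosine normalization; the text does not fix a particular formula. The function names (`tokenize`, `build_weights`, `score`, `rank`) and the whitespace tokenizer are illustrative assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real tokenization)."""
    return text.lower().split()


def build_weights(docs):
    """Return per-document {word: weight} maps and the corpus idf table.

    Assumed weighting: tf * log(N / df), where tf is the within-document
    frequency and df is the number of documents containing the word.
    """
    term_counts = [Counter(tokenize(d)) for d in docs]
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())  # count each word once per document
    n_docs = len(docs)
    idf = {w: math.log(n_docs / df) for w, df in doc_freq.items()}
    weights = [
        {w: tf * idf[w] for w, tf in counts.items()} for counts in term_counts
    ]
    return weights, idf


def vector_norm(weights):
    """Euclidean length of a sparse weight vector."""
    return math.sqrt(sum(v * v for v in weights.values()))


def score(query, doc_weights, idf):
    """Normalized sum of weights over words shared by query and document."""
    q_counts = Counter(tokenize(query))
    # Words unseen in the corpus get weight 0 and cannot match any document.
    q_weights = {w: tf * idf.get(w, 0.0) for w, tf in q_counts.items()}
    shared = set(q_weights) & set(doc_weights)
    dot = sum(q_weights[w] * doc_weights[w] for w in shared)
    norm = vector_norm(q_weights) * vector_norm(doc_weights)
    return dot / norm if norm else 0.0


def rank(query, docs):
    """Order documents by descending score; ties broken by document index."""
    weights, idf = build_weights(docs)
    scored = [(score(query, w, idf), i) for i, w in enumerate(weights)]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


if __name__ == "__main__":
    corpus = [
        "the cat sat on the mat",
        "dogs chase cats in the park",
        "stock markets fell sharply today",
    ]
    for s, i in rank("cat mat", corpus):
        print(f"{s:.3f}  {corpus[i]}")
```

Because matching is on exact word forms, the query word "cat" does not match "cats" in the second document. This illustrates the limitation noted above: the text appears to such a procedure as a sequence of uninterpreted words.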
In contrast, polysemy (one word having multiple senses) and word correlation are directly addressed by Latent Semantic Indexing, which attempts to extract characteristic linear combinations through a singular value decomposition of a word co-occurrence matrix [12]. The availability of interdocument similarity measures suggests clustering, which has been pursued both as an accelerator for conventional search and as a query broadening tool [31]. Finally, linear discriminant analysis has been deployed to

¹ Here "document" need not correspond to any particular organization. It might be a chapter within a book, a section within a chapter, or an individual paragraph. In the following we will assume that the set of documents forming a corpus is an exhaustive and disjoint partition of that corpus.
[1] Gary Marchionini. Information-seeking strategies of novices using a full-text electronic encyclopedia. 1989.
[2] Donald Hindle, et al. Noun Classification From Predicate-Argument Structures. ACL, 1990.
[3] M. V. Wilkes, et al. The Art of Computer Programming, Volume 3, Sorting and Searching. 1974.
[4] Gary Marchionini, et al. Information-seeking strategies of novices using a full-text electronic encyclopedia. JASIS, 1989.
[5] Kimmo Koskenniemi, et al. A Compiler for Two-level Phonological Rules. 1987.
[6] Don R. Swanson, et al. Probabilistic models for automatic indexing. J. Am. Soc. Inf. Sci., 1974.
[7] Jan O. Pedersen, et al. An object-oriented architecture for text retrieval. RIAO, 1991.
[8] Kenneth Ward Church. A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text. ANLP, 1988.
[9] M. E. Maron, et al. An evaluation of retrieval effectiveness for a full-text document-retrieval system. CACM, 1985.
[10] John B. Carroll, et al. The American Heritage Word Frequency Book. 1971.
[11] J. Baker. Trainable grammars for speech recognition. 1979.
[12] Michael Riley, et al. Some Applications of Tree-based Modelling to Speech and Language. HLT, 1989.
[13] Gerard Salton, et al. A vector space model for automatic indexing. CACM, 1975.
[14] Gerard Salton, et al. Automatic text processing. 1988.
[15] Gerard Salton, et al. Improving retrieval performance by relevance feedback. J. Am. Soc. Inf. Sci., 1997.
[16] Julian Kupiec, et al. Augmenting a Hidden Markov Model for Phrase-Dependent Word Tagging. HLT, 1989.
[17] Hans-Peter Frei, et al. Caliban: its user-interface and retrieval algorithm. 1985.
[18] Jan O. Pedersen, et al. Optimization for dynamic inverted index maintenance. SIGIR '90, 1989.
[19] Peter Willett, et al. Recent trends in hierarchic document clustering: A critical review. Inf. Process. Manag., 1988.