SMART High Precision: TREC 7

The Smart information retrieval project emphasizes completely automatic approaches to the understanding and retrieval of large quantities of text. We continue our work in TREC 7, concentrating on high-precision retrieval. In particular, we present an in-depth analysis of our High-Precision Track results, including offering evaluation approaches and measures for time-dependent evaluation. We also participated in the Query Track, making initial efforts at analyzing query variability, one of the major obstacles to improving retrieval effectiveness.

Basic Indexing and Retrieval

In the Smart system, the vector-processing model of retrieval is used to transform both the available information requests and the stored documents into vectors of the form

$D_i = (w_{i1}, w_{i2}, \ldots, w_{it})$

where $D_i$ represents a document (or query) text and $w_{ik}$ is the weight of term $T_k$ in document $D_i$. A weight of zero is used for terms that are absent from a particular document, and positive weights characterize terms actually assigned. The assumption is that $t$ terms in all are available for the representation of the information.

The basic "tf*idf" weighting schemes used within SMART have been discussed many times. For TREC 7 we use the same basic weights and document length normalization as were developed at Cornell by Amit Singhal for TREC 4 [3, 5]. Tests on various collections show that this indexing is reasonably collection independent and thus should be valid across a wide range of new collections. No human expertise in the subject matter is required for either the initial collection creation or the actual query formulation.

The same phrase strategy (and the same phrases) used in all previous TRECs (for example, [2, 3, 4, 1]) is used for TREC 7. Any pair of adjacent non-stopwords is regarded as a potential phrase. The final list of phrases is composed of those pairs of words occurring in 25 or more documents of the initial TREC 1 document set. Phrases are weighted with the same scheme as single terms.
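The phrase selection rule described above can be illustrated with a minimal sketch. This is not the SMART implementation: the stopword list, the whitespace tokenizer, and the function names `candidate_phrases` and `build_phrase_list` are hypothetical, and the sketch assumes that "adjacent" means adjacent in the original token sequence, with both words being non-stopwords. Only the threshold of 25 documents comes from the text.

```python
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch;
# the list actually used by SMART is not given in the text).
STOPWORDS = {"the", "of", "a", "an", "and", "in", "to", "is", "for"}


def candidate_phrases(tokens):
    """Return every pair of adjacent tokens in which neither is a stopword."""
    return [
        (left, right)
        for left, right in zip(tokens, tokens[1:])
        if left not in STOPWORDS and right not in STOPWORDS
    ]


def build_phrase_list(documents, min_doc_freq=25):
    """Keep the candidate pairs that occur in at least min_doc_freq documents."""
    doc_freq = Counter()
    for text in documents:
        # Naive lowercase/whitespace tokenization, assumed for illustration only.
        tokens = text.lower().split()
        # Use a set so each pair is counted at most once per document.
        doc_freq.update(set(candidate_phrases(tokens)))
    return {pair for pair, df in doc_freq.items() if df >= min_doc_freq}


if __name__ == "__main__":
    docs = [
        "information retrieval system",
        "the information retrieval of text",
        "retrieval system",
    ]
    # A threshold of 2 is used here only because the toy collection is tiny.
    print(sorted(build_phrase_list(docs, min_doc_freq=2)))
```

Counting document frequency rather than raw occurrences matches the "25 or more documents" criterion: a pair repeated many times within a single document still contributes only one to its count.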
When the text of document $D_i$ is represented by a vector of the form $(d_{i1}, d_{i2}, \ldots, d_{it})$ and query $Q_j$ by the vector $(q_{j1}, q_{j2}, \ldots, q_{jt})$, a similarity (S) computation between the two items can conveniently be obtained as the inner product between corresponding weighted term vectors as follows: