Incorporation of the age of a document into the retrieval process

Abstract: A full treatment of the significance of a document for an enquirer should include a joint description of the similarity between the document and the enquiry in a linguistic sense, and the age of the document at the time of the enquiry. The basic variables are identified in terms of a signal detection model. The age variable is related to the phenomenon of obsolescence, which is treated as a perceived, signed attribute of relevant documents. Two retrieval methods that use both index terms and document age are described: one in which a set of documents, first selected by a term-intersection process, is reduced by applying a date-of-publication criterion (the “subset method”); and one in which a bivariate function attaches a single number to each document, and a retrieved set is defined by a single threshold value (the “bivariate weight method”). In the latter method, discriminant analysis is a useful aid. A model of the retrieval process, based on continuous variables, is described, and the effectiveness of each method is predicted, both in terms of the Precision-Recall graph and language measures. The model suggests that either method can improve retrieval performance, but that incorrect usage will depress it. The better choice of method will depend on the Recall/Precision mix required by the user, as well as on the actual parameters of the distributions. A relationship is hypothesised between the growth rate of a data base and the underlying distributions defined by relevance judgements.
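The two methods named in the abstract can be contrasted in a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the `Doc` record, the overlap-fraction `similarity`, the linear form of the bivariate function, and the values of `w_sim`, `w_age` and `threshold`. The abstract says only that a bivariate function assigns one number per document and that discriminant analysis can help in choosing it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    # Hypothetical document record: identifier, index terms, publication year.
    doc_id: str
    terms: frozenset
    year: int


def subset_method(docs, query_terms, min_year):
    """Term intersection first, then a date-of-publication cut-off."""
    query = frozenset(query_terms)
    matched = [d for d in docs if query <= d.terms]  # all query terms present
    return [d.doc_id for d in matched if d.year >= min_year]


def similarity(doc, query_terms):
    """Assumed similarity: fraction of query terms found in the document."""
    query = frozenset(query_terms)
    return len(query & doc.terms) / len(query)


def bivariate_weight_method(docs, query_terms, enquiry_year,
                            w_sim, w_age, threshold):
    """Attach one number to each document and apply a single threshold.

    The linear form w_sim * similarity - w_age * age is an assumption;
    the coefficients are illustrative, not fitted.
    """
    retrieved = []
    for d in docs:
        age = enquiry_year - d.year
        weight = w_sim * similarity(d, query_terms) - w_age * age
        if weight >= threshold:
            retrieved.append(d.doc_id)
    return retrieved


docs = [
    Doc("A", frozenset({"retrieval", "obsolescence"}), 1974),
    Doc("B", frozenset({"retrieval", "obsolescence"}), 1960),
    Doc("C", frozenset({"retrieval"}), 1975),
]
query = ["retrieval", "obsolescence"]

subset_result = subset_method(docs, query, min_year=1970)
weight_result = bivariate_weight_method(
    docs, query, enquiry_year=1976, w_sim=1.0, w_age=0.04, threshold=0.4
)
print(subset_result)  # ['A']
print(weight_result)  # ['A', 'C']
```

With these made-up values, the subset method returns only the recent full match, while the bivariate weight lets a recent partial match (C) pass the threshold and excludes an old full match (B). This is the kind of trade-off that a single threshold on a combined score makes possible.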
