A theoretical basis for the use of co-occurrence data in information retrieval

This paper provides a foundation for a practical way of improving the effectiveness of an automatic retrieval system. Its main concern is the weighting of index terms as a device for increasing retrieval effectiveness. Previously, index terms have been assumed to be independent, for the good reason that a very simple weighting scheme can then be used. In reality, index terms are most unlikely to be independent. This paper explores one way of removing the independence assumption: the extent of the dependence between index terms is measured and used to construct a non-linear weighting function. In a practical situation, the values of some of the parameters of such a function must be estimated from small samples of documents. A number of estimation rules are therefore discussed, and one in particular is recommended. Finally, the feasibility of the computations required for a non-linear weighting scheme is examined.
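As a rough illustration of what measuring the dependence between two index terms from co-occurrence data can look like, the sketch below computes the mutual information between the presence or absence of two terms across a small document sample. The abstract does not say which dependence measure the paper uses, so the choice of mutual information, the function name `mutual_information`, and the representation of a document as a set of terms are all assumptions made for this example. It uses raw relative frequencies and does not apply any of the estimation rules the paper discusses.

```python
import math


def mutual_information(docs, term_a, term_b):
    """Mutual information (in nats) between the presence/absence of two terms.

    `docs` is a list of sets of index terms (a hypothetical representation).
    Probabilities are plain relative frequencies over the sample.
    """
    n = len(docs)
    if n == 0:
        raise ValueError("need at least one document")

    # Joint counts over the four presence/absence combinations.
    counts = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
    for doc in docs:
        counts[(int(term_a in doc), int(term_b in doc))] += 1

    mi = 0.0
    for (a, b), c in counts.items():
        if c == 0:
            continue  # a zero-probability cell contributes nothing
        p_ab = c / n
        p_a = (counts[(a, 0)] + counts[(a, 1)]) / n  # marginal for term_a
        p_b = (counts[(0, b)] + counts[(1, b)]) / n  # marginal for term_b
        mi += p_ab * math.log(p_ab / (p_a * p_b))
    return mi


# Hypothetical four-document sample in which the two terms always co-occur.
sample = [{"retrieval", "index"}, {"retrieval", "index"}, set(), set()]
print(mutual_information(sample, "retrieval", "index"))
```

For terms that occur independently in the sample the value is zero, and it grows as the co-occurrence pattern departs from independence; in the sample above, where the two terms always appear together in half the documents, it equals log 2. With small samples, raw frequencies like these can be unreliable, which is the estimation problem the abstract refers to.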
