Improving the retrieval of information from external sources

A major barrier to successful retrieval from external sources (e.g., electronic databases) is the tremendous variability in the words that people use to describe objects of interest. Because different authors use different words to describe essentially the same idea, relevant objects will be missed; conversely, because the same word can refer to many different things, irrelevant objects will be retrieved. We describe a statistical method called latent semantic indexing, which models the implicit higher-order structure in the association of words and objects and improves retrieval performance by up to 30%. Additional large performance improvements of 40% and 67% can be achieved through the use of differential term weighting and iterative retrieval methods, respectively.
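To make the idea concrete, the following is a minimal sketch of latent semantic indexing on a toy collection, assuming the common formulation in which a term-by-document matrix is reduced with a truncated singular value decomposition and queries are folded into the reduced space. The five terms, four documents, the choice of k = 2, and the helper names `lsi_fit` and `fold_in_query` are illustrative assumptions, not taken from the paper, and no term weighting is applied.

```python
import numpy as np

# Toy term-by-document count matrix (hypothetical data).
# Rows are terms, columns are documents d0..d3.
terms = ["car", "automobile", "engine", "flower", "petal"]
A = np.array([
    [1, 0, 0, 0],  # car
    [0, 1, 0, 0],  # automobile
    [1, 1, 0, 0],  # engine
    [0, 0, 1, 1],  # flower
    [0, 0, 1, 1],  # petal
], dtype=float)


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is zero."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0


def lsi_fit(A, k):
    """Keep the k largest singular triplets of A (truncated SVD)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Rows of the returned V_k are document coordinates in the k-dim space.
    return U[:, :k], s[:k], Vt[:k, :].T


def fold_in_query(q, U_k, s_k):
    """Project a term-space query into the reduced space: q^T U_k S_k^{-1}."""
    return (q @ U_k) / s_k


# Query containing only the word "car".
q = np.zeros(len(terms))
q[terms.index("car")] = 1.0

# Plain term matching: cosine between the query and each raw document column.
raw_scores = [cosine(q, A[:, j]) for j in range(A.shape[1])]

# LSI matching: cosine in the reduced k-dimensional space.
U_k, s_k, V_k = lsi_fit(A, k=2)
q_hat = fold_in_query(q, U_k, s_k)
lsi_scores = [cosine(q_hat, V_k[j]) for j in range(A.shape[1])]

print("raw:", np.round(raw_scores, 3))
print("lsi:", np.round(lsi_scores, 3))
```

In this toy collection, document d1 uses "automobile" rather than "car", so plain term matching gives it a score of zero for the query "car". In the reduced space it scores close to 1, because "car" and "automobile" both co-occur with "engine", while the flower documents stay near zero. Comparing the folded-in query with the rows of `V_k` is one convention; scaling the document coordinates by the singular values is another.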
