Indexing by Latent Semantic Analysis

A new method for automatic indexing and retrieval is described. The approach is to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries. The particular technique used is singular-value decomposition, in which a large term-by-document matrix is decomposed into a set of ca. 100 orthogonal factors from which the original matrix can be approximated by linear combination. Documents are represented by ca. 100-item vectors of factor weights. Queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents with supra-threshold cosine values are returned. Initial tests find this completely automatic method for retrieval to be promising.
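
The pipeline summarized above can be illustrated with a minimal sketch in Python with NumPy. It is not the authors' implementation. The toy corpus, the choice of k = 2 factors (instead of ca. 100), the 0.5 cosine threshold, and the function names `lsa_fit` and `retrieve` are all assumptions made for illustration. The query is folded in as an unweighted sum of its term vectors, and documents and queries are compared after scaling by the singular values, which is one common formulation and may differ from the weighting used in the paper.

```python
import numpy as np

# Hypothetical toy corpus: rows are terms, columns are documents.
terms = ["human", "computer", "interface", "graph", "tree"]
docs = ["d0", "d1", "d2", "d3", "d4"]
A = np.array([
    [1, 0, 1, 0, 0],  # human
    [0, 1, 1, 0, 0],  # computer
    [1, 1, 0, 0, 0],  # interface
    [0, 0, 0, 1, 1],  # graph
    [0, 0, 0, 1, 0],  # tree
], dtype=float)


def lsa_fit(A, k):
    """Truncated SVD: keep the k largest factors of the term-by-document matrix."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # U_k: term loadings, s_k: singular values, V_k: document loadings (one row per document)
    return U[:, :k], s[:k], Vt[:k, :].T


def retrieve(query_terms, U_k, s_k, V_k, threshold=0.5):
    """Return the documents whose cosine with the query pseudo-document exceeds threshold."""
    # Term-count vector for the query (terms outside the vocabulary are ignored).
    q = np.array([float(query_terms.count(t)) for t in terms])
    # Project the query into the factor space (pseudo-document scaled by singular values).
    q_k = q @ U_k
    # Document coordinates scaled by the singular values.
    D = V_k * s_k
    denom = np.linalg.norm(D, axis=1) * np.linalg.norm(q_k)
    sims = (D @ q_k) / np.maximum(denom, 1e-12)
    return [d for d, sim in zip(docs, sims) if sim > threshold]


U_k, s_k, V_k = lsa_fit(A, k=2)
hits = retrieve(["human"], U_k, s_k, V_k)
print(hits)
```

In this toy corpus, d1 does not contain the term "human", yet it is returned for the query "human" because it shares the dominant factor with d0 and d2. This is the kind of indirect term-document association the method is meant to exploit.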
