Indexing by Latent Semantic Analysis

A new method for automatic indexing and retrieval is described. The approach is to take advantage of implicit higher-order structure in the association of terms with documents (“semantic structure”) in order to improve the detection of relevant documents on the basis of terms found in queries. The particular technique used is singular-value decomposition, in which a large term-by-document matrix is decomposed into a set of ca. 100 orthogonal factors from which the original matrix can be approximated by linear combination. Documents are represented by ca. 100-item vectors of factor weights. Queries are represented as pseudo-document vectors formed from weighted combinations of terms, and documents with supra-threshold cosine values are returned. Initial tests find this completely automatic method for retrieval to be promising.
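The pipeline summarized above (truncated singular-value decomposition, pseudo-document queries, cosine thresholding) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `build_lsi`, `fold_in_query`, and `retrieve`, the toy matrix, the choice of k = 2 factors, and the 0.5 threshold are all hypothetical. The sketch also assumes that cosines are computed after scaling factor weights by the singular values, and it uses a dense SVD where a large collection would call for a sparse or Lanczos-style solver.

```python
import numpy as np

def build_lsi(term_doc, k):
    """Truncated SVD of a term-by-document matrix, keeping the k largest factors."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    # Rows of U_k are term vectors; rows of D_k are document vectors of factor weights.
    return U[:, :k], s[:k], Vt[:k, :].T

def fold_in_query(term_weights, U_k, s_k):
    """Represent a weighted combination of terms as a pseudo-document: q^T U_k S_k^{-1}."""
    return (term_weights @ U_k) / s_k

def retrieve(query_vec, doc_vecs, s_k, threshold):
    """Return (document index, cosine) pairs whose cosine exceeds the threshold."""
    # Assumption: compare vectors after scaling each factor by its singular value.
    q = query_vec * s_k
    d = doc_vecs * s_k
    norms = np.linalg.norm(d, axis=1) * np.linalg.norm(q)
    cosines = (d @ q) / np.where(norms == 0, 1.0, norms)
    hits = [(i, float(c)) for i, c in enumerate(cosines) if c > threshold]
    return sorted(hits, key=lambda h: -h[1])

# Toy collection: 5 terms (rows) by 4 documents (columns).
# Terms 0-2 occur only in documents 0-1; terms 3-4 occur only in documents 2-3.
X = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

U_k, s_k, D_k = build_lsi(X, k=2)

# Query containing only term 1, which appears literally in document 0 alone.
query = fold_in_query(np.array([0, 1, 0, 0, 0], dtype=float), U_k, s_k)
hits = retrieve(query, D_k, s_k, threshold=0.5)
print(hits)
```

In this toy matrix the query term occurs only in document 0, yet document 1 is also returned, because both documents load on the same retained factor. Literal term matching would miss document 1, which is the kind of gap the factor representation is meant to close.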
