A probabilistic model for Latent Semantic Indexing

Latent Semantic Indexing (LSI), when applied to a semantic space built from text collections, improves information retrieval, information filtering, and word sense disambiguation. A new dual probability model based on the similarity concept is introduced to provide a deeper understanding of LSI. Semantic associations can be quantitatively characterized by their statistical significance, the likelihood. Semantic dimensions containing redundant and noisy information can be separated out and should be ignored because of their negative contribution to the overall statistical significance. LSI is the optimal solution of the model. The peak in the likelihood curve indicates the existence of an intrinsic semantic dimension. The importance of LSI dimensions follows the Zipf distribution, indicating that LSI dimensions represent latent concepts. The document frequency of words follows the Zipf distribution, and the number of distinct words follows a log-normal distribution. Experiments on five standard document collections confirm and illustrate the analysis. © 2005 Wiley Periodicals, Inc.
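A minimal sketch of the LSI mechanics the abstract refers to: a rank-k truncated SVD of a term-document matrix yields the optimal rank-k approximation (Eckart-Young), and document similarity is then measured in the k latent dimensions. The tiny count matrix and the choice k = 2 here are illustrative assumptions, not data from the paper.

```python
import numpy as np

# Toy term-document count matrix X (terms x documents); values are hypothetical.
# Docs 0-1 share "graph"/"tree" vocabulary; doc 2 uses "user"/"interface".
X = np.array([
    [2, 1, 0],   # "graph"
    [1, 2, 0],   # "tree"
    [0, 0, 3],   # "user"
    [0, 1, 2],   # "interface"
], dtype=float)

# LSI: truncated SVD keeps the k strongest semantic dimensions and
# discards the weak (noisy, redundant) ones the abstract mentions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_latent = (np.diag(s[:k]) @ Vt[:k]).T   # documents x k latent coordinates

def cos(a, b):
    """Cosine similarity between two latent document vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# In the latent space, docs 0 and 1 remain closer to each other
# than either is to the vocabulary-disjoint doc 2.
print(cos(doc_latent[0], doc_latent[1]) > cos(doc_latent[0], doc_latent[2]))
```

In practice the number of retained dimensions k would be chosen at the peak of the likelihood curve described in the abstract, rather than fixed by hand.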
