Latent concepts and the number of orthogonal factors in latent semantic analysis

We seek insight into Latent Semantic Indexing by establishing a method to identify the optimal number of factors in the reduced matrix for representing a keyword. The method is demonstrated empirically by duplicating every document that contains a term t and inserting the copies into the database with t replaced by a new term t'. By examining how often term t is identified by a search on term t' (precision) over differing ranges of dimensions, we find that lower-ranked dimensions identify related terms and higher-ranked dimensions discriminate between the synonyms.
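
The corpus construction described above can be sketched as follows. This is only an illustration under stated assumptions: documents are represented as token lists, the helper name add_synthetic_synonym and the toy documents are hypothetical, and the reading that the original documents are kept alongside the t' copies is one interpretation of the abstract rather than the authors' confirmed procedure.

```python
def add_synthetic_synonym(docs, term, synonym):
    """Return the corpus plus one copy of every document containing `term`,
    with `term` replaced by `synonym` in the copy (hypothetical helper)."""
    augmented = [list(doc) for doc in docs]  # keep the originals unchanged
    for doc in docs:
        if term in doc:
            # The copy is identical except that t becomes t'.
            augmented.append([synonym if tok == term else tok for tok in doc])
    return augmented


# Toy corpus (invented for illustration, not taken from the paper).
docs = [
    ["oil", "price", "rise"],
    ["wheat", "export", "fall"],
    ["oil", "export", "rise"],
]
augmented = add_synthetic_synonym(docs, "oil", "oil_prime")
```

Because t and t' occur in identical contexts by construction, t' acts as a perfect synonym of t, so a search on t' that retrieves t can be scored without manual relevance judgments.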
