Random Indexing of Text Samples for Latent Semantic Analysis

Pentti Kanerva, Jan Kristoferson, Anders Holst
kanerva@sics.se, janke@sics.se, aho@sics.se
RWCP Theoretical Foundation SICS Laboratory
Swedish Institute of Computer Science, Box 1263, SE-16429 Kista, Sweden

Latent Semantic Analysis is a method of computing high-dimensional semantic vectors, or context vectors, for words from their co-occurrence statistics. An experiment by Landauer & Dumais (1997) covers a vocabulary of 60,000 words (unique letter strings delimited by word-space characters) in 30,000 contexts (text samples or "documents" of about 150 words each). The data are first collected into a 60,000-by-30,000 words-by-contexts co-occurrence matrix, with each row representing a word and each column representing a text sample, so that each entry gives the frequency of a given word in a given text sample. The frequencies are normalized, and the normalized matrix is transformed with Singular-Value Decomposition (SVD), reducing its original 30,000 document dimensions to a much smaller number of latent dimensions, 300 proving to be optimal. Thus words are represented by 300-dimensional semantic vectors.

The point in all of this is that the vectors capture meaning. Landauer and Dumais demonstrate it with a synonym test called TOEFL (for "Test Of English as a Foreign Language"). For each test word, four alternatives are given, and the "contestant" is asked to find the one that's the most synonymous. Choosing at random would yield 25% correct. However, when the semantic vector for the test word is compared to the semantic vectors for the four alternatives, it correlates most highly with the correct alternative in 64% of the cases. When the same test is based on the 30,000-dimensional vectors before SVD, the result is not nearly as good: only 36% correct. The authors conclude that the reorganization of information by SVD somehow corresponds to human psychology.

We have studied high-dimensional random distributed representations as models of brainlike representation of information (Kanerva, 1994; Kanerva & Sjödin, 1999). In this poster we report on the use of such a representation to reduce the dimensionality of the original words-by-contexts matrix. The method can be explained by looking at the 60,000-by-30,000 matrix of frequencies above. Assume that each text sample is represented by a 30,000-bit vector with a single 1 marking the place of the sample in a list of all samples, and call it the sample's index vector (i.e., the nth bit of the index vector for the nth text sample is 1; the representation is unitary, or local). Then the words-by-contexts matrix of frequencies can be gotten by the following procedure: every time that the word w occurs in the nth text sample, the nth index vector is added to the row for the word w.

We use the same procedure for accumulating a words-by-contexts matrix, except that the index vectors are not unitary. A text sample's index vector is "small" by comparison: we have used 1,800-dimensional index vectors, each with several randomly placed -1s and 1s and the rest 0s (e.g., four each of -1 and 1, or eight non-0s in 1,800, instead of one non-0 in 30,000 as above). Thus we would accumulate the same data into a 60,000-by-1,800 words-by-contexts matrix instead of 60,000-by-30,000.
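The accumulation just described is simple enough to sketch in code. The following Python/NumPy fragment is an illustration of the procedure only, not the program used in our experiments: the function names, the choice of four +1s and four -1s per index vector, and the toy usage at the end are assumptions made for the example. The choose_synonym helper merely mirrors the TOEFL-style comparison described earlier by picking the alternative whose context vector correlates most highly with the test word's.

```python
import numpy as np

D = 1_800        # dimensionality of the random index vectors
NONZERO = 8      # e.g., four +1s and four -1s per index vector

rng = np.random.default_rng(0)

def random_index_vector(dim=D, nonzero=NONZERO):
    """Sparse ternary index vector: a few random +1s and -1s, the rest 0s."""
    v = np.zeros(dim, dtype=np.int32)
    positions = rng.choice(dim, size=nonzero, replace=False)
    v[positions[: nonzero // 2]] = 1
    v[positions[nonzero // 2:]] = -1
    return v

def accumulate(text_samples):
    """Accumulate a words-by-contexts matrix from tokenized text samples.

    Every time a word occurs in a sample, that sample's index vector is
    added to the word's row -- the same procedure as with unitary (local)
    index vectors, only in far fewer dimensions.
    """
    context_vectors = {}
    for sample in text_samples:
        index_vector = random_index_vector()
        for word in sample:
            row = context_vectors.setdefault(word, np.zeros(D, dtype=np.int64))
            row += index_vector
    return context_vectors

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def choose_synonym(context_vectors, test_word, alternatives):
    """TOEFL-style choice: the alternative most similar to the test word."""
    target = context_vectors[test_word]
    return max(alternatives, key=lambda w: cosine(target, context_vectors[w]))

# Tiny illustrative run (two toy samples; meaningful synonym choices
# require many thousands of text samples):
samples = [
    "the quick brown fox jumps over the lazy dog".split(),
    "a fast brown fox leaps over a sleepy dog".split(),
]
vectors = accumulate(samples)
print(cosine(vectors["fox"], vectors["dog"]))  # words sharing contexts get similar vectors
```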
Our method has been verified with different data, a ten-million-word "TASA" corpus consisting of a 79,000-word vocabulary (when words are truncated after the 8th character) in 37,600 text samples. The data were accumulated into a 79,000-by-1,800 words-by-contexts matrix, which was normalized by thresholding into a matrix of -1s, 0s, and 1s. The unnormalized 1,800-dimensional context vectors gave 35-44% correct in the TOEFL test and the normalized ones gave 48-51% correct, which correspond to Landauer & Dumais' 36% for their normalized 30,000-dimensional vectors before SVD, for a different corpus (see above). Our words-by-contexts matrix can be transformed further, for example with SVD as in LSA, except that the matrix is much smaller.

Mathematically, the 30,000- or 37,600-dimensional index vectors are orthogonal, whereas the 1,800-dimensional ones are only nearly orthogonal. They seem to work just as well, in addition to which they are more "brainlike" and less affected by the number of text samples (1,800-dimensional index vectors can cover a wide-ranging number of text samples). We have used such vectors also to index words in narrow context windows, getting 62-70% correct, and conclude that random indexing deserves to be studied and understood more fully.

Acknowledgments. This research is supported by Japan's Ministry of International Trade and Industry (MITI) under the Real World Computing Partnership (RWCP) program. The TASA corpus and 80 TOEFL test items were made available to us by courtesy of Professor Thomas Landauer, University of Colorado.

References

Kanerva, P. (1994). The Spatter Code for encoding concepts at many levels. In M. Marinaro and P. G. Morasso (eds.), ICANN '94, Proc. Int'l Conference on Artificial Neural Networks (Sorrento, Italy), vol. 1, pp. 226-229. London: Springer-Verlag.

Kanerva, P., and Sjödin, G. (1999). Stochastic Pattern Computing. Proc. 2000 Real World Computing Symposium (Report TR-99-002, pp. 271-276). Tsukuba-city, Japan: Real World Computing Partnership.

Landauer, T. K., and Dumais, S. T. (1997). A solution to Plato's problem: The Latent Semantic Analysis theory of the acquisition, induction, and representation of knowledge. Psychological Review 104(2):211-240.