Latent semantic analysis

We now describe an interesting application of SVD to text documents. Suppose we represent documents as a bag of words, so Xij is the number of times word j occurs in document i, for j = 1 : W and i = 1 : D, where W is the number of words and D is the number of documents. To find a document that contains a given word, we can use standard search procedures, but these can get confused by synonymy (different words with the same meaning) and polysemy (the same word with different meanings). An alternative approach is to assume that X was generated by some low-dimensional latent representation X̂ ∈ ℝ^{D×K}, where K is the number of latent dimensions. If we compare documents in the latent space, we should get improved retrieval performance, because words of similar meaning get mapped to similar low-dimensional locations. We can compute a low-dimensional representation of X by computing the SVD and then taking the top K singular values/vectors:
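As a concrete illustration, here is a minimal NumPy sketch (not from the text) of this procedure: a small hypothetical D × W count matrix X is factored with the SVD, truncated to K = 2 singular values/vectors, and the resulting latent document representation is used to compare documents.

```python
import numpy as np

# Toy term-count matrix X: D = 4 documents (rows) x W = 5 words (columns).
# The counts are made up purely to illustrate the shapes involved.
X = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 1],
    [0, 0, 3, 1, 0],
    [0, 1, 1, 2, 0],
], dtype=float)

K = 2  # number of latent dimensions

# Full (thin) SVD: X = U diag(s) Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Keep only the top K singular values/vectors.
U_k, s_k, Vt_k = U[:, :K], s[:K], Vt[:K, :]

# Rank-K approximation of X.
X_hat = U_k @ np.diag(s_k) @ Vt_k

# Latent (D x K) representation of the documents; comparing documents
# in this space (e.g. by cosine similarity) is the basis of LSA retrieval.
Z = U_k * s_k

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_sim(Z[0], Z[1]))  # documents 0 and 1 share vocabulary: high similarity
print(cosine_sim(Z[0], Z[2]))  # documents 0 and 2 do not: low similarity
```

In this sketch, documents that use overlapping vocabulary end up close together in the latent space even when they do not share every word, which is the behaviour the text describes for handling synonymy.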