A linear space representation of language probability through SVD of N-gram matrix

The number of parameters required by a word N-gram model grows as the N-th power of the vocabulary size, so compression of the parameter space is essential in many application domains. In this research, singular value decomposition (SVD) is applied to a word co-occurrence matrix built from N-tuples. Words and word-history states are represented as vectors in a K-dimensional space, and the authors compress the N-gram probability parameter space by approximating the original matrix with one of lower rank. The results show that, in this vector space, the trigram model can be represented with roughly 17.5% fewer parameters. In addition, clustering is performed based on distances in the defined space to investigate whether words are positioned appropriately in the linear space. A comparison at the same number of parameters confirms that the entropy is lower than that of a class model obtained by maximizing mutual information, indicating that the words are well positioned. © 2003 Wiley Periodicals, Inc. Electron Comm Jpn Pt 3, 86(8): 61–70, 2003; Published online in Wiley InterScience (www.interscience.wiley.com). DOI 10.1002/ecjc.10106
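The core idea of the abstract can be illustrated with a minimal sketch: take a bigram co-occurrence count matrix, keep only its K largest singular values, and renormalize the low-rank reconstruction into conditional probabilities. The vocabulary, counts, and renormalization details below are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np

# Toy bigram co-occurrence counts C[i, j] = count(word_i followed by word_j).
# Vocabulary and counts are illustrative, not taken from the paper.
vocab = ["the", "cat", "dog", "sat", "ran"]
C = np.array([
    [0, 4, 3, 0, 0],
    [1, 0, 0, 2, 1],
    [1, 0, 0, 1, 2],
    [2, 0, 0, 0, 0],
    [2, 0, 0, 0, 0],
], dtype=float)

# SVD of the co-occurrence matrix: C = U S V^T.
U, s, Vt = np.linalg.svd(C, full_matrices=False)

# Rank-K approximation: keep the K largest singular values, so each word
# is represented by a K-dimensional vector and the parameter count drops
# from |V|^2 to roughly 2*K*|V|.
K = 2
C_k = U[:, :K] @ np.diag(s[:K]) @ Vt[:K, :]

# Clip any small negatives introduced by the approximation, then
# renormalize each row into smoothed conditional probabilities P(w_j | w_i).
C_k = np.clip(C_k, 0.0, None)
P = C_k / np.maximum(C_k.sum(axis=1, keepdims=True), 1e-12)

print(np.round(P, 3))
```

Each row of `P` is a compressed estimate of the next-word distribution for one history word; clustering words by distances between their K-dimensional vectors (the rows of `U[:, :K]`) corresponds to the linear-space positioning investigated in the paper.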