Multinomial PCA for extracting major latent topics from document streams

We propose a new unsupervised learning method, multinomial PCA (MuPCA), for efficiently extracting the major latent topics from a document stream, based on the "bag-of-words" (BOW) representation of each document. Unlike PCA, MuPCA is built on a probabilistic generative model suited to a document stream represented as a time series of word-frequency vectors. Using real document streams collected from the Web, we experimentally demonstrate the effectiveness of the proposed method.
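
As an illustration of the input representation only (not of MuPCA itself), the following minimal sketch shows how a document stream might be turned into a time series of word-frequency vectors under the BOW representation. The helper names `build_vocabulary` and `bow_vector`, the whitespace tokenization, and the toy stream are assumptions made for this example and do not come from the paper.

```python
from collections import Counter


def build_vocabulary(stream):
    """Map each distinct word in the stream to a column index (sorted for determinism)."""
    words = sorted({w for doc in stream for w in doc.lower().split()})
    return {w: i for i, w in enumerate(words)}


def bow_vector(doc, vocab):
    """Return the word-frequency vector of one document over the vocabulary."""
    counts = Counter(doc.lower().split())
    # vocab is a dict built in sorted order, so iteration follows the column indices.
    return [counts.get(w, 0) for w in vocab]


# Hypothetical toy stream, listed in order of arrival time.
stream = [
    "markets fall as oil prices rise",
    "oil prices rise again",
    "team wins final match",
]

vocab = build_vocabulary(stream)
# One word-frequency vector per time step; word order inside a document is discarded.
series = [bow_vector(doc, vocab) for doc in stream]
```

Each entry of `series` is a vector of non-negative counts, which is the kind of data a multinomial generative model is naturally defined over, in contrast to the real-valued data that ordinary PCA implicitly assumes.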