A Note on EM Algorithm for Probabilistic Latent Semantic Analysis

In many text collections, we encounter documents that contain multiple topics. Extracting such topics/subtopics/themes from the text collection is important for many text mining tasks, such as search result organization, subtopic retrieval, passage segmentation, document clustering, and contextual text mining. A well-accepted practice is to explain the generation of each document with a probabilistic topic model. In such a model, every topic is represented by a multinomial distribution over the vocabulary (i.e., a unigram language model), and the model itself is usually a mixture of k such components, one per topic. One standard probabilistic topic model is Probabilistic Latent Semantic Analysis (PLSA), which is also known as Probabilistic Latent Semantic Indexing (PLSI) when used in information retrieval [3].

The basic idea of PLSA is to treat the words in each document as observations from a mixture model whose component models are the topic word distributions. The selection among the components is controlled by a set of mixing weights, and words in the same document share the same mixing weights. We may also add a background component that explains non-topical words (function words) with a background word distribution. Specifically, let θ1, ..., θk be k topic unigram language models (i.e., word distributions) and let θB be a background model for the whole collection C. A word w in a document d is regarded as a sample of the following mixture model (based on word generation).
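For intuition, the short sketch below simulates this word-generation process and evaluates the resulting mixture probability of a word. It is only an illustrative toy: the vocabulary, the word distributions, the background mixing weight lambda_B, and the document-specific topic weights pi_d are assumed values, not quantities taken from this note.

```python
import random

# Toy vocabulary; all probability values below are illustrative assumptions.
vocab = ["the", "of", "data", "mining", "text", "retrieval"]

# Background model theta_B: favors non-topical (function) words.
theta_B = {"the": 0.4, "of": 0.3, "data": 0.1, "mining": 0.05,
           "text": 0.1, "retrieval": 0.05}

# k = 2 topic unigram language models theta_1, theta_2.
topics = [
    {"the": 0.05, "of": 0.05, "data": 0.4, "mining": 0.4,
     "text": 0.05, "retrieval": 0.05},
    {"the": 0.05, "of": 0.05, "data": 0.05, "mining": 0.05,
     "text": 0.4, "retrieval": 0.4},
]

lambda_B = 0.5      # assumed probability of drawing a word from the background
pi_d = [0.7, 0.3]   # assumed mixing weights over the k topics for document d


def sample_from(dist):
    """Draw one word from a unigram distribution given as {word: prob}."""
    r = random.random()
    cumulative = 0.0
    for word, prob in dist.items():
        cumulative += prob
        if r < cumulative:
            return word
    return word  # guard against floating-point rounding


def generate_word():
    """Generate one word of document d from the background + topic mixture."""
    if random.random() < lambda_B:
        return sample_from(theta_B)                      # background word
    j = random.choices(range(len(topics)), weights=pi_d)[0]
    return sample_from(topics[j])                        # word from topic j


def word_prob(w):
    """Mixture probability of word w in document d:
    lambda_B * p(w|theta_B) + (1 - lambda_B) * sum_j pi_d[j] * p(w|theta_j)."""
    topical = sum(pi_d[j] * topics[j][w] for j in range(len(topics)))
    return lambda_B * theta_B[w] + (1 - lambda_B) * topical


print([generate_word() for _ in range(10)])
print({w: round(word_prob(w), 3) for w in vocab})
```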