A Note on EM Algorithm for Probabilistic Latent Semantic Analysis

In many text collections, we encounter documents that contain multiple topics. Extracting such topics/subtopics/themes from the text collection is important for many text mining tasks, such as search result organization, subtopic retrieval, passage segmentation, document clustering, and contextual text mining. A well-accepted practice is to explain the generation of each document with a probabilistic topic model. In such a model, every topic is represented by a multinomial distribution over the vocabulary (i.e., a unigram language model), and the model itself is usually a mixture of k such components, one per topic. One standard probabilistic topic model is Probabilistic Latent Semantic Analysis (PLSA), which is also known as Probabilistic Latent Semantic Indexing (PLSI) when used in information retrieval [3].

The basic idea of PLSA is to treat the words in each document as observations from a mixture model whose component models are the topic word distributions. The selection among the components is controlled by a set of mixing weights, and words in the same document share the same mixing weights. We may also add a background component that explains non-topical words (function words) with a background word distribution. Specifically, let θ1, ..., θk be k topic unigram language models (i.e., word distributions) and let θB be a background model for the whole collection C. A word w in a document d is regarded as a sample of the following mixture model (based on word generation).
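For intuition, the short sketch below simulates this word-generation process and evaluates the resulting mixture probability of a word. It is only an illustrative toy: the vocabulary, the word distributions, the background mixing weight lambda_B, and the document-specific topic weights pi_d are assumed values, not quantities taken from this note.

```python
import random

# Toy vocabulary; all probability values below are illustrative assumptions.
vocab = ["the", "of", "data", "mining", "text", "retrieval"]

# Background model theta_B: favors non-topical (function) words.
theta_B = {"the": 0.4, "of": 0.3, "data": 0.1, "mining": 0.05,
           "text": 0.1, "retrieval": 0.05}

# k = 2 topic unigram language models theta_1, theta_2.
topics = [
    {"the": 0.05, "of": 0.05, "data": 0.4, "mining": 0.4,
     "text": 0.05, "retrieval": 0.05},
    {"the": 0.05, "of": 0.05, "data": 0.05, "mining": 0.05,
     "text": 0.4, "retrieval": 0.4},
]

lambda_B = 0.5      # assumed probability of drawing a word from the background
pi_d = [0.7, 0.3]   # assumed mixing weights over the k topics for document d


def sample_from(dist):
    """Draw one word from a unigram distribution given as {word: prob}."""
    r = random.random()
    cumulative = 0.0
    for word, prob in dist.items():
        cumulative += prob
        if r < cumulative:
            return word
    return word  # guard against floating-point rounding


def generate_word():
    """Generate one word of document d from the background + topic mixture."""
    if random.random() < lambda_B:
        return sample_from(theta_B)                      # background word
    j = random.choices(range(len(topics)), weights=pi_d)[0]
    return sample_from(topics[j])                        # word from topic j


def word_prob(w):
    """Mixture probability of word w in document d:
    lambda_B * p(w|theta_B) + (1 - lambda_B) * sum_j pi_d[j] * p(w|theta_j)."""
    topical = sum(pi_d[j] * topics[j][w] for j in range(len(topics)))
    return lambda_B * theta_B[w] + (1 - lambda_B) * topical


print([generate_word() for _ in range(10)])
print({w: round(word_prob(w), 3) for w in vocab})
```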