Using Stochastic Helmholtz Machine for Text Learning

We present an approach to text analysis, in particular topic-word extraction and document classification, based on a probabilistic generative model. Generative models are useful because they can extract the underlying causal structure of data objects. The model is a stochastic Helmholtz machine, fitted with the wake-sleep algorithm, a simple stochastic learning algorithm. Given a document set, the Helmholtz machine tries to capture the correlations among the words used in the set and can thereby extract various semantic features of the documents. We present experimental results on topic-word extraction for the TDT-2 and TREC-8 ad hoc data sets. For another real-world document set, the 20 Newsgroups collection, we perform categorization and compare the performance with that of the naive Bayes classifier, another simple generative model. Additionally, we present preliminary work on making Helmholtz machines more suitable for processing text documents.
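
As a rough illustration of wake-sleep fitting, the following is a minimal sketch, not the implementation used in this work. It assumes a single hidden layer of binary stochastic units, documents encoded as binary word-presence vectors, and the standard local delta-rule updates of the wake-sleep algorithm. The class name `HelmholtzMachine`, its methods, and all parameter values are hypothetical and chosen only for this example.

```python
import math
import random


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sample(probs, rng):
    # Draw one binary value per unit from its Bernoulli probability.
    return [1 if rng.random() < p else 0 for p in probs]


class HelmholtzMachine:
    """Hypothetical one-hidden-layer binary Helmholtz machine (assumption)."""

    def __init__(self, n_visible, n_hidden, lr=0.05, seed=0):
        self.rng = random.Random(seed)
        self.lr = lr
        self.nv, self.nh = n_visible, n_hidden
        # Generative (top-down) parameters: hidden prior, hidden-to-word weights.
        self.gen_bias_h = [0.0] * n_hidden
        self.gen_w = [[0.0] * n_visible for _ in range(n_hidden)]
        self.gen_bias_v = [0.0] * n_visible
        # Recognition (bottom-up) parameters: word-to-hidden weights.
        self.rec_w = [[0.0] * n_hidden for _ in range(n_visible)]
        self.rec_bias_h = [0.0] * n_hidden

    def recognize(self, v):
        # Recognition probabilities of hidden units given a document vector.
        return [sigmoid(self.rec_bias_h[j]
                        + sum(v[i] * self.rec_w[i][j] for i in range(self.nv)))
                for j in range(self.nh)]

    def generate(self, h):
        # Generative probabilities of words given a hidden state.
        return [sigmoid(self.gen_bias_v[i]
                        + sum(h[j] * self.gen_w[j][i] for j in range(self.nh)))
                for i in range(self.nv)]

    def wake(self, v):
        # Wake phase: infer a hidden state from real data with the recognition
        # weights, then adjust the generative weights to reconstruct the data.
        h = sample(self.recognize(v), self.rng)
        p_v = self.generate(h)
        for j in range(self.nh):
            self.gen_bias_h[j] += self.lr * (h[j] - sigmoid(self.gen_bias_h[j]))
            for i in range(self.nv):
                self.gen_w[j][i] += self.lr * h[j] * (v[i] - p_v[i])
        for i in range(self.nv):
            self.gen_bias_v[i] += self.lr * (v[i] - p_v[i])

    def sleep(self):
        # Sleep phase: dream a (hidden, visible) pair from the generative
        # model, then adjust the recognition weights to recover the hidden state.
        h = sample([sigmoid(b) for b in self.gen_bias_h], self.rng)
        v = sample(self.generate(h), self.rng)
        q_h = self.recognize(v)
        for j in range(self.nh):
            self.rec_bias_h[j] += self.lr * (h[j] - q_h[j])
            for i in range(self.nv):
                self.rec_w[i][j] += self.lr * v[i] * (h[j] - q_h[j])

    def fit(self, docs, epochs=50):
        # Alternate one wake step and one sleep step per document.
        for _ in range(epochs):
            for v in docs:
                self.wake(v)
                self.sleep()
```

Under these assumptions, a call such as `HelmholtzMachine(n_visible=4, n_hidden=2).fit(docs)` on a list of binary word-presence vectors trains both sets of weights. The words most strongly connected to each hidden unit in `gen_w` could then be read off as candidate topic words, which is one plausible way to connect the sketch to the topic-word extraction described above.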