Dynamic auto-encoders for semantic indexing

We present a new algorithm for topic modeling, text classification, and retrieval, tailored to sequences of time-stamped documents. Based on the auto-encoder architecture, our nonlinear multi-layer model is trained stage-wise to produce increasingly compact representations of bags-of-words at the document or paragraph level, thus performing a semantic analysis. It also incorporates simple temporal dynamics on the latent representations to exploit the inherent structure of document sequences, and can simultaneously perform supervised classification or regression on document labels. The model is learned by maximizing its joint likelihood, using approximate gradient-based MAP inference. We show that by minimizing a weighted cross-entropy loss between histograms of word occurrences and their reconstructions, we directly minimize the topic-model perplexity, and our topic model obtains lower perplexity than Latent Dirichlet Allocation on the NIPS and State of the Union datasets. We illustrate how the dynamical constraints aid learning while making topic trajectories easy to visualize. Finally, we report state-of-the-art information retrieval and classification results on the Reuters collection, as well as an application to volatility forecasting from financial news.
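To make the perplexity claim concrete, here is a minimal sketch of the standard argument, in notation we introduce for this purpose (the abstract itself defines no symbols): let $n_{d,w}$ be the count of word $w$ in document $d$, $N_d = \sum_w n_{d,w}$ the document length, $\hat p_{d,w} = n_{d,w}/N_d$ the empirical histogram, and $q_{d,w}$ the model's reconstructed word distribution. Then

\[
\mathrm{PPL} = \exp\!\left(-\frac{\sum_d \sum_w n_{d,w}\,\log q_{d,w}}{\sum_d N_d}\right),
\qquad
-\sum_w n_{d,w}\,\log q_{d,w} = N_d\, H(\hat p_d, q_d),
\]

so the exponent of the perplexity is the length-weighted average of the cross-entropies $H(\hat p_d, q_d)$ between each histogram and its reconstruction, and minimizing the weighted cross-entropy loss minimizes log-perplexity directly.

A compact code sketch of a training objective in this spirit follows; it is a hypothetical illustration, not the authors' implementation. The class and parameter names (`DynamicAutoEncoder`, `alpha`, `beta`) are ours, a single-layer encoder stands in for the multi-layer stage-wise model, and a random-walk penalty on consecutive latent codes stands in for the unspecified "simple temporal dynamics":

```python
# Hedged sketch (not the authors' code): a one-layer dynamic auto-encoder
# objective combining (i) document-length-weighted cross-entropy between each
# bag-of-words histogram and its softmax reconstruction, (ii) a quadratic
# penalty tying consecutive latent codes (assumed random-walk dynamics), and
# (iii) a supervised classification term on document labels.
import torch
import torch.nn.functional as F

class DynamicAutoEncoder(torch.nn.Module):
    def __init__(self, vocab_size, latent_dim, num_labels):
        super().__init__()
        self.encoder = torch.nn.Linear(vocab_size, latent_dim)
        self.decoder = torch.nn.Linear(latent_dim, vocab_size)
        self.classifier = torch.nn.Linear(latent_dim, num_labels)

    def forward(self, counts):
        z = torch.tanh(self.encoder(counts))   # latent code per document
        logits = self.decoder(z)               # reconstruction logits over vocab
        return z, logits

def loss(model, counts, labels, alpha=1.0, beta=1.0):
    # counts: (T, V) word counts for T time-ordered documents; labels: (T,)
    z, logits = model(counts)
    log_q = F.log_softmax(logits, dim=1)
    # -(counts * log_q) sums to N_d * H(p_d, q_d): the weighted cross-entropy
    recon = -(counts * log_q).sum()
    # simple latent dynamics: penalize jumps between consecutive codes
    dyn = ((z[1:] - z[:-1]) ** 2).sum()
    # supervised term on document labels
    sup = F.cross_entropy(model.classifier(z), labels, reduction="sum")
    return recon + alpha * dyn + beta * sup
```

The `recon` term equals $\sum_d N_d\, H(\hat p_d, q_d)$, matching the weighted cross-entropy above, so driving it down is equivalent to driving down log-perplexity on the training documents.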
