Text Modeling using Unsupervised Topic Models and Concept Hierarchies

Statistical topic models provide a general data-driven framework for automated discovery of high-level knowledge from large collections of text documents. While topic models can potentially discover a broad range of themes in a data set, the interpretability of the learned topics is not always ideal. Human-defined concepts, on the other hand, tend to be semantically richer due to careful selection of words to define concepts but they tend not to cover the themes in a data set exhaustively. In this paper, we propose a probabilistic framework to combine a hierarchy of human-defined semantic concepts with statistical topic models to seek the best of both worlds. Experimental results using two different sources of concept hierarchies and two collections of text documents indicate that this combination leads to systematic improvements in the quality of the associated language models as well as enabling new techniques for inferring and visualizing the semantics of a document.
