Combining Background Knowledge and Learned Topics

Statistical topic models provide a general data-driven framework for automated discovery of high-level knowledge from large collections of text documents. Although topic models can potentially discover a broad range of themes in a data set, the interpretability of the learned topics is not always ideal. Human-defined concepts, by contrast, tend to be semantically richer because the words that define them are carefully selected, but they may not span the themes in a data set exhaustively. In this study, we review a new probabilistic framework that combines a hierarchy of human-defined semantic concepts with a statistical topic model to seek the best of both worlds. Results indicate that this combination leads to systematic improvements in generalization performance and enables new techniques for inferring and visualizing the content of a document.
