Modeling Documents by Combining Semantic Concepts with Unsupervised Statistical Learning

Human-defined concepts are fundamental building blocks in constructing knowledge bases such as ontologies. Statistical learning techniques provide an alternative, automated approach to concept definition, driven by data rather than prior knowledge. In this paper we propose a probabilistic modeling framework that combines human-defined concepts and data-driven topics in a principled manner. The methodology is based on statistical topic models (also known as latent Dirichlet allocation models). We demonstrate the utility of this general framework in two ways. First, we illustrate how the methodology can be used to automatically tag Web pages with concepts from a known set, without any need for labeled documents. Second, we perform a series of experiments that quantify how combining human-defined semantic knowledge with data-driven techniques leads to better language models than can be obtained with either alone.
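
To make the tagging idea concrete, the following is a minimal sketch and not the model proposed in the paper. It assumes each human-defined concept is given as a set of words, assigns each concept a uniform word distribution over that set (an assumption made here for brevity; a full model would estimate these distributions from data), and uses expectation-maximization to estimate how strongly each concept is present in a single unlabeled document. The function name `tag_document` and the toy concepts are hypothetical. Words outside every concept are simply skipped here, whereas in a combined framework data-driven topics would account for them.

```python
from collections import Counter


def tag_document(tokens, concepts, n_iter=50):
    """Estimate per-document concept weights with EM (illustrative sketch).

    tokens   : list of word strings for one document
    concepts : dict mapping concept name -> non-empty set of words (assumed input)
    """
    # Assumption: p(word | concept) is uniform over the concept's word set.
    p_w = {c: {w: 1.0 / len(ws) for w in ws} for c, ws in concepts.items()}
    vocab = set().union(*concepts.values())

    # Keep only words that belong to at least one concept.
    counts = Counter(t for t in tokens if t in vocab)
    total = sum(counts.values())
    if total == 0:
        return {c: 0.0 for c in concepts}

    # Start from uniform mixture weights over concepts.
    theta = {c: 1.0 / len(concepts) for c in concepts}
    for _ in range(n_iter):
        expected = {c: 0.0 for c in concepts}
        for w, n in counts.items():
            # E-step: responsibility of each concept for word w.
            scores = {c: theta[c] * p_w[c].get(w, 0.0) for c in concepts}
            z = sum(scores.values())
            for c in concepts:
                expected[c] += n * scores[c] / z
        # M-step: renormalize expected counts into mixture weights.
        theta = {c: v / total for c, v in expected.items()}
    return theta


if __name__ == "__main__":
    toy_concepts = {
        "sports": {"ball", "team", "score"},
        "finance": {"bank", "stock", "score"},
    }
    doc = "the team kicked the ball and the score rose".split()
    print(tag_document(doc, toy_concepts))
```

In this toy run the ambiguous word "score" belongs to both concepts, and the unambiguous words "team" and "ball" pull its probability mass toward the "sports" concept. No labeled documents are involved; the only supervision is the concept word sets themselves.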
