Improving Topic Models using Conceptual Data

Laboratoire ERIC, Université Lyon 2, 5 Avenue P. Mendes, Bron, 69676, France
{marian-andrei.rizoiu, julien.velcin}@univ-lyon2.fr

Abstract. We propose a system that uses conceptual knowledge to improve topic models by removing unrelated words from the simplified topic description. We use WordNet to detect which topical words are not conceptually similar to the others, and then test our assumptions against human judgment. Results obtained on two different corpora under different test conditions show that the words detected as unrelated had a much greater probability than the others of being chosen by human evaluators as not belonging to the topic at all. We show that there is a strong correlation between this probability and an automatically calculated topical fitness, and we discuss how the correlation varies with the method and data used.

Keywords: Topic Models, Ontologies, Evaluation, Improvement
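The idea of detecting the topical word that is least conceptually similar to the others can be illustrated with a minimal sketch. The abstract does not state which WordNet measure or scoring rule is used, so everything below is an assumption made for illustration: the function names, the "mean similarity to the other words" score, and the hand-made toy similarity values (which are not results from the paper). In practice, a WordNet-based measure would take the place of `toy_similarity`.

```python
from typing import Callable, Dict, List

Similarity = Callable[[str, str], float]


def mean_similarity_to_others(words: List[str], sim: Similarity) -> Dict[str, float]:
    """Score each word by its average similarity to the other topic words.

    Assumption: a low average marks a word as a poor conceptual fit.
    """
    scores = {}
    for word in words:
        others = [other for other in words if other != word]
        scores[word] = sum(sim(word, other) for other in others) / len(others)
    return scores


def least_related_word(words: List[str], sim: Similarity) -> str:
    """Return the word with the lowest average similarity to the rest."""
    scores = mean_similarity_to_others(words, sim)
    return min(scores, key=scores.get)


# Toy, hand-made similarity values (hypothetical, for illustration only).
TOY_SIM = {
    frozenset(("dog", "cat")): 0.8,
    frozenset(("dog", "horse")): 0.7,
    frozenset(("cat", "horse")): 0.7,
    frozenset(("dog", "invoice")): 0.1,
    frozenset(("cat", "invoice")): 0.1,
    frozenset(("horse", "invoice")): 0.1,
}


def toy_similarity(a: str, b: str) -> float:
    """Symmetric lookup in the toy table; unknown pairs default to 0.0."""
    return TOY_SIM.get(frozenset((a, b)), 0.0)


topic_words = ["dog", "cat", "horse", "invoice"]
outlier = least_related_word(topic_words, toy_similarity)
print(outlier)
```

With these toy values, "invoice" has the lowest average similarity and is the word flagged as unrelated. Passing the similarity function as a parameter keeps the scoring rule independent of the particular conceptual resource used.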
