Partially labeled topic models for interpretable text mining

Abstract: Much of the world's electronic text is annotated with human-interpretable labels, such as tags on web pages and subject codes on academic publications. Effective text mining in this setting requires models that can flexibly account for the textual patterns that underlie the observed labels while still discovering unlabeled topics. Neither supervised classification, with its focus on label prediction, nor purely unsupervised learning, which does not model the labels explicitly, is appropriate. In this paper, we present two new partially supervised generative models of labeled text, Partially Labeled Dirichlet Allocation (PLDA) and the Partially Labeled Dirichlet Process (PLDP). These models use the unsupervised learning machinery of topic models to discover the hidden topics within each label, as well as unlabeled, corpus-wide latent topics. We explore applications through qualitative case studies of tagged web pages from del.icio.us and of PhD dissertation abstracts, demonstrating improved model interpretability over traditional topic models. We use the many tags present in our del.icio.us dataset to demonstrate quantitatively that the new models correlate more strongly with human relatedness scores than several strong baselines.
