Semi-Supervised Latent Dirichlet Allocation and Its Application for Document Classification

Latent Dirichlet Allocation (LDA) is an unsupervised topic modeling method widely applied in natural language processing. However, standard LDA cannot use supervised labels to incorporate expert knowledge into the learning procedure. This paper describes a semi-supervised LDA (ssLDA) method that supports multiple topic labels per document, so that available expert knowledge can be incorporated during model construction. This aligns the resulting model with human expectations for topic modeling and extraction. We apply ssLDA to the document classification problem on benchmark datasets. We investigate how the size of the training set and the proportion of supervised data affect the final model structure and improve prediction accuracy.
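The abstract does not say how the labels enter the model, so the following is only an illustrative sketch under stated assumptions, not the paper's algorithm. It assumes a collapsed Gibbs sampler in which a labeled document may only use the topics in its label set (a restriction in the style of Labeled LDA), while an unlabeled document may use any topic. The function name `ss_lda_gibbs`, its parameters, and the masking scheme are all hypothetical.

```python
import numpy as np

def ss_lda_gibbs(docs, labels, n_topics, vocab_size,
                 alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Hypothetical semi-supervised LDA sketch (not the paper's method).

    docs:   list of documents, each a list of integer word ids.
    labels: labels[d] is a set of allowed topic ids, or None if unlabeled.
    Returns (theta, phi): document-topic and topic-word distributions.
    """
    rng = np.random.default_rng(seed)
    n_docs = len(docs)

    # Assumption: mask[d, k] = 1 if topic k is allowed in document d.
    mask = np.ones((n_docs, n_topics))
    for d, lab in enumerate(labels):
        if lab is not None:
            mask[d] = 0.0
            mask[d, list(lab)] = 1.0

    n_dk = np.zeros((n_docs, n_topics))      # topic counts per document
    n_kw = np.zeros((n_topics, vocab_size))  # word counts per topic
    n_k = np.zeros(n_topics)                 # total words per topic

    # Random initialization restricted to each document's allowed topics.
    z = []
    for d, doc in enumerate(docs):
        allowed = np.flatnonzero(mask[d])
        zd = rng.choice(allowed, size=len(doc))
        z.append(zd)
        for w, k in zip(doc, zd):
            n_dk[d, k] += 1
            n_kw[k, w] += 1
            n_k[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = z[d][i]
                n_dk[d, k] -= 1
                n_kw[k, w] -= 1
                n_k[k] -= 1
                # Standard collapsed-Gibbs conditional, zeroed outside the mask.
                p = (mask[d] * (n_dk[d] + alpha)
                     * (n_kw[:, w] + beta) / (n_k + vocab_size * beta))
                k = rng.choice(n_topics, p=p / p.sum())
                z[d][i] = k
                n_dk[d, k] += 1
                n_kw[k, w] += 1
                n_k[k] += 1

    theta = mask * (n_dk + alpha)
    theta /= theta.sum(axis=1, keepdims=True)
    phi = (n_kw + beta) / (n_k[:, None] + vocab_size * beta)
    return theta, phi

# Toy usage: two labeled documents and one unlabeled document.
docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]
labels = [{0}, {1}, None]
theta, phi = ss_lda_gibbs(docs, labels, n_topics=2, vocab_size=4, n_iter=20)
```

In this sketch the proportion of supervised data corresponds to the fraction of entries in `labels` that are not `None`. For classification, one could take the highest-probability topic in a row of `theta` as the predicted class, which is again an assumption rather than the evaluation protocol of the paper.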
