Sprinkling Topics for Weakly Supervised Text Classification

Supervised text classification algorithms require a large number of human-labeled documents, and producing those labels is a labor-intensive and time-consuming process. In this paper, we propose a weakly supervised algorithm in which supervision comes in the form of labels assigned to Latent Dirichlet Allocation (LDA) topics. We then use this weak supervision to “sprinkle” artificial words into the training documents, so that the topics identified accord with the underlying class structure of the corpus through higher-order word associations. We evaluate this approach, which aims to improve text classification performance, on three real-world datasets.
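The sprinkling step can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes that an earlier LDA run has already given each document a dominant topic, and that an annotator has mapped some topic ids to class names. The function name `sprinkle`, the `__class_i__` token format, and the `n_words` parameter are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Sequence


def sprinkle(
    docs: Sequence[List[str]],
    doc_topics: Sequence[int],
    topic_labels: Dict[int, str],
    n_words: int = 2,
) -> List[List[str]]:
    """Append artificial class words to each document.

    docs         -- tokenized documents
    doc_topics   -- dominant topic id per document (assumed to come from LDA)
    topic_labels -- annotator-provided class name for some topic ids
    n_words      -- number of artificial words to add per document (assumption)
    """
    if len(docs) != len(doc_topics):
        raise ValueError("docs and doc_topics must have the same length")

    sprinkled = []
    for tokens, topic in zip(docs, doc_topics):
        label = topic_labels.get(topic)
        if label is None:
            # Topic was not labeled by the annotator: leave the document as-is.
            sprinkled.append(list(tokens))
            continue
        # Artificial words are shared by all documents of the same class,
        # so they co-occur with that class's vocabulary.
        artificial = [f"__{label}_{i}__" for i in range(n_words)]
        sprinkled.append(list(tokens) + artificial)
    return sprinkled


# Toy example with hypothetical data.
docs = [["goal", "match"], ["election", "vote"], ["misc"]]
doc_topics = [0, 1, 2]
topic_labels = {0: "sports", 1: "politics"}  # topic 2 left unlabeled
sprinkled = sprinkle(docs, doc_topics, topic_labels)
```

In this sketch, a topic model would presumably be refitted on the sprinkled corpus so that the shared artificial words pull documents of the same class toward the same topics; that refitting step is not shown here.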
