Unsupervised Document Classification with Informed Topic Models

Document classification is an important and common application in natural language processing. Scaling classification approaches to many targets faces a bottleneck in acquiring gold standard labels. In this work, we develop and evaluate a method for using informed topic models to noisily label documents, creating a noisy but usable set of labels for training discriminative classifiers. We investigate multiple ways to train this noisy classifier, and the best performing method uses Wikipedia-seeded topic models to approximately label training instances without any supervision. We evaluate these methods on the classification task as well as in an active learning setting, in which they are shown to improve learning rates over traditional active learning.

[1]  Mark Steyvers,et al.  Finding scientific topics , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[2]  Thomas L. Griffiths,et al.  The Author-Topic Model for Authors and Documents , 2004, UAI.

[3]  Hal Daumé,et al.  Incorporating Lexical Priors into Topic Models , 2012, EACL.

[4]  Dietrich Rebholz-Schuhmann,et al.  Calbc Silver Standard Corpus , 2010, J. Bioinform. Comput. Biol..

[5]  Thorsten Joachims,et al.  Making large scale SVM learning practical , 1998 .

[6]  Laura Schweitzer,et al.  Advances In Kernel Methods Support Vector Learning , 2016 .

[7]  Wei Li,et al.  Pachinko allocation: DAG-structured mixture models of topic correlations , 2006, ICML.

[8]  Andrew McCallum,et al.  Rethinking LDA: Why Priors Matter , 2009, NIPS.

[9]  Ian H. Witten,et al.  The WEKA data mining software: an update , 2009, SKDD.

[10]  Grigorios Tsoumakas,et al.  Multi-Label Classification: An Overview , 2007, Int. J. Data Warehous. Min..

[11]  Burr Settles,et al.  Active Learning Literature Survey , 2009 .

[12]  Ramesh Nallapati,et al.  Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora , 2009, EMNLP.

[13]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[14]  Geoff Holmes,et al.  Classifier chains for multi-label classification , 2009, Machine Learning.

[15]  Yoav Freund,et al.  Experiments with a New Boosting Algorithm , 1996, ICML.

[16]  Fabrizio Sebastiani,et al.  Machine learning in automated text categorization , 2001, CSUR.