Effective Document Labeling with Very Few Seed Words: A Topic Model Approach

Developing text classifiers often requires a large number of labeled documents as training examples. However, manually labeling documents is costly and time-consuming. Recently, a few methods have been proposed that label documents using only a small set of relevant keywords for each category, a setting known as dataless text classification. In this paper, we propose a Seed-Guided Topic Model (named STM) for the dataless text classification task. Given a collection of unlabeled documents and, for each category, a small set of seed words relevant to the category's semantic meaning, STM predicts the category labels of the documents through posterior topic inference. STM models two kinds of topics: category-topics and general-topics. Each category-topic is associated with one specific category and represents its semantic meaning; the general-topics capture the global semantic information underlying the whole document collection. STM assumes that each document is associated with a single category-topic and a mixture of general-topics. A novelty of the model is that STM learns the topics by exploiting the explicit word co-occurrence patterns between the seed words and regular words (i.e., non-seed words) in the document collection. A document is then labeled, or classified, based on its posterior category-topic assignment. Experiments on two widely used datasets show that STM consistently outperforms state-of-the-art dataless text classifiers. On some tasks, STM achieves classification accuracy comparable to, or even better than, state-of-the-art supervised learning solutions. Our experimental results further show that STM is insensitive to its tuning parameters: stable performance with little variation is achieved across a broad range of parameter settings, making it a desirable choice for real applications.
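To make the seed-guided labeling idea concrete, the sketch below approximates it with an EM-style naive Bayes model whose category-word distributions are anchored to the seed words and whose documents each belong to a single category, loosely mirroring STM's "one category-topic per document" assumption. This is a minimal illustration under stated assumptions, not STM's actual inference procedure (the abstract does not specify it, and general-topics are omitted here); the function name `seed_guided_em` and all parameters are hypothetical.

```python
# A minimal sketch of the seed-guided labeling idea, NOT the paper's
# actual STM inference. It runs EM over a naive-Bayes-style model whose
# per-category word distributions are initialized and re-anchored with
# the seed words. All names are illustrative.
import numpy as np

def seed_guided_em(doc_term, seed_ids, n_iters=20, eta=0.01):
    """doc_term: (D, V) term-count matrix; seed_ids: list (length C) of
    per-category lists of seed-word column indices.
    Returns a (D, C) matrix of per-document category posteriors."""
    D, V = doc_term.shape
    C = len(seed_ids)
    # Category-word distributions: uniform smoothing plus a strong
    # boost on each category's seed words (the "seed prior").
    phi = np.full((C, V), eta)
    for c, ids in enumerate(seed_ids):
        phi[c, ids] += 1.0
    phi /= phi.sum(axis=1, keepdims=True)
    pi = np.full(C, 1.0 / C)  # category prior
    for _ in range(n_iters):
        # E-step: posterior over categories for each document,
        # computed in log space to avoid underflow.
        log_post = np.log(pi) + doc_term @ np.log(phi).T  # (D, C)
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: re-estimate word distributions and priors, keeping
        # the seed-word boost so categories stay tied to their seeds.
        phi = post.T @ doc_term + eta  # (C, V)
        for c, ids in enumerate(seed_ids):
            phi[c, ids] += 1.0
        phi /= phi.sum(axis=1, keepdims=True)
        pi = post.mean(axis=0)
    return post
```

Given a bag-of-words matrix and per-category seed indices, `np.argmax(seed_guided_em(doc_term, seed_ids), axis=1)` yields each document's predicted label, echoing in spirit STM's labeling by posterior category-topic assignment.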
