A Pseudo Label based Dataless Naive Bayes Algorithm for Text Classification with Seed Words

Traditional supervised text classifiers require a large number of manually labeled documents, which are often expensive to obtain. Recently, dataless text classification has attracted increasing attention, since it requires only a few seed words per category, which are much cheaper to acquire. In this paper, we develop a pseudo-label based dataless Naive Bayes (PL-DNB) classifier with seed words. We initialize a pseudo-label for each document from seed word occurrences, and employ the expectation-maximization (EM) algorithm to train PL-DNB in a semi-supervised manner. The pseudo-labels are iteratively updated using a mixture of seed word occurrences and the estimated label posteriors. To avoid noisy pseudo-labels, the update step also incorporates information from each document's nearest neighbors, i.e., it preserves the local neighborhood structure of the documents. We empirically show that PL-DNB outperforms traditional dataless text classification algorithms with seed words; in particular, it performs well on imbalanced datasets.
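To make the pipeline described above concrete, the following is a minimal sketch of such a training loop: pseudo-labels are initialized from seed word counts, a multinomial Naive Bayes model is re-estimated from the soft labels (M-step), label posteriors are computed (E-step), and the pseudo-labels are updated as a mixture of seed evidence and posteriors, then averaged with nearest neighbors. All hyperparameters (lam, n_neighbors, n_iters, alpha) and the exact mixing and neighbor-smoothing formulas here are illustrative assumptions, not the paper's actual update rules.

```python
# Illustrative sketch of a PL-DNB-style training loop (assumed details).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import NearestNeighbors

def pl_dnb(docs, seed_words, lam=0.5, n_neighbors=5, n_iters=20, alpha=1.0):
    """docs: list of strings; seed_words: list of seed-word lists, one per class."""
    vec = CountVectorizer()
    X = vec.fit_transform(docs).toarray()          # document-term count matrix
    vocab = vec.vocabulary_
    N, V = X.shape
    K = len(seed_words)

    # Initialize pseudo-labels from seed word occurrence counts per class.
    S = np.zeros((N, K))
    for k, words in enumerate(seed_words):
        idx = [vocab[w] for w in words if w in vocab]
        S[:, k] = X[:, idx].sum(axis=1)
    S = (S + 1e-10) / (S + 1e-10).sum(axis=1, keepdims=True)
    Q = S.copy()                                   # soft pseudo-label distribution

    # Nearest neighbors (cosine) used to preserve local neighborhood structure.
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1, metric="cosine").fit(X)
    _, nbrs = nn.kneighbors(X)
    nbrs = nbrs[:, 1:]                             # drop each document itself

    for _ in range(n_iters):
        # M-step: re-estimate Naive Bayes parameters from soft pseudo-labels.
        prior = (Q.sum(axis=0) + alpha) / (N + K * alpha)
        counts = Q.T @ X                           # K x V expected word counts
        theta = (counts + alpha) / (counts.sum(axis=1, keepdims=True) + V * alpha)

        # E-step: label posteriors under the multinomial Naive Bayes model.
        log_post = np.log(prior) + X @ np.log(theta).T
        log_post -= log_post.max(axis=1, keepdims=True)
        P = np.exp(log_post)
        P /= P.sum(axis=1, keepdims=True)

        # Pseudo-label update: mix seed evidence with posteriors, then
        # average with nearest neighbors to smooth away noisy labels.
        Q = lam * S + (1 - lam) * P
        Q = 0.5 * Q + 0.5 * Q[nbrs].mean(axis=1)

    return Q.argmax(axis=1)
```

Under this sketch, `lam` trades off trust in the seed words against the model's posteriors, and the neighbor-averaging step plays the role of the local-structure regularizer mentioned in the abstract.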
