Learning to Identify Unexpected Instances in the Test Set

Traditional classification builds a classifier from labeled training examples drawn from a set of predefined classes and then applies it to assign test instances to those same classes. In practice, this paradigm can be problematic because the test data may contain instances that belong to none of the predefined classes. Detecting such unexpected instances in the test set is thus an important practical problem. It can be formulated as learning from positive and unlabeled examples (PU learning); however, current PU learning algorithms are effective only when the unlabeled set contains a large proportion of negative instances. This paper proposes a novel technique to solve the problem in the text classification domain. The technique first generates a single artificial negative document AN, and then uses the positive set P together with {AN} to build a naive Bayesian classifier. Our experimental results show that this method significantly outperforms existing techniques.
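As a rough illustration of the idea, the classifier can be sketched as a multinomial naive Bayes model trained on the positive set P and a single artificial negative document AN. The function names (`train_nb`, `log_odds`) and the choice of giving AN a uniform word distribution over the vocabulary are assumptions made for this sketch; they are not necessarily the paper's exact generation procedure for AN.

```python
from collections import Counter
import math

def train_nb(pos_docs, vocab):
    """Build word likelihoods for the positive class (from P) and an
    artificial negative class (from a single document AN)."""
    # Positive class: aggregate word counts over all positive documents.
    pos_counts = Counter()
    for doc in pos_docs:
        pos_counts.update(doc)
    pos_total = sum(pos_counts.values())
    # Artificial negative document AN: assumed here to contain every
    # vocabulary word with equal frequency (an illustrative choice).
    an_counts = {w: 1 for w in vocab}
    an_total = len(vocab)
    V = len(vocab)
    # Laplace-smoothed multinomial word probabilities per class.
    p_w_pos = {w: (pos_counts[w] + 1) / (pos_total + V) for w in vocab}
    p_w_neg = {w: (an_counts[w] + 1) / (an_total + V) for w in vocab}
    return p_w_pos, p_w_neg

def log_odds(doc, p_w_pos, p_w_neg):
    """log P(doc | positive) - log P(doc | AN); a positive score means
    the document looks like the known classes, a negative score flags
    it as a candidate unexpected instance."""
    score = 0.0
    for w in doc:
        if w in p_w_pos:
            score += math.log(p_w_pos[w]) - math.log(p_w_neg[w])
    return score
```

Because AN spreads its probability mass evenly over the vocabulary, test documents dominated by words that are rare in P score below zero and are flagged as unexpected.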
