Text classification without negative examples revisit

Traditionally, building a classifier requires two sets of examples: positive and negative. This paper studies the problem of building a text classifier from positive examples (P) and unlabeled examples (U) only, where the unlabeled set is a mixture of positive and negative examples. Since no negative examples are given explicitly, building a reliable text classifier becomes far more challenging. Simply treating all of the unlabeled examples as negative and building a classifier on that basis is clearly a poor approach. Most prior studies solve this problem with a two-step heuristic: first, extract negative examples (N) from U; second, build a classifier based on P and N. Surprisingly, most studies did not try to extract positive examples from U. Intuitively, enlarging P with P' (positive examples extracted from U) before building the classifier should improve its effectiveness. In the course of our study, we found that extracting P' is very difficult: a document in U that possesses the features exhibited in P is not necessarily a positive example, and vice versa, and the very large size and high diversity of U add to the difficulty. In this paper, we propose a labeling heuristic called PNLH to tackle this problem. PNLH aims at extracting high-quality positive and negative examples from U and can be used on top of any existing classifier. Extensive experiments on several benchmarks indicate that PNLH is highly feasible, especially when |P| is extremely small.
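The labeling idea described above can be illustrated with a minimal sketch. This is not the paper's PNLH algorithm: it is an assumed, simplified stand-in in which documents are plain feature vectors, the least P-like documents in U (by cosine similarity to the positive centroid) are taken as reliable negatives N, documents nearly identical to the centroid are taken as extra positives P', and a Rocchio-style nearest-centroid rule serves as the final classifier. The threshold values are illustrative only.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def centroid(docs):
    # Component-wise mean of a list of vectors.
    n = len(docs)
    return [sum(col) / n for col in zip(*docs)]

def extract_pn(P, U, neg_frac=0.5, pos_thresh=0.95):
    """Label part of U: the least P-like documents become reliable
    negatives N; documents extremely similar to the positive centroid
    become extra positives P'. Both thresholds are hypothetical,
    not values from the paper."""
    c = centroid(P)
    scored = sorted(U, key=lambda d: cosine(d, c))
    k = max(1, int(len(scored) * neg_frac))
    N = scored[:k]                                        # reliable negatives
    P_extra = [d for d in scored[k:] if cosine(d, c) >= pos_thresh]
    return P_extra, N

def train_and_classify(P, N, doc):
    # Rocchio-style nearest-centroid decision on the enlarged sets.
    cp, cn = centroid(P), centroid(N)
    return "positive" if cosine(doc, cp) >= cosine(doc, cn) else "negative"
```

The sketch makes the abstract's caveat concrete: `pos_thresh` must be set very conservatively, because a document in U that merely resembles P is not guaranteed to be positive, which is exactly why extracting P' is harder than extracting N.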
