Partially Supervised Classification of Text Documents

We investigate the following problem: given a set of documents of a particular topic or class P, and a large set M of mixed documents that contains documents from class P and other types of documents, identify the documents from class P in M. The key feature of this problem is that there are no labeled non-P documents, which makes traditional machine learning techniques inapplicable, as they all need labeled documents of both classes. We call this problem partially supervised classification. In this paper, we show that this problem can be posed as a constrained optimization problem and that, under appropriate conditions, solutions to the constrained optimization problem will give good solutions to the partially supervised classification problem. We present a novel technique to solve the problem and demonstrate the effectiveness of the technique through extensive experimentation.
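To make the constrained-optimization view concrete, the following is a minimal sketch of one way such a problem could be set up. It is not the technique presented in the paper. It assumes, purely for illustration, that some scoring function has already assigned each document a score (higher meaning more P-like), and that the constraint is a minimum recall on the labeled P documents while the objective is to mark as few documents of M as possible. The names `pick_threshold`, `select_from_mixed`, and `min_recall` are hypothetical.

```python
import math


def pick_threshold(pos_scores, min_recall):
    """Return the highest threshold that keeps at least `min_recall`
    of the labeled P documents at or above it (illustrative assumption)."""
    if not pos_scores:
        raise ValueError("pos_scores must not be empty")
    if not 0 < min_recall <= 1:
        raise ValueError("min_recall must be in (0, 1]")
    ranked = sorted(pos_scores, reverse=True)
    # Number of labeled P documents that must stay classified as P.
    k = math.ceil(min_recall * len(ranked))
    return ranked[k - 1]


def select_from_mixed(pos_scores, mixed_scores, min_recall=0.9):
    """Return indices of documents in M whose score reaches the threshold.

    Raising the threshold can only shrink the selected part of M, so the
    highest threshold that satisfies the recall constraint also minimizes
    the number of documents of M marked as P."""
    threshold = pick_threshold(pos_scores, min_recall)
    return [i for i, s in enumerate(mixed_scores) if s >= threshold]


if __name__ == "__main__":
    pos = [0.9, 0.8, 0.7, 0.2]            # scores of labeled P documents
    mixed = [0.95, 0.1, 0.75, 0.3, 0.7]   # scores of documents in M
    print(select_from_mixed(pos, mixed, min_recall=0.75))
```

The sketch only tunes a single threshold on precomputed scores; learning the scoring function itself without labeled non-P documents is the harder part that the paper addresses.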
