Document filtering boosted by unlabeled data

This paper describes three learning methods for document filtering that exploit unlabeled data. Each method builds a committee of classifiers that is first trained on a small set of labeled data and then augmented with a large set of unlabeled data. Exploiting unlabeled data significantly reduces the number of labeled data needed and increases filtering accuracy. This matters because labeled data are difficult and time-consuming to obtain, whereas unlabeled data are abundant. For all three methods, the experimental results show that accuracy improves by up to 9.2% while using only two-thirds as many labeled data as the method that does not use unlabeled data.
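
The abstract does not specify the three methods, so the following is only a minimal sketch of the general idea it describes: a committee of classifiers trained on a few labeled documents and then augmented with unlabeled ones. Everything in it is an assumption made for illustration, not the paper's algorithm: the naive Bayes base learner, the deterministic leave-one-fold-out committee, the unanimous-agreement rule for accepting an unlabeled document, and all function names and toy documents.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Train a tiny multinomial naive Bayes model on tokenized documents."""
    word_counts = {0: Counter(), 1: Counter()}
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, Counter(labels), vocab

def predict_nb(model, tokens):
    """Return 1 (relevant) or 0 (irrelevant) using Laplace-smoothed log scores."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label in (0, 1):
        score = math.log((class_counts[label] + 1) / (total + 2))
        # The extra +1 reserves probability mass for words outside the vocabulary.
        denom = sum(word_counts[label].values()) + len(vocab) + 1
        for token in tokens:
            score += math.log((word_counts[label][token] + 1) / denom)
        scores[label] = score
    return 1 if scores[1] > scores[0] else 0

def train_committee(docs, labels, size):
    """Assumed committee construction: member k skips every example with index % size == k."""
    committee = []
    for k in range(size):
        keep = [i for i in range(len(docs)) if i % size != k]
        committee.append(train_nb([docs[i] for i in keep], [labels[i] for i in keep]))
    return committee

def committee_vote(committee, tokens):
    """Return the majority label and the share of members that voted for it."""
    votes = Counter(predict_nb(member, tokens) for member in committee)
    label, count = votes.most_common(1)[0]
    return label, count / len(committee)

def augment_with_unlabeled(docs, labels, unlabeled, size=3, rounds=3, agreement=1.0):
    """Repeatedly add unlabeled documents on which the committee agrees strongly enough."""
    docs, labels, pool = list(docs), list(labels), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        committee = train_committee(docs, labels, size)
        remaining = []
        for tokens in pool:
            label, share = committee_vote(committee, tokens)
            if share >= agreement:
                docs.append(tokens)      # accept the committee's label as a pseudo-label
                labels.append(label)
            else:
                remaining.append(tokens)
        if len(remaining) == len(pool):  # nothing was accepted, so stop early
            break
        pool = remaining
    return train_committee(docs, labels, size), docs, labels

# Toy data (hypothetical): 1 = relevant (finance), 0 = irrelevant (sports).
labeled_docs = [
    ["stock", "market", "price"], ["football", "goal", "team"],
    ["market", "trade", "stock"], ["team", "match", "goal"],
    ["price", "trade", "market"], ["match", "football", "team"],
]
labeled_y = [1, 0, 1, 0, 1, 0]
unlabeled_docs = [["stock", "trade", "price"], ["goal", "match", "football"]]

committee, docs, labels = augment_with_unlabeled(labeled_docs, labeled_y, unlabeled_docs)
```

Requiring full agreement before accepting a pseudo-label is a conservative choice: it adds fewer unlabeled documents per round but lowers the risk of training later committees on mislabeled data. Lowering `agreement` trades that safety for faster growth of the training set.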