Learning with Unlabeled Data for Text Categorization Using a Bootstrapping and a Feature Projection Technique

A wide range of supervised learning algorithms has been applied to Text Categorization. However, the supervised learning approaches have some problems. One of them is that they require a large, often prohibitive, number of labeled training documents for accurate learning. Generally, acquiring class labels for training data is costly, while gathering a large quantity of unlabeled data is cheap. We here propose a new automatic text categorization method for learning from only unlabeled data using a bootstrapping framework and a feature projection technique. From results of our experiments, our method showed reasonably comparable performance compared with a supervised method. If our method is used in a text categorization task, building text categorization systems will become significantly faster and less expensive.

[1]  Naftali Tishby,et al.  Unsupervised document classification using sequential information maximization , 2002, SIGIR '02.

[2]  Yiming Yang,et al.  A Study of Approaches to Hypertext Categorization , 2002, Journal of Intelligent Information Systems.

[3]  Yiming Yang,et al.  A Comparative Study on Feature Selection in Text Categorization , 1997, ICML.

[4]  Youngjoong Ko,et al.  Using the feature projection technique based on a normalized voting method for text classification , 2004, Inf. Process. Manag..

[5]  James P. Callan,et al.  Training algorithms for linear text classifiers , 1996, SIGIR '96.

[6]  Andrew McCallum,et al.  A comparison of event models for naive bayes text classification , 1998, AAAI 1998.

[7]  Thorsten Joachims,et al.  Text Categorization with Support Vector Machines: Learning with Many Relevant Features , 1998, ECML.

[8]  Sebastian Thrun,et al.  Learning to Classify Text from Labeled and Unlabeled Documents , 1998, AAAI/IAAI.

[9]  Tom M. Mitchell,et al.  Learning to construct knowledge bases from the World Wide Web , 2000, Artif. Intell..

[10]  Tom M. Mitchell,et al.  Using unlabeled data to improve text classification , 2001 .

[11]  Gail E. Kaiser,et al.  An Information Retrieval Approach For Automatically Constructing Software Libraries , 1991, IEEE Trans. Software Eng..

[12]  Shimon Edelman,et al.  Similarity-based Word Sense Disambiguation , 1998, CL.

[13]  Youngjoong Ko,et al.  Text Categorization using Feature Projections , 2002, COLING.

[14]  Ayhan Demiriz,et al.  Semi-Supervised Support Vector Machines , 1998, NIPS.

[15]  Eric Brill,et al.  Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part-of-Speech Tagging , 1995, CL.

[16]  Youngjoong Ko,et al.  Automatic Text Categorization by Unsupervised Learning , 2000, COLING.