Semi-supervised text classification from unlabeled documents using class associated words

Automatic text classification is an important field in machine learning. Unsupervised text classification needs no training data but is often criticized for clustering blindly. Supervised text classification needs large quantities of labeled training data to achieve high accuracy. In practice, however, labeled samples are often difficult, expensive, or time-consuming to obtain. Meanwhile, unlabeled documents can be collected easily owing to the rapidly developing Internet. Class associated words are words that represent the subject of a class and provide prior knowledge for training a classifier. A learning algorithm based on the combination of Expectation-Maximization (EM) and a Naïve Bayes classifier is introduced to classify fully unlabeled documents using class associated words. Experimental results show that it achieves high classification accuracy, especially for categories with small quantities of samples. In the algorithm, class associated words set classification constraints during the learning process: they restrict documents to the corresponding class labels and thereby improve classification accuracy.
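
The combination of EM and Naïve Bayes with class associated words can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the Laplace smoothing, the fixed iteration count, and in particular the clamping rule (a document whose class associated words all point to one class keeps that label throughout learning) are assumptions made here for illustration.

```python
import math
from collections import Counter

def seed_posterior(doc, seed_words, classes):
    """Initial soft label from counts of class associated words; uniform if none."""
    hits = [sum(1 for w in doc if w in seed_words[c]) for c in classes]
    total = sum(hits)
    if total == 0:
        return [1.0 / len(classes)] * len(classes)
    return [h / total for h in hits]

def m_step(docs, post, classes, vocab):
    """Re-estimate class priors and word probabilities (Laplace smoothing)."""
    n, k = len(docs), len(classes)
    priors = [(1.0 + sum(p[j] for p in post)) / (k + n) for j in range(k)]
    word_probs = []
    for j in range(k):
        counts = Counter()
        for doc, p in zip(docs, post):
            for w in doc:
                counts[w] += p[j]  # fractional count weighted by posterior
        denom = len(vocab) + sum(counts.values())
        word_probs.append({w: (1.0 + counts[w]) / denom for w in vocab})
    return priors, word_probs

def e_step(doc, priors, word_probs):
    """Posterior over classes for one document, computed in log space."""
    logs = [math.log(pr) + sum(math.log(wp[w]) for w in doc)
            for pr, wp in zip(priors, word_probs)]
    top = max(logs)
    exps = [math.exp(v - top) for v in logs]
    z = sum(exps)
    return [e / z for e in exps]

def classify(docs, seed_words, iterations=10):
    """Label tokenized documents using only class associated words as supervision."""
    classes = sorted(seed_words)
    vocab = sorted({w for doc in docs for w in doc})
    init = [seed_posterior(d, seed_words, classes) for d in docs]
    # Assumed constraint: documents whose seed words all indicate one class are clamped.
    clamped = [p if max(p) == 1.0 else None for p in init]
    post = init
    for _ in range(iterations):
        priors, word_probs = m_step(docs, post, classes, vocab)
        post = [c if c is not None else e_step(d, priors, word_probs)
                for d, c in zip(docs, clamped)]
    return [classes[p.index(max(p))] for p in post]

# Hypothetical toy corpus: two documents contain a class associated word, two do not.
docs = [
    ["goal", "match", "team"],
    ["team", "player", "score"],
    ["election", "vote", "party"],
    ["party", "vote", "minister"],
]
seeds = {"sports": {"goal"}, "politics": {"election"}}
labels = classify(docs, seeds)
```

In this toy run, the documents without class associated words are labeled through vocabulary they share with the clamped documents ("team", "party", "vote"), which is the role the unlabeled data plays in EM.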
