A novel semi supervised approach for text classification

Text categorization, also known as text classification is a supervised classification problem. It aims to assign a predefined class label or group to a new or unknown text document. Most of the time we need a collection of large data from each class to train the classifier. It may be noted that, it is very hard or expensive to collect labelled text data. In most cases we assign the label manually which is neither cost effective nor efficient. In this paper, we have introduced a semi-supervised classification approach where the learner needs very small amount of labelled data with a large amount of unlabeled data to assign a class label to a new or unknown text document. The proposed method uses Kohonen self organizing map (SOM) for labelling the unlabeled data and three classifiers namely support vector machine (SVM), Naïve Bayes (NB), and decision tree (DT): classification and regression tree (CART) for observing the accuracy of classification. The experimental results obtained show the effectiveness of our proposed method.

[1]  Chih-Jen Lin,et al.  Training and Testing Low-degree Polynomial Data Mappings via Linear SVM , 2010, J. Mach. Learn. Res..

[2]  Peter Willett,et al.  Recent trends in hierarchic document clustering: A critical review , 1988, Inf. Process. Manag..

[3]  Robert Sabourin,et al.  “One Against One” or “One Against All”: Which One is Better for Handwriting Recognition with SVMs? , 2006 .

[4]  Louise Guthrie,et al.  Document Classification By Machine: Theory and Practice , 1994, COLING.

[5]  Stephen E. Robertson,et al.  Relevance weighting of search terms , 1976, J. Am. Soc. Inf. Sci..

[6]  Philip S. Yu,et al.  On using partial supervision for text categorization , 2004, IEEE Transactions on Knowledge and Data Engineering.

[7]  Jörg Kindermann,et al.  Text Categorization with Support Vector Machines. How to Represent Texts in Input Space? , 2002, Machine Learning.

[8]  Stefan Wermter,et al.  Neural Network Agents for Learning Semantic Text Classification , 2000, Information Retrieval.

[9]  Bernhard E. Boser,et al.  A training algorithm for optimal margin classifiers , 1992, COLT '92.

[10]  Dik Lun Lee,et al.  Feature reduction for neural network based text categorization , 1999, Proceedings. 6th International Conference on Advanced Systems for Advanced Applications.

[11]  Chih-Jen Lin,et al.  A Practical Guide to Support Vector Classication , 2008 .

[12]  Vladimir Vapnik,et al.  Statistical learning theory , 1998 .

[13]  Ali Selamat,et al.  Web news classification using neural networks based on PCA , 2002, Proceedings of the 41st SICE Annual Conference. SICE 2002..

[14]  Derek Greene,et al.  Practical solutions to the problem of diagonal dominance in kernel document clustering , 2006, ICML.

[15]  Nirmalya Chowdhury,et al.  Unsupervised Text Classification Using Kohonen's Self Organizing Network , 2005, CICLing.

[16]  Stephen Grossberg,et al.  ARTMAP: supervised real-time learning and classification of nonstationary data by a self-organizing neural network , 1991, [1991 Proceedings] IEEE Conference on Neural Networks for Ocean Engineering.

[17]  David L. Waltz,et al.  Classifying news stories using memory based reasoning , 1992, SIGIR '92.

[18]  Mehran Sahami,et al.  Learning Limited Dependence Bayesian Classifiers , 1996, KDD.

[19]  Fabrizio Sebastiani,et al.  Machine learning in automated text categorization , 2001, CSUR.

[20]  Sebastian Thrun,et al.  Learning to Classify Text from Labeled and Unlabeled Documents , 1998, AAAI/IAAI.

[21]  Kari Torkkola,et al.  Discriminative features for text document classification , 2003, Formal Pattern Analysis & Applications.

[22]  Wai Lam,et al.  Using a generalized instance set for automatic text categorization , 1998, SIGIR '98.

[23]  David D. Lewis,et al.  An evaluation of phrasal and clustered representations on a text categorization task , 1992, SIGIR '92.

[24]  Tom M. Mitchell,et al.  Improving Text Classification by Shrinkage in a Hierarchy of Classes , 1998, ICML.

[25]  George Karypis,et al.  Centroid-Based Document Classification: Analysis and Experimental Results , 2000, PKDD.

[26]  Padmini Srinivasan,et al.  Hierarchical neural networks for text categorization (poster abstract) , 1999, SIGIR '99.

[27]  Hang Li,et al.  Document Classification Using a Finite Mixture Model , 1997, ACL.

[28]  W. Bruce Croft,et al.  Combining classifiers in text categorization , 1996, SIGIR '96.

[29]  Fionn Murtagh,et al.  A Survey of Recent Advances in Hierarchical Clustering Algorithms , 1983, Comput. J..

[30]  Roberto Basili,et al.  A robust model for intelligent text classification , 2001, Proceedings 13th IEEE International Conference on Tools with Artificial Intelligence. ICTAI 2001.

[31]  Naftali Tishby,et al.  Document clustering using word clusters via the information bottleneck method , 2000, SIGIR '00.

[32]  J. R. Landis,et al.  The measurement of observer agreement for categorical data. , 1977, Biometrics.

[33]  Peter J. Rousseeuw,et al.  Finding Groups in Data: An Introduction to Cluster Analysis , 1990 .

[34]  Martin Ester,et al.  Frequent term-based text clustering , 2002, KDD.

[35]  Yiming Yang,et al.  Expert network: effective and efficient learning from human decisions in text categorization and retrieval , 1994, SIGIR '94.

[36]  Thorsten Joachims,et al.  Text Categorization with Support Vector Machines: Learning with Many Relevant Features , 1998, ECML.

[37]  Anil K. Jain,et al.  Algorithms for Clustering Data , 1988 .

[38]  William A. Gale,et al.  A sequential algorithm for training text classifiers , 1994, SIGIR '94.