Clustering based two-stage text classification requiring minimal training data

Clustering-aided classification methods assume that clusters learned under the guidance of the initial training data characterize the underlying distribution of the data set reasonably well. However, our experiments show that whether this assumption holds depends on both the separability of the data set and the size of the training set. It is often violated on poorly separable data, especially when the initial training data are very few, and in that case clustering-based methods perform worse. In this paper, we propose a clustering-based two-stage text classification approach to address this problem. In the first stage, labeled and unlabeled data are clustered under the guidance of the labeled data, and a self-training-style clustering strategy then iteratively expands the training data under the guidance of an oracle or expert. In the second stage, discriminative classifiers are trained on the expanded labeled data set. Unlike other clustering-based methods, the proposed clustering strategy copes effectively with poorly separable data. Furthermore, the framework converts sparsely labeled text classification into a supervised problem, so supervised classification models such as SVM can be applied, and techniques developed for supervised learning, such as feature selection, sampling methods, and data editing or noise filtering, can be used to further improve classification accuracy. Our experimental results demonstrate the effectiveness of the proposed approach, especially when the training data set is very small.
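The two stages can be illustrated with a minimal Python sketch. The abstract does not specify the clustering algorithm, the rule for selecting documents, or the classifier, so everything below is an assumption made for illustration: clusters are nearest-centroid groups seeded from the labeled data, the documents closest to each centroid are sent to a hypothetical `oracle` callable, and a simple multi-class perceptron stands in for the discriminative classifier (the abstract names SVM as one option). The function names, parameters, and toy vectors are all hypothetical.

```python
import math
from collections import defaultdict

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def expand_labels(labeled, unlabeled, oracle, rounds=3, per_cluster=1):
    """Stage 1 (assumed form): seeded clustering plus oracle-guided expansion.

    labeled: list of (vector, label); unlabeled: list of vectors;
    oracle: hypothetical callable mapping a vector to its true label.
    """
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        # Recompute one centroid per class from the current labeled set.
        by_class = defaultdict(list)
        for x, y in labeled:
            by_class[y].append(x)
        cents = {y: centroid(xs) for y, xs in by_class.items()}
        # Assign each pooled vector to its nearest class centroid.
        clusters = defaultdict(list)
        for i, x in enumerate(pool):
            y = min(cents, key=lambda c: dist(x, cents[c]))
            clusters[y].append((dist(x, cents[y]), i))
        # Per cluster, send the members closest to the centroid to the oracle.
        picked = set()
        for members in clusters.values():
            for _, i in sorted(members)[:per_cluster]:
                picked.add(i)
        labeled += [(pool[i], oracle(pool[i])) for i in sorted(picked)]
        pool = [x for i, x in enumerate(pool) if i not in picked]
    return labeled

def train_perceptron(labeled, epochs=20):
    """Stage 2 stand-in: multi-class perceptron trained on the expanded set."""
    dim = len(labeled[0][0])
    weights = {y: [0.0] * (dim + 1) for _, y in labeled}  # last slot is bias
    classes = sorted(weights)

    def score(w, x):
        return w[-1] + sum(wi * xi for wi, xi in zip(w, x))

    def predict(x):
        return max(classes, key=lambda c: score(weights[c], x))

    for _ in range(epochs):
        for x, y in labeled:
            guess = predict(x)
            if guess != y:
                # Move the true class toward x and the wrong guess away from it.
                for j, xj in enumerate(x):
                    weights[y][j] += xj
                    weights[guess][j] -= xj
                weights[y][-1] += 1.0
                weights[guess][-1] -= 1.0
    return predict

# Toy usage: 2-D vectors stand in for document features (illustrative only).
seed = [([0.0, 0.0], "a"), ([5.0, 5.0], "b")]
pool = [[0.2, 0.1], [0.5, 0.4], [4.8, 5.1], [5.3, 4.9], [1.0, 0.8], [4.0, 4.2]]
toy_oracle = lambda x: "a" if x[0] + x[1] < 5 else "b"  # plays the expert

expanded = expand_labels(seed, pool, toy_oracle, rounds=2, per_cluster=1)
predict = train_perceptron(expanded)
```

Because the oracle confirms every label that enters the training set, the second stage sees only verified labels and can be treated as an ordinary supervised problem, which is the conversion the abstract describes.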
