Clustering support vector machines for unlabeled data classification

This paper proposes clustering support vector machines (CSVM) for unlabeled data classification. We often need to classify large amounts of data that are wholly unlabeled, and labeling such data manually is impractical. Clustering algorithms can be used to generate labels for this kind of data. In this paper, three algorithms (the global k-means clustering algorithm, the fast global k-means algorithm, and a global k-means variant that uses k-d trees) are each combined with a statistical method based on the F-distribution to generate labels for wholly unlabeled data, and the labeled data are then used to train an SVM for classification. The proposed approach (CSVM) is tested on four synthetically generated data sets, all of which were wholly unlabeled. The experimental results show that CSVM is efficient at classifying wholly unlabeled data.
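
The cluster-then-classify idea described above can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes a fixed number of clusters (k = 2) rather than selecting it with the F-distribution step, uses a plain version of the global k-means procedure (each data point is tried as the starting position of the new center), and replaces a full SVM solver with a simple linear SVM trained by subgradient descent on the hinge loss. All function names, parameters, and the synthetic data are hypothetical.

```python
import random

def sqdist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: sqdist(p, centers[j]))
            for p in points]

def kmeans(points, centers, iters=50):
    """Lloyd iterations from the given initial centers.
    Returns (centers, labels, sum of squared errors)."""
    centers = [list(c) for c in centers]
    for _ in range(iters):
        labels = assign(points, centers)
        updated = []
        for j, old in enumerate(centers):
            members = [p for p, l in zip(points, labels) if l == j]
            # An empty cluster keeps its previous center.
            updated.append([sum(c) / len(members) for c in zip(*members)]
                           if members else old)
        if updated == centers:
            break
        centers = updated
    labels = assign(points, centers)
    sse = sum(sqdist(p, centers[l]) for p, l in zip(points, labels))
    return centers, labels, sse

def global_kmeans(points, k):
    """Incremental clustering: start from the overall centroid, then add one
    center at a time, trying every data point as its initial position and
    keeping the run with the lowest sum of squared errors."""
    centers = [[sum(c) / len(points) for c in zip(*points)]]
    for _ in range(1, k):
        runs = (kmeans(points, centers + [list(p)]) for p in points)
        centers = min(runs, key=lambda run: run[2])[0]
    return kmeans(points, centers)

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Linear SVM (labels in {-1, +1}) via subgradient descent on the
    L2-regularized hinge loss. A stand-in for a full SVM solver."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            violated = t * (sum(wi * xi for wi, xi in zip(w, x)) + b) < 1
            w = [wi - lr * (lam * wi - (t * xi if violated else 0.0))
                 for wi, xi in zip(w, x)]
            if violated:
                b += lr * t
    return w, b

def predict(w, b, x):
    """Class (+1 or -1) assigned by the linear decision function."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

def blob(rng, cx, cy, n=20):
    """n Gaussian points around (cx, cy); purely illustrative data."""
    return [[rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)] for _ in range(n)]

rng = random.Random(0)
data = blob(rng, -2.0, -2.0) + blob(rng, 2.0, 2.0)  # wholly unlabeled points

# Step 1: clustering generates the labels.
_, cluster_ids, _ = global_kmeans(data, k=2)
targets = [1 if c == 1 else -1 for c in cluster_ids]

# Step 2: the generated labels are used to train the classifier.
w, b = train_linear_svm(data, targets)
agreement = sum(predict(w, b, x) == t for x, t in zip(data, targets)) / len(data)
print("training agreement with cluster labels:", agreement)
```

Once trained, `predict(w, b, x)` assigns a new point `x` to one of the two generated classes. The sketch only handles two clusters because the linear SVM here is binary; more clusters would need a multi-class scheme such as one-vs-rest.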
