Data Clustering with Partial Supervision

Clustering with partial supervision applies to situations in which data are neither completely nor accurately labeled. This paper discusses a semi-supervised clustering algorithm based on a modified version of the fuzzy C-Means (FCM) algorithm. The objective function of the proposed algorithm consists of two components. The first concerns traditional unsupervised clustering, while the second tracks the relationship between classes (available labels) and the clusters generated by the first component. The balance between the two components is tuned by a scaling factor. Comprehensive experimental studies are presented. First, the discriminative ability of the proposed algorithm is discussed; then its reformulation as a classifier is addressed. The induced classifier is evaluated on completely labeled data and validated by comparison against fully supervised classifiers, namely support vector machines and neural networks. This classifier is then evaluated and compared against three semi-supervised algorithms in the context of learning from partially labeled data. In addition, the behavior of the algorithm is discussed, and the relation between classes and clusters is investigated using a linear regression model. Finally, the complexity of the algorithm is briefly discussed.
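The two-component objective described above can be illustrated with a minimal sketch. The abstract does not state the formula, so the form used here is an assumption in the general style of partially supervised FCM: a standard FCM term plus a supervision term that penalizes, for labeled points only, the gap between the current memberships and reference memberships derived from the labels. The names `partially_supervised_objective`, `F`, `labeled`, and `alpha` are hypothetical and chosen for illustration; they are not taken from the paper.

```python
def squared_distance(x, v):
    """Squared Euclidean distance between a point and a prototype."""
    return sum((xi - vi) ** 2 for xi, vi in zip(x, v))


def partially_supervised_objective(X, V, U, F, labeled, alpha, m=2.0):
    """Evaluate an assumed two-term objective (not the paper's exact formula).

    X: list of N points; V: list of C cluster prototypes.
    U: C x N memberships, U[i][k] = membership of point k in cluster i.
    F: C x N reference memberships derived from labels (assumed form).
    labeled: N booleans, True where point k carries a label.
    alpha: scaling factor balancing the two components.
    Returns (total, unsupervised_term, supervised_term).
    """
    unsupervised = 0.0
    supervised = 0.0
    for i, v in enumerate(V):
        for k, x in enumerate(X):
            d2 = squared_distance(x, v)
            # Component 1: classical FCM term.
            unsupervised += (U[i][k] ** m) * d2
            # Component 2: disagreement with label-derived memberships,
            # counted only for labeled points.
            if labeled[k]:
                supervised += ((U[i][k] - F[i][k]) ** 2) * d2
    return unsupervised + alpha * supervised, unsupervised, supervised


# Toy data: three 1-D points, two prototypes, the first two points labeled.
X = [[0.0], [1.0], [4.0]]
V = [[0.0], [4.0]]
U = [[1.0, 0.5, 0.0], [0.0, 0.5, 1.0]]
F = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
labeled = [True, True, False]

total, j_unsup, j_sup = partially_supervised_objective(X, V, U, F, labeled, alpha=0.5)
print(total, j_unsup, j_sup)
```

In this sketch, setting `alpha` to 0 recovers the plain FCM objective, while larger values give the labels more influence on the resulting partition.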
