Semi-supervised learning using multiple clusterings with limited labeled data

Supervised classification consists of learning a predictive model from a set of labeled samples. It is generally accepted that the accuracy of predictive models usually increases as more labeled samples become available. Labeled samples are generally difficult to obtain, as the labeling step is often performed manually; unlabeled samples, by contrast, are easily available. Because the labeling task is tedious and time-consuming, users generally provide a very limited number of labeled objects. However, designing approaches that work efficiently with a very limited number of labeled samples is highly challenging. In this context, semi-supervised approaches have been proposed to leverage both labeled and unlabeled data.

In this paper, we focus on cases where the number of labeled samples is very limited. We review and formalize eight semi-supervised learning algorithms and introduce a new method that combines supervised and unsupervised learning in order to use both labeled and unlabeled data. The main idea of this method is to produce new features derived from a first step of data clustering. These features are then used to enrich the description of the input data, leading to a better use of the data distribution. The efficiency of all the methods is compared on various artificial datasets, UCI datasets, and the classification of a very high resolution remote sensing image. The experiments reveal that our method shows good results, especially when the number of labeled samples is very limited. They also confirm that combining labeled and unlabeled data is very useful in pattern recognition.
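The idea of enriching the input description with cluster-derived features can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not say which clustering algorithm, feature encoding, or classifier is used, so the k-means routine, the one-hot encoding of cluster memberships, the 1-nearest-neighbour classifier, and all function names below (`kmeans`, `cluster_features`, `enrich`, `predict_1nn`) are assumptions chosen only for illustration.

```python
import numpy as np


def kmeans(X, k, rng, n_iter=20):
    """Plain Lloyd's k-means (illustrative); returns one cluster index per sample."""
    # Initialise centers with k distinct samples (fancy indexing returns a copy).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distances from every sample to every center: shape (n, k).
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels


def cluster_features(X, ks, seed=0):
    """Run one clustering per value in ks on ALL samples (labeled or not)
    and one-hot encode each sample's cluster membership."""
    rng = np.random.default_rng(seed)
    blocks = [np.eye(k)[kmeans(X, k, rng)] for k in ks]
    return np.hstack(blocks)


def enrich(X, ks=(2, 3, 4), seed=0):
    """Append the cluster-derived features to the original description."""
    return np.hstack([X, cluster_features(X, ks, seed)])


def predict_1nn(X_lab, y_lab, X_query):
    """1-nearest-neighbour prediction from a handful of labeled samples."""
    d = ((X_query[:, None, :] - X_lab[None, :, :]) ** 2).sum(axis=2)
    return y_lab[d.argmin(axis=1)]


if __name__ == "__main__":
    # Synthetic data (hypothetical): two Gaussian blobs, two labels per class.
    rng = np.random.default_rng(42)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
    y = np.repeat([0, 1], 50)
    labeled_idx = np.array([0, 1, 50, 51])

    Z = enrich(X)  # clusterings use every sample; labels are not needed here
    pred = predict_1nn(Z[labeled_idx], y[labeled_idx], Z)
    print("accuracy with enriched features:", (pred == y).mean())
```

In this sketch the clusterings are computed on the full dataset, so the unlabeled samples shape the added features even though only four samples carry labels. The raw features and the 0/1 indicators are left unscaled here; in practice their relative weighting would likely need tuning.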
