Improving Semi-Supervised Support Vector Machines Through Unlabeled Instances Selection

Semi-supervised support vector machines (S3VMs) are a kind of popular approaches which try to improve learning performance by exploiting unlabeled data. Though S3VMs have been found helpful in many situations, they may degenerate performance and the resultant generalization ability may be even worse than using the labeled data only. In this paper, we try to reduce the chance of performance degeneration of S3VMs. Our basic idea is that, rather than exploiting all unlabeled data, the unlabeled instances should be selected such that only the ones which are very likely to be helpful are exploited, while some highly risky unlabeled instances are avoided. We propose the S3VM-us method by using hierarchical clustering to select the unlabeled instances. Experiments on a broad range of data sets over eighty-eight different settings show that the chance of performance degeneration of S3VM-us is much smaller than that of existing S3VMs.

[1]  Fabio Gagliardi Cozman,et al.  Semi-Supervised Learning of Mixture Models , 2003, ICML.

[2]  Shih-Fu Chang,et al.  Graph transduction via alternating minimization , 2008, ICML '08.

[3]  Zoubin Ghahramani,et al.  Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions , 2003, ICML 2003.

[4]  Tong Zhang,et al.  The Value of Unlabeled Data for Classification Problems , 2000, ICML 2000.

[5]  Maria-Florina Balcan,et al.  Person Identification in Webcam Images: An Application of Semi-Supervised Learning , 2005 .

[6]  S. Sathiya Keerthi,et al.  Optimization Techniques for Semi-Supervised Support Vector Machines , 2008, J. Mach. Learn. Res..

[7]  Jason Weston,et al.  Large Scale Transductive SVMs , 2006, J. Mach. Learn. Res..

[8]  David J. Miller,et al.  A Mixture of Experts Classifier with Learning Based on Both Labelled and Unlabelled Data , 1996, NIPS.

[9]  Anil K. Jain,et al.  Algorithms for Clustering Data , 1988 .

[10]  Avrim Blum,et al.  Learning from Labeled and Unlabeled Data using Graph Mincuts , 2001, ICML.

[11]  Zhi-Hua Zhou,et al.  Semi-supervised learning by disagreement , 2010, Knowledge and Information Systems.

[12]  Zhi-Hua Zhou,et al.  SETRED: Self-training with Editing , 2005, PAKDD.

[13]  Zhi-Hua Zhou,et al.  Analyzing Co-training Style Algorithms , 2007, ECML.

[14]  Sanjoy Dasgupta,et al.  PAC Generalization Bounds for Co-training , 2001, NIPS.

[15]  S. Sathiya Keerthi,et al.  Branch and Bound for Semi-Supervised Support Vector Machines , 2006, NIPS.

[16]  Thorsten Joachims,et al.  Transductive Inference for Text Classification using Support Vector Machines , 1999, ICML.

[17]  Nello Cristianini,et al.  Convex Methods for Transduction , 2003, NIPS.

[18]  Ayhan Demiriz,et al.  Semi-Supervised Support Vector Machines , 1998, NIPS.

[19]  Alexander Zien,et al.  Semi-Supervised Classification by Low Density Separation , 2005, AISTATS.

[20]  Mikhail Belkin,et al.  Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples , 2006, J. Mach. Learn. Res..

[21]  Zhi-Hua Zhou,et al.  Semi-supervised learning using label mean , 2009, ICML '09.

[22]  Shai Ben-David,et al.  Does Unlabeled Data Provably Help? Worst-case Analysis of the Sample Complexity of Semi-Supervised Learning , 2008, COLT.

[23]  Junhui Wang,et al.  Large Margin Semi-supervised Learning , 2007, J. Mach. Learn. Res..

[24]  Xiaojin Zhu,et al.  --1 CONTENTS , 2006 .

[25]  Nitesh V. Chawla,et al.  Learning From Labeled And Unlabeled Data: An Empirical Study Across Techniques And Domains , 2011, J. Artif. Intell. Res..

[26]  Robert D. Nowak,et al.  Unlabeled data: Now it helps, now it doesn't , 2008, NIPS.

[27]  Zhi-Hua Zhou,et al.  Tri-training: exploiting unlabeled data using three classifiers , 2005, IEEE Transactions on Knowledge and Data Engineering.

[28]  Hinrich Schütze,et al.  Introduction to information retrieval , 2008 .

[29]  Shih-Fu Chang,et al.  Graph construction and b-matching for semi-supervised learning , 2009, ICML '09.

[30]  Avrim Blum,et al.  The Bottleneck , 2021, Monopsony Capitalism.

[31]  Larry A. Wasserman,et al.  Statistical Analysis of Semi-Supervised Regression , 2007, NIPS.

[32]  Zhi-Hua Zhou,et al.  A New Analysis of Co-Training , 2010, ICML.

[33]  Sebastian Thrun,et al.  Text Classification from Labeled and Unlabeled Documents using EM , 2000, Machine Learning.

[34]  Bernhard Schölkopf,et al.  Learning with Local and Global Consistency , 2003, NIPS.