Some contributions to semi-supervised learning

Semi-supervised learning methods attempt to improve the performance of a supervised or an unsupervised learner in the presence of "side information". This side information can take the form of unlabeled samples in the supervised case or pair-wise constraints in the unsupervised case. Most existing semi-supervised learning approaches design a new objective function, which in turn leads to a new algorithm, rather than improving the performance of an already available learner. In this thesis, three classical problems in pattern recognition and machine learning, namely classification, clustering, and unsupervised feature selection, are extended to their semi-supervised counterparts. Our first contribution is an algorithm that uses unlabeled data along with labeled data while training classifiers. Unlike previous approaches that design specialized algorithms to exploit the labeled and unlabeled data, we design a meta semi-supervised learning algorithm called SemiBoost, which wraps around the underlying supervised algorithm and improves its performance using the unlabeled data and a similarity function. Empirical evaluation on several standard datasets shows a significant improvement in the performance of well-known classifiers (decision stump, decision tree, and SVM). In the second part of this thesis, we address the problem of designing a mixture model for data clustering that can effectively use side information in the form of pair-wise constraints. Popular mixture models and related algorithms (K-means, Gaussian mixture models) make strong model assumptions and are therefore too rigid to yield different cluster partitions when side information is supplied. We propose a non-parametric mixture model for data clustering that is flexible enough to detect arbitrarily shaped clusters. Kernel density estimates are used to fit the density of each cluster. The clustering algorithm was tested on a text clustering application, where it outperformed popular clustering algorithms. Pair-wise constraints ("must-link" and "cannot-link" constraints) are used to select key parameters of the algorithm. The third part of this thesis focuses on feature selection from unlabeled data using instance-level pair-wise constraints (i.e., pairs of examples labeled as must-link or cannot-link). Using the dual-gradient descent method, we designed an efficient online algorithm. Pair-wise constraints are incorporated into the feature selection stage, giving the user the flexibility to use unsupervised or semi-supervised algorithms at a later stage. The approach was evaluated on the task of image clustering. Using pair-wise constraints, the number of features was reduced by around 80%, usually with little or no degradation in clustering performance across the datasets, and with substantial improvement on some of them.
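To make the notion of pair-wise constraints concrete, the following is a minimal sketch of how must-link and cannot-link pairs might be represented and checked against a cluster assignment. It is an illustration only, not one of the algorithms from the thesis; the function name `count_violations` and the toy data are hypothetical.

```python
from typing import Iterable, Sequence, Tuple

# A constraint is a pair of example indices (hypothetical representation).
Pair = Tuple[int, int]


def count_violations(
    labels: Sequence[int],
    must_link: Iterable[Pair],
    cannot_link: Iterable[Pair],
) -> int:
    """Count the constraints that a cluster assignment fails to satisfy."""
    violated = 0
    # A must-link pair is violated when the two examples land in different clusters.
    for i, j in must_link:
        if labels[i] != labels[j]:
            violated += 1
    # A cannot-link pair is violated when the two examples share a cluster.
    for i, j in cannot_link:
        if labels[i] == labels[j]:
            violated += 1
    return violated


# Toy data: cluster labels for five examples and a handful of constraints.
labels = [0, 0, 1, 1, 2]
must_link = [(0, 1), (2, 4)]
cannot_link = [(1, 2), (2, 3)]
print(count_violations(labels, must_link, cannot_link))
```

A count like this is one simple way constraints could be used to compare candidate partitions or parameter settings; the thesis's own use of constraints for parameter selection and feature selection may differ in its details.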
