Semi-supervised hybrid clustering by integrating Gaussian mixture model and distance metric learning

Semi-supervised clustering aims to aid and bias unsupervised clustering by employing a small amount of supervised information. This information is generally given as pairwise constraints, which are used either to modify the objective function or to learn the distance measure. Much previous work has shown that clustering algorithms based on a distance metric significantly outperform those based on probability distributions on some data sets, while the opposite holds on others, so how to balance the two approaches becomes a key problem. In this paper, we propose a semi-supervised hybrid clustering algorithm that provides a principled framework for integrating a distance metric into a Gaussian mixture model, and thus considers not only the intrinsic geometric information but also the probability distribution information of the data. Rather than using only pairwise constraints, we use the labeled data to initialize the parameters of the Gaussian distributions and to construct the weight matrix of the regularizer, and we adopt the Kullback-Leibler divergence as the “distance” measure to regularize the objective function. Experiments on several UCI data sets and real-world data sets for Chinese Word Sense Induction demonstrate the effectiveness of our semi-supervised clustering algorithm.
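
The abstract does not spell out the objective function, so the following is only a minimal sketch of the three ingredients it names, under stated assumptions: labeled points initialize the Gaussian parameters, a weight matrix is built from the labels, and a Kullback-Leibler term penalizes differing cluster posteriors for linked points. The function names, the 0/1 same-label weighting, the symmetrized form of the divergence, and the toy data are all illustrative choices and are not taken from the paper.

```python
import numpy as np

def init_gmm_from_labels(X, y, n_components, reg=1e-6):
    """Estimate mixing weights, means and covariances from labeled points only.
    y holds a class index in [0, n_components) for labeled points, -1 otherwise.
    Assumes each component has at least two labeled points."""
    d = X.shape[1]
    weights = np.empty(n_components)
    means = np.empty((n_components, d))
    covs = np.empty((n_components, d, d))
    n_labeled = np.sum(y >= 0)
    for k in range(n_components):
        Xk = X[y == k]
        weights[k] = len(Xk) / n_labeled
        means[k] = Xk.mean(axis=0)
        # Small ridge keeps the covariance invertible (an assumption, not from the paper).
        covs[k] = np.cov(Xk, rowvar=False) + reg * np.eye(d)
    return weights, means, covs

def posteriors(X, weights, means, covs):
    """Responsibilities P(k | x_i) under the mixture, shape (n_samples, n_components)."""
    n, d = X.shape
    log_p = np.empty((n, len(weights)))
    for k in range(len(weights)):
        diff = X - means[k]
        _, logdet = np.linalg.slogdet(covs[k])
        maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(covs[k]), diff)
        log_p[:, k] = np.log(weights[k]) - 0.5 * (d * np.log(2 * np.pi) + logdet + maha)
    log_p -= log_p.max(axis=1, keepdims=True)  # stabilize before exponentiating
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)

def label_weight_matrix(y):
    """Hypothetical weighting: W[i, j] = 1 if i != j share the same label, else 0."""
    labeled = y >= 0
    W = (y[:, None] == y[None, :]) & labeled[:, None] & labeled[None, :]
    np.fill_diagonal(W, False)
    return W.astype(float)

def kl_regularizer(P, W, eps=1e-12):
    """0.5 * sum_ij W[i, j] * (KL(P_i || P_j) + KL(P_j || P_i))."""
    logP = np.log(P + eps)
    # D[i, j] = sum_k P[i, k] * (log P[i, k] - log P[j, k])
    D = np.sum(P * logP, axis=1)[:, None] - P @ logP.T
    return 0.5 * np.sum(W * (D + D.T))

# Toy data: two 2-D blobs with five labeled points each (synthetic, for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y = np.full(100, -1)
y[:5] = 0
y[50:55] = 1

weights, means, covs = init_gmm_from_labels(X, y, 2)
P = posteriors(X, weights, means, covs)
W = label_weight_matrix(y)
penalty = kl_regularizer(P, W)
```

In a full algorithm, a term like `penalty` would presumably be subtracted, with some trade-off coefficient, from the mixture log-likelihood and optimized with an EM-style procedure; that step is not shown here because the abstract gives no details of it.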
