Transfer clustering via constraints generated from topics

Clustering is widely used in data mining applications such as gene-microarray analysis and natural language processing. When data samples are plentiful and well represented, traditional clustering algorithms such as K-means work well. When samples are few and poorly represented, however, applying clustering directly may yield poor results. In this paper we propose a new algorithm, TCTC (Topic-Constraint Transfer Clustering), an instance of unsupervised transfer learning that clusters a small amount of unlabeled target data with the help of plentiful, better-represented auxiliary data. First, several latent topics are extracted from the clusters of the auxiliary data. Then, the affinities between the target data samples and the topics are discovered and used to guide the clustering of the sparse target data. Finally, a semi-supervised clustering algorithm is applied to the target data. Experiments demonstrate that the method is effective for clustering sparse and poorly represented data.
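The three steps described above can be illustrated with a minimal sketch. It is not the TCTC implementation: the abstract does not specify how topics are extracted, how affinities are measured, or which semi-supervised algorithm is used. The sketch therefore assumes that auxiliary cluster centroids stand in for latent topics, that affinity is cosine similarity, that must-link and cannot-link constraints come from a confidence margin, and that the final step is a penalized K-means. The function names and the `margin` and `w` parameters are hypothetical.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's K-means; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

def topic_affinities(X_target, topics):
    """Assumed affinity: cosine similarity between samples and topics."""
    Xn = X_target / (np.linalg.norm(X_target, axis=1, keepdims=True) + 1e-12)
    Tn = topics / (np.linalg.norm(topics, axis=1, keepdims=True) + 1e-12)
    return Xn @ Tn.T

def make_constraints(aff, margin=0.1):
    """Pair up samples whose best topic beats the runner-up by `margin`."""
    top = aff.argmax(axis=1)
    srt = np.sort(aff, axis=1)
    confident = np.flatnonzero(srt[:, -1] - srt[:, -2] >= margin)
    must, cannot = [], []
    for a in range(len(confident)):
        for b in range(a + 1, len(confident)):
            i, j = int(confident[a]), int(confident[b])
            (must if top[i] == top[j] else cannot).append((i, j))
    return must, cannot

def constrained_kmeans(X, k, must, cannot, w=1.0, iters=20, seed=0):
    """K-means whose assignment cost adds `w` per violated constraint."""
    labels, centers = kmeans(X, k, iters=5, seed=seed)
    for _ in range(iters):
        for i in range(len(X)):
            cost = ((X[i] - centers) ** 2).sum(axis=1)
            for a, b in must:
                if i in (a, b):
                    other = b if i == a else a
                    cost = cost + w * (np.arange(k) != labels[other])
            for a, b in cannot:
                if i in (a, b):
                    other = b if i == a else a
                    cost = cost + w * (np.arange(k) == labels[other])
            labels[i] = cost.argmin()
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Toy data (invented for illustration): plentiful auxiliary samples,
# and only six target samples.
rng = np.random.default_rng(0)
X_aux = np.vstack([rng.normal([5.0, 0.5], 0.3, (50, 2)),
                   rng.normal([0.5, 5.0], 0.3, (50, 2))])
X_tgt = np.array([[4.0, 1.0], [5.0, 0.5], [3.0, 1.5],
                  [1.0, 4.0], [0.5, 5.0], [1.5, 3.0]])

_, topics = kmeans(X_aux, k=2)               # step 1: "topics" from auxiliary clusters
aff = topic_affinities(X_tgt, topics)        # step 2: target-to-topic affinities
must, cannot = make_constraints(aff)         #         turned into pairwise constraints
labels = constrained_kmeans(X_tgt, 2, must, cannot)  # step 3: constrained clustering
print(labels)
```

The penalty weight `w` controls how strongly the transferred constraints override the target data's own geometry; a hard-constraint variant would forbid violations outright instead of penalizing them.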
