Self-taught clustering

This paper addresses a new clustering task called self-taught clustering. Self-taught clustering is an instance of unsupervised transfer learning: it aims to cluster a small collection of unlabeled target data with the help of a large amount of unlabeled auxiliary data. The target and auxiliary data may differ in topic distribution. We show that even when the target data are insufficient for learning a high-quality feature representation, useful features can be learned with the help of the auxiliary data, and the target data can then be clustered effectively on those features. We propose a co-clustering based self-taught clustering algorithm for this problem. It clusters the target and auxiliary data simultaneously, so that the feature representation learned from the auxiliary data influences the target data through a common set of features. Under the new data representation, clustering of the target data can be improved. Our experiments on image clustering show that our algorithm can greatly outperform several state-of-the-art clustering methods when utilizing irrelevant unlabeled auxiliary data.
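
The idea of letting auxiliary data shape a common feature set can be illustrated with a small sketch. This is not the algorithm proposed in the paper: it is a deliberately simplified stand-in that replaces co-clustering with two passes of plain k-means, and it assumes both data sets are given as row-per-instance matrices over the same feature columns. The names `kmeans` and `self_taught_sketch`, and all parameters, are hypothetical and chosen only for illustration.

```python
import numpy as np


def kmeans(rows, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per row."""
    rng = np.random.default_rng(seed)
    # Initialise centers from k distinct rows (requires k <= number of rows).
    centers = rows[rng.choice(len(rows), size=k, replace=False)].astype(float)
    labels = np.zeros(len(rows), dtype=int)
    for _ in range(n_iter):
        # Squared Euclidean distance from every row to every center.
        dists = ((rows[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = rows[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels


def self_taught_sketch(target, auxiliary, n_clusters, n_feature_clusters, seed=0):
    """Simplified illustration (assumption: NOT the paper's co-clustering).

    target:    (n_target, n_features) array
    auxiliary: (n_auxiliary, n_features) array over the same features
    """
    # Step 1: group features using BOTH data sets, so the large auxiliary
    # set dominates how the common feature clusters are formed.
    stacked = np.vstack([target, auxiliary])
    feature_labels = kmeans(stacked.T, n_feature_clusters, seed=seed)

    # Step 2: re-express each target instance over the feature clusters.
    reduced = np.stack(
        [target[:, feature_labels == c].sum(axis=1)
         for c in range(n_feature_clusters)],
        axis=1,
    )

    # Step 3: cluster the target data in the new representation.
    target_labels = kmeans(reduced, n_clusters, seed=seed)
    return target_labels, feature_labels
```

In this sketch the auxiliary data never receive cluster labels of their own; they only influence how features are grouped. A co-clustering formulation of the kind the abstract describes would instead update instance clusters and feature clusters jointly.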
