Unsupervised and Semi-supervised Clustering: a Brief Survey

Clustering (or cluster analysis) aims to organize a collection of data items into clusters, such that items within a cluster are more “similar” to each other than they are to items in other clusters. This notion of similarity can be expressed in very different ways, depending on the purpose of the study, on domain-specific assumptions and on prior knowledge of the problem. Clustering is usually performed when no information is available concerning the membership of data items in predefined classes. For this reason, clustering is traditionally seen as part of unsupervised learning. We nevertheless speak here of unsupervised clustering to distinguish it from a more recent and less common approach that makes use of a small amount of supervision to “guide” or “adjust” clustering (see section 2). To support the extensive use of clustering in computer vision, pattern recognition, information retrieval, data mining, etc., a great many methods have been developed in several communities. Detailed surveys of this domain can be found in [25], [27] or [26]. In the following, we briefly review a few core concepts of cluster analysis and describe the categories of clustering methods that are best represented in the literature. We also take this opportunity to provide some pointers to more recent work on clustering.

[1]  P. Sneath,et al.  Numerical Taxonomy , 1962, Nature.

[2]  J. H. Ward Hierarchical Grouping to Optimize an Objective Function , 1963 .

[3]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[4]  Benjamin King Step-Wise Clustering Procedures , 1967 .

[5]  Charles T. Zahn,et al.  Graph-Theoretical Methods for Detecting and Describing Gestalt Clusters , 1971, IEEE Transactions on Computers.

[6]  Brian Everitt,et al.  Cluster analysis , 1974 .

[7]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM algorithm plus discussions on the paper , 1977 .

[8]  Fionn Murtagh,et al.  Structure of hierarchic clusterings: implications for information retrieval and for multivariate data analysis , 1984, Inf. Process. Manag..

[9]  Anil K. Jain,et al.  Algorithms for Clustering Data , 1988 .

[10]  Isak Gath,et al.  Unsupervised Optimal Fuzzy Clustering , 1989, IEEE Trans. Pattern Anal. Mach. Intell..

[11]  Rajesh N. Dave,et al.  Use Of The Adaptive Fuzzy Clustering Algorithm To Detect Lines In Digital Images , 1990, Other Conferences.

[12]  Gerardo Beni,et al.  A Validity Measure for Fuzzy Clustering , 1991, IEEE Trans. Pattern Anal. Mach. Intell..

[13]  Ali S. Hadi,et al.  Finding Groups in Data: An Introduction to Cluster Analysis , 1991 .

[14]  A. Raftery,et al.  Model-based Gaussian and non-Gaussian clustering , 1993 .

[15]  Gérard Govaert,et al.  Gaussian parsimonious clustering models , 1995, Pattern Recognit..

[16]  Hans-Peter Kriegel,et al.  A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise , 1996, KDD.

[17]  Rajesh N. Davé,et al.  Robust clustering methods: a unified view , 1997, IEEE Trans. Fuzzy Syst..

[18]  Hichem Frigui,et al.  Clustering by competitive agglomeration , 1997, Pattern Recognit..

[19]  Joshua Zhexue Huang,et al.  A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining , 1997, DMKD.

[20]  Dimitrios Gunopulos,et al.  Automatic subspace clustering of high dimensional data for data mining applications , 1998, SIGMOD '98.

[21]  Sudipto Guha,et al.  CURE: an efficient clustering algorithm for large databases , 1998, SIGMOD '98.

[22]  James C. Bezdek,et al.  Some new indexes of cluster validity , 1998, IEEE Trans. Syst. Man Cybern. Part B.

[23]  Daniel A. Keim,et al.  An Efficient Approach to Clustering in Large Multimedia Databases with Noise , 1998, KDD.

[24]  Ayhan Demiriz,et al.  Semi-Supervised Clustering Using Genetic Algorithms , 1999 .

[25]  Hans-Peter Kriegel,et al.  OPTICS: ordering points to identify the clustering structure , 1999, SIGMOD '99.

[26]  Anil K. Jain,et al.  Data clustering: a review , 1999, CSUR.

[27]  Hichem Frigui,et al.  A Robust Competitive Clustering Algorithm With Applications in Computer Vision , 1999, IEEE Trans. Pattern Anal. Mach. Intell..

[28]  Claire Cardie,et al.  Clustering with Instance-Level Constraints , 2000, AAAI/IAAI.

[29]  Hava T. Siegelmann,et al.  Support Vector Clustering , 2002, J. Mach. Learn. Res..

[30]  Michael I. Jordan,et al.  On Spectral Clustering: Analysis and an algorithm , 2001, NIPS.

[31]  Nozha Boujemaa,et al.  Unsupervised robust clustering for image database categorization , 2002, Object recognition supported by user interaction for service robots.

[32]  Dan Klein,et al.  From Instance-level Constraints to Space-Level Constraints: Making the Most of Prior Knowledge in Data Clustering , 2002, ICML.

[33]  Michael I. Jordan,et al.  Distance Metric Learning with Application to Clustering with Side-Information , 2002, NIPS.

[34]  Arindam Banerjee,et al.  Semi-supervised Clustering by Seeding , 2002, ICML.

[35]  Raymond J. Mooney,et al.  Adaptive duplicate detection using learnable string similarity measures , 2003, KDD '03.

[36]  Andrew McCallum,et al.  Semi-Supervised Clustering with User Feedback , 2003 .

[37]  R. Mooney,et al.  Comparing and Unifying Search-Based and Similarity-Based Approaches to Semi-Supervised Clustering , 2003 .