CLUSTERING AND CLUSTER VALIDATION IN DATA MINING

Clustering is one of the fundamental operations in data mining and is widely used to solve business problems such as customer segmentation and fraud detection. Real applications of clustering require three tasks: partitioning a data set into clusters, validating the clustering results, and interpreting the clusters. Many clustering algorithms have been designed for the first task, but few techniques are available for cluster validation in data mining. The third task is application dependent and requires domain knowledge to understand the clusters. In this paper, we present a few techniques for the first two tasks. We first discuss the family of k-means type algorithms, which are the algorithms most commonly used in data mining. We then present a visual method for cluster validation, based on the Fastmap data projection algorithm and an enhancement of it. Finally, we present a method that combines a clustering algorithm with the visual cluster validation method to build classification models interactively.
