A customizable hybrid approach to data clustering

Most current data clustering algorithms in data mining are based on a distance calculation in some metric space. In Spatial Database Systems (SDBS), the Euclidean distance between two data points is often used to represent the relationship between them. However, in some spatial settings and many other applications, distance alone cannot capture all the attributes of the relation between data points; a more powerful model is needed to record richer relational information between data objects. This paper adopts a graph model in which a database is regarded as a graph: each vertex represents a data point, and each edge, weighted or unweighted, records the relation between the two data points it connects. Based on the graph model, this paper presents a set of cluster analysis criteria to guide data clustering. The criteria can be used to evaluate clustering results and help improve the quality of clustering. Further, a customizable algorithm using the criteria is proposed and implemented. This algorithm can produce clusters according to users' specifications. Preliminary experiments show encouraging results.
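
The graph model described above can be illustrated with a minimal Python sketch. The abstract does not state the paper's actual criteria, so the function `intra_cluster_weight_ratio` below is a hypothetical stand-in: it scores a clustering by the share of total edge weight that falls inside clusters. The names `build_graph`, `edges`, and the example data are likewise assumptions made for illustration only.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected weighted graph as vertex -> {neighbor: weight}.

    Each vertex stands for a data point; each edge records a relation
    between two data points (the weight here is an assumed relation strength).
    """
    graph = defaultdict(dict)
    for u, v, weight in edges:
        graph[u][v] = weight
        graph[v][u] = weight
    return graph


def intra_cluster_weight_ratio(graph, clusters):
    """Hypothetical criterion (not from the paper): the fraction of total
    edge weight carried by edges whose endpoints share a cluster."""
    label = {v: i for i, cluster in enumerate(clusters) for v in cluster}
    inside = 0.0
    total = 0.0
    for u, neighbors in graph.items():
        for v, weight in neighbors.items():
            if u < v:  # visit each undirected edge once
                total += weight
                if u in label and label[u] == label.get(v):
                    inside += weight
    return inside / total if total else 0.0


# Assumed toy data: a tight triangle a-b-c loosely linked to the pair d-e.
edges = [
    ("a", "b", 1.0),
    ("b", "c", 1.0),
    ("a", "c", 1.0),
    ("c", "d", 0.5),
    ("d", "e", 1.0),
]
graph = build_graph(edges)

good = intra_cluster_weight_ratio(graph, [{"a", "b", "c"}, {"d", "e"}])
bad = intra_cluster_weight_ratio(graph, [{"a", "b"}, {"c", "d", "e"}])
```

Under this assumed criterion, the clustering that keeps the triangle together scores higher than the one that splits it, which is the kind of comparison a user-selectable criterion could drive in a customizable algorithm.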
