A Shrinking-Based Approach for Multi-Dimensional Data Analysis

Existing data analysis techniques have difficulty in handling multi-dimensional data. In this paper, we first present a novel data preprocessing technique called shrinking which optimizes the inner structure of data inspired by the Newton's Universal Law of Gravitation[22] in the real world. This data reorganization concept can be applied in many fields such as pattern recognition, data clustering and signal processing. Then, as an important application of the data shrinking preprocessing, we propose a shrinking-based approach for multi-dimensional data analysis which consists of three steps: data shrinking, cluster detection, and cluster evaluation and selection. The process of data shrinking moves data points along the direction of the density gradient, thus generating condensed, widely-separated clusters. Following data shrinking, clusters are detected by finding the connected components of dense cells. The data-shrinking and cluster-detection steps are conducted on a sequence of grids with different cell sizes. The clusters detected at these scales are compared by a cluster-wise evaluation measurement, and the best clusters are selected as the final result. The experimental results show that this approach can effectively and efficiently detect clusters in both low- and high-dimensional spaces.

[1]  Narendra Ahuja,et al.  Dot Pattern Processing Using Voronoi Neighborhoods , 1982, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[2]  Hans-Peter Kriegel,et al.  A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise , 1996, KDD.

[3]  Philip S. Yu,et al.  Fast algorithms for projected clustering , 1999, SIGMOD '99.

[4]  L. Cowen,et al.  Randomized Nonlinear Projections Uncover High-Dimensional Structure , 1997 .

[5]  Hans-Peter Kriegel,et al.  Optimal multi-step k-nearest neighbor search , 1998, SIGMOD '98.

[6]  Farn Chen,et al.  THE VALIDITY MEASUREMENT OF FUZZY C- MEANS CLASSIFIER FOR REMOTELY SENSED IMAGES , 2001 .

[7]  Warren. Viessman Introduction to hydrology , 1972 .

[8]  Sudipto Guha,et al.  ROCK: a robust clustering algorithm for categorical attributes , 1999, Proceedings 15th International Conference on Data Engineering (Cat. No.99CB36337).

[9]  Aidong Zhang,et al.  WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases , 1998, VLDB.

[10]  Dimitrios Gunopulos,et al.  Automatic subspace clustering of high dimensional data for data mining applications , 1998, SIGMOD '98.

[11]  Daniel A. Keim,et al.  An Efficient Approach to Clustering in Large Multimedia Databases with Noise , 1998, KDD.

[12]  H. L. Le Roy,et al.  Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability; Vol. IV , 1969 .

[13]  Anil K. Jain,et al.  Data clustering: a review , 1999, CSUR.

[14]  Michalis Vazirgiannis,et al.  Clustering algorithms and validity measures , 2001, Proceedings Thirteenth International Conference on Scientific and Statistical Database Management. SSDBM 2001.

[15]  Michalis Vazirgiannis,et al.  A Data Set Oriented Approach for Clustering Algorithm Selection , 2001, PKDD.

[16]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[17]  Jiawei Han,et al.  Efficient and Effective Clustering Methods for Spatial Data Mining , 1994, VLDB.

[18]  Ali S. Hadi,et al.  Finding Groups in Data: An Introduction to Chster Analysis , 1991 .

[19]  Thomas C. Redman,et al.  Data Quality Management and Technology , 1992 .

[20]  Tian Zhang,et al.  BIRCH: an efficient data clustering method for very large databases , 1996, SIGMOD '96.

[21]  Michalis Vazirgiannis,et al.  Clustering validity assessment: finding the optimal partitioning of a data set , 2001, Proceedings 2001 IEEE International Conference on Data Mining.

[22]  Hans-Peter Kriegel,et al.  OPTICS: ordering points to identify the clustering structure , 1999, SIGMOD '99.

[23]  Peter J. Haas,et al.  The New Jersey Data Reduction Report , 1997 .

[24]  Sudipto Guha,et al.  CURE: an efficient clustering algorithm for large databases , 1998, SIGMOD '98.

[25]  Jiong Yang,et al.  STING: A Statistical Information Grid Approach to Spatial Data Mining , 1997, VLDB.