Distributed Clustering Using Collective Principal Component Analysis

Abstract. This paper considers distributed clustering of high-dimensional heterogeneous data using a distributed principal component analysis (PCA) technique called Collective PCA. It presents the Collective PCA technique, which can be used independently of the clustering application, and shows how to integrate it with an off-the-shelf clustering algorithm to obtain a distributed clustering technique. It also reports experimental results on several test data sets, including a web mining application.
