Hierarchical Clustering Via Localized Diffusion Folders

Data clustering is a common technique for statistical data analysis. It is used in many fields, including machine learning, data mining, customer segmentation, trend analysis, pattern recognition and image analysis. The proposed Localized Diffusion Folders methodology performs hierarchical clustering of high-dimensional datasets. Diffusion folders are a multi-level partitioning of the data into local neighborhoods. They are generated by several random selections of data points and folders in a diffusion graph and by defining local diffusion distances between them. This multi-level partitioning defines an improved localized geometry of the data and a localized Markov transition matrix that is used for the next time step in the diffusion process. The result is a bottom-up hierarchical clustering of the data in which each level of the hierarchy contains localized diffusion folders made up of folders from the lower levels. The methodology preserves the local neighborhood of each point while eliminating noisy connections between distinct points and areas in the graph. The performance of the algorithms is demonstrated on real data and compared with existing methods.
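
The ingredients named in the abstract (a Markov transition matrix on a diffusion graph, diffusion distances, and folders formed around randomly selected points) can be illustrated with a minimal single-level Python sketch. This is not the authors' algorithm: the Gaussian kernel, the function names `markov_matrix`, `diffusion_distances` and `random_folders`, and the parameters `eps`, `t` and `n_seeds` are all assumptions made for illustration, and the sketch omits the localized matrix rebuilt at each level and the repeated random selections.

```python
import numpy as np

def markov_matrix(X, eps):
    """Row-stochastic transition matrix from a Gaussian kernel (assumed kernel)."""
    # Pairwise squared Euclidean distances between all points.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq / eps)          # affinity (weight) matrix of the graph
    d = W.sum(axis=1)              # node degrees
    P = W / d[:, None]             # each row sums to 1
    pi = d / d.sum()               # stationary distribution of P
    return P, pi

def diffusion_distances(P, pi, t=1):
    """Pairwise diffusion distances after t steps of the random walk."""
    Pt = np.linalg.matrix_power(P, t)
    diff = Pt[:, None, :] - Pt[None, :, :]
    # Weighted L2 distance between rows of P^t, with weights 1 / pi.
    return np.sqrt(np.sum(diff ** 2 / pi[None, None, :], axis=-1))

def random_folders(D, n_seeds, rng):
    """Assign every point to the nearest of n_seeds randomly chosen points."""
    seeds = rng.choice(D.shape[0], size=n_seeds, replace=False)
    labels = np.argmin(D[:, seeds], axis=1)
    return seeds, labels

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two small synthetic blobs; purely illustrative data.
    X = np.vstack([rng.normal(0, 0.1, (5, 2)), rng.normal(5, 0.1, (5, 2))])
    P, pi = markov_matrix(X, eps=1.0)
    D = diffusion_distances(P, pi, t=2)
    seeds, labels = random_folders(D, n_seeds=2, rng=rng)
    print("seeds:", seeds)
    print("folder labels:", labels)
```

A hierarchical variant would treat the resulting folders as the nodes of a coarser graph and repeat the construction. How the paper defines affinities between folders is not stated in the abstract, so that step is left out here.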
