Paper information - Clustering large datasets with kernel methods

Clustering large datasets with kernel methods

Real-life datasets are becoming larger and less linearly separable. Divisive clustering methods whose computation time is linear in the number of samples n can handle large datasets, but they mostly assume linear boundaries between clusters in input space. Kernel-based clustering methods can detect nonlinear boundaries in feature space, but their computation time is quadratic, O(n^2). In this paper, we propose a meta-algorithm that distributes small subsets of the large dataset, clusters these subsets in parallel, and merges the resulting approximate pseudo-centres, repeating these steps until the whole dataset has been processed. The meta-algorithm can use a wide range of kernel-based clustering methods; here we integrate Kernel Fuzzy C-Means and Relational Neural Gas. We show analytically that the algorithm has linear computation time O(n). In the experiments, we empirically evaluate the performance of the method on two real-life datasets.
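The chunk-cluster-merge idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it makes several assumptions. Plain kernel k-means with an RBF kernel stands in for Kernel Fuzzy C-Means or Relational Neural Gas. A pseudo-centre is taken to be the cluster member closest to the cluster mean in feature space. The chunks are processed sequentially, although they are independent and could be clustered in parallel. The names `rbf`, `feature_dists`, `kernel_kmeans_centres`, and `cluster_in_chunks` are hypothetical.

```python
import math

def rbf(x, y, gamma=1.0):
    """RBF kernel between two points given as tuples (assumed kernel choice)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def feature_dists(K, members, n):
    """Squared feature-space distance of every point to the mean of `members`."""
    size = len(members)
    within = sum(K[j][l] for j in members for l in members) / size ** 2
    return [K[i][i] - 2.0 * sum(K[i][j] for j in members) / size + within
            for i in range(n)]

def kernel_kmeans_centres(points, k, gamma=1.0, iters=20):
    """Kernel k-means on one small chunk; returns at most k pseudo-centres."""
    n = len(points)
    k = min(k, n)
    # The Gram matrix is quadratic in the chunk size, not in the dataset size.
    K = [[rbf(p, q, gamma) for q in points] for p in points]
    # Farthest-first seeding: add the point least similar to the chosen seeds.
    seeds = [0]
    while len(seeds) < k:
        seeds.append(min((i for i in range(n) if i not in seeds),
                         key=lambda i: max(K[i][s] for s in seeds)))
    labels = [max(range(k), key=lambda c: K[i][seeds[c]]) for i in range(n)]
    for _ in range(iters):
        dists = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            dists.append(feature_dists(K, members, n) if members
                         else [math.inf] * n)
        new_labels = [min(range(k), key=lambda c: dists[c][i]) for i in range(n)]
        if new_labels == labels:
            break
        labels = new_labels
    # Pseudo-centre: the member nearest to its cluster mean in feature space.
    centres = []
    for c in range(k):
        members = [i for i in range(n) if labels[i] == c]
        if members:
            d = feature_dists(K, members, n)
            centres.append(points[min(members, key=lambda i: d[i])])
    return centres

def cluster_in_chunks(data, k, chunk_size=40, gamma=1.0):
    """Cluster fixed-size chunks, pool their pseudo-centres, and repeat."""
    if k >= chunk_size:
        raise ValueError("k must be smaller than chunk_size")
    pool = list(data)
    while len(pool) > chunk_size:
        chunks = [pool[s:s + chunk_size]
                  for s in range(0, len(pool), chunk_size)]
        # Each chunk is independent, so this loop could run in parallel.
        pool = [c for chunk in chunks
                for c in kernel_kmeans_centres(chunk, k, gamma)]
    return kernel_kmeans_centres(pool, k, gamma)

# Usage: two well-separated synthetic groups, interleaved in the input.
data = []
for i in range(100):
    data.append((0.1 * (i % 5), 0.1 * (i % 7)))
    data.append((10 + 0.1 * (i % 5), 10 + 0.1 * (i % 7)))
centres = cluster_in_chunks(data, k=2, chunk_size=40, gamma=0.5)
print(centres)
```

Under these assumptions, the first pass builds n/m Gram matrices of size m x m for chunk size m, which costs on the order of n·m kernel evaluations instead of n^2. Later passes work only on the much smaller pool of pseudo-centres.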

Friedhelm Schwenker | Stefan Faußer
