Real-life datasets are becoming larger and less linearly separable. Divisive clustering methods whose computation time is linear in the number of samples n can handle large data, but most of them assume linear boundaries between the clusters in input space. Kernel-based clustering methods can detect nonlinear boundaries in feature space, but their computation time is quadratic, O(n^2). In this paper, we propose a meta-algorithm that distributes small subsets of the large dataset, clusters these subsets in parallel, and merges the resulting approximate pseudo-centres, repeating these steps until the whole dataset has been processed. The meta-algorithm can be used with a wide range of kernel-based clustering methods; here we integrate Kernel Fuzzy C-Means and Relational Neural Gas. We show analytically that the algorithm has a linear computation time, O(n). In the experiments, we empirically evaluate the performance of the method on two real-life datasets.
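The subset-then-merge idea can be illustrated with a minimal sketch. It is not the authors' implementation: plain weighted kernel k-means stands in for Kernel Fuzzy C-Means and Relational Neural Gas, each pseudo-centre is approximated by the sample closest to its cluster mean in feature space, and subsets are processed one after another rather than in parallel. All function names and the parameters `chunk` and `gamma` are hypothetical. Under these assumptions every clustering call sees at most `k + chunk` points, so the total cost grows linearly in n for a fixed chunk size.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Gaussian kernel matrix between the rows of A and the rows of B.
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def weighted_kernel_kmeans(K, w, k, iters=20, seed=0):
    # Stand-in clusterer (assumption): weighted kernel k-means on a
    # precomputed kernel matrix K with sample weights w.
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    labels = rng.integers(0, k, size=n)
    for _ in range(iters):
        dist = np.full((n, k), np.inf)
        for c in range(k):
            mask = labels == c
            if not mask.any():
                continue  # empty cluster: its distance column stays infinite
            wc = w[mask] / w[mask].sum()
            # ||phi(x_i) - m_c||^2 expanded in terms of kernel values
            dist[:, c] = (np.diag(K) - 2.0 * K[:, mask] @ wc
                          + wc @ K[np.ix_(mask, mask)] @ wc)
        new = dist.argmin(1)
        if np.array_equal(new, labels):
            break
        labels = new
    return labels, dist

def pseudo_centres(X, w, k, gamma):
    # Represent each cluster by its member closest to the feature-space
    # mean (an approximation chosen for this sketch) and by its total weight.
    K = rbf_kernel(X, X, gamma)
    labels, dist = weighted_kernel_kmeans(K, w, k)
    centres, weights = [], []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        centres.append(X[idx[dist[idx, c].argmin()]])
        weights.append(w[idx].sum())
    return np.array(centres), np.array(weights)

def patch_cluster(X, k, chunk=200, gamma=1.0):
    # Merge the current weighted pseudo-centres with the next subset and
    # re-cluster, until the whole dataset has been processed.
    centres = np.empty((0, X.shape[1]))
    weights = np.empty(0)
    for start in range(0, len(X), chunk):
        patch = X[start:start + chunk]
        P = np.vstack([centres, patch])
        w = np.concatenate([weights, np.ones(len(patch))])
        centres, weights = pseudo_centres(P, w, k, gamma)
    return centres, weights
```

For example, `centres, weights = patch_cluster(X, k=2, chunk=100, gamma=0.5)` returns at most two pseudo-centres whose weights sum to the number of rows in `X`. A parallel variant could cluster several subsets at once and merge their pseudo-centres in the same way.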