Clustering large datasets with kernel methods

Real-life datasets are becoming larger and less linearly separable. Divisive clustering methods with a computation time linear in the number of samples n can handle large data, but mostly assume linear boundaries between the clusters in input space. Kernel-based clustering methods are able to detect nonlinear boundaries in feature space, but have a quadratic computation time O(n^2). In this paper, we propose a meta-algorithm that distributes small subsets of the large dataset, clusters these subsets in parallel, and merges the resulting approximate pseudo-centres, repeating until the whole dataset has been processed. The meta-algorithm can use a wide range of kernel-based clustering methods. Here we integrate Kernel Fuzzy C-Means and Relational Neural Gas. We show analytically that the algorithm has a linear computation time O(n). In the experiments, we empirically evaluate the performance of the method on two real-life datasets.
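The subset-wise scheme described above can be illustrated with a minimal sketch. It is not the authors' implementation. It makes several assumptions: a plain hard kernel k-means stands in for Kernel Fuzzy C-Means and Relational Neural Gas, the kernel is an RBF kernel, subsets are processed sequentially rather than in parallel, and each pseudo-centre is approximated by the single sample closest to the cluster mean in feature space. All function names and parameters (rbf_kernel, kernel_kmeans, chunked_kernel_clustering, chunk_size, gamma) are hypothetical.

```python
import numpy as np


def rbf_kernel(A, B, gamma=1.0):
    """RBF kernel matrix between the rows of A and the rows of B."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def feature_space_distances(K, labels, k):
    """Squared feature-space distance of every sample to every cluster mean."""
    n = K.shape[0]
    dist = np.full((n, k), np.inf)  # empty clusters keep infinite distance
    for c in range(k):
        mask = labels == c
        m = mask.sum()
        if m == 0:
            continue
        # ||phi(x) - mu_c||^2 = K_xx - (2/m) sum_j K_xj + (1/m^2) sum_jl K_jl
        dist[:, c] = (np.diag(K)
                      - 2 * K[:, mask].sum(1) / m
                      + K[np.ix_(mask, mask)].sum() / m**2)
    return dist


def kernel_kmeans(K, k, n_iter=20, rng=None):
    """Hard kernel k-means on a precomputed kernel matrix (stand-in method)."""
    rng = np.random.default_rng(rng)
    labels = rng.integers(0, k, size=K.shape[0])
    for _ in range(n_iter):
        new = feature_space_distances(K, labels, k).argmin(1)
        if np.array_equal(new, labels):
            break
        labels = new
    return labels


def chunked_kernel_clustering(X, k, chunk_size=200, gamma=1.0, seed=0):
    """Cluster X subset by subset, carrying approximate pseudo-centres forward."""
    rng = np.random.default_rng(seed)
    prototypes = np.empty((0, X.shape[1]))
    for start in range(0, len(X), chunk_size):
        # Merge step: previous prototypes join the next subset.
        block = np.vstack([prototypes, X[start:start + chunk_size]])
        K = rbf_kernel(block, block, gamma)  # at most (chunk_size + k)^2 entries
        labels = kernel_kmeans(K, k, rng=rng)
        dist = feature_space_distances(K, labels, k)
        protos = []
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:
                # Assumption: approximate the pseudo-centre by the member
                # closest to the cluster mean in feature space.
                protos.append(block[members[dist[members, c].argmin()]])
        prototypes = np.array(protos)
    return prototypes


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 0.3, (150, 2)), rng.normal(5, 0.3, (150, 2))])
    print(chunked_kernel_clustering(X, k=2, chunk_size=50))
```

In this sketch each subset costs O(chunk_size^2) kernel evaluations and there are about n / chunk_size subsets, so the total cost grows as O(n * chunk_size), which is linear in n for a fixed subset size.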