Adaptive non-linear clustering in data streams

Data stream clustering has emerged as a challenging and interesting problem over the past few years. Because streams evolve over time and the data stream model imposes a one-pass restriction, traditional clustering algorithms are inapplicable to stream clustering. The problem becomes even more challenging when the data is high-dimensional and the clusters are not linearly separable in the input space. In this paper, we propose a nonlinear stream clustering algorithm that adapts to the stream's evolutionary changes. Using kernel methods to deal with the nonlinearity of data separation, we propose a novel 2-tier stream clustering architecture. Tier-1 captures the temporal locality in the stream by partitioning it into segments, using a kernel-based novelty detection approach. Tier-2 exploits this segment structure to continuously project the streaming data nonlinearly onto a low-dimensional space (LDS) before assigning each point to a cluster. We demonstrate the effectiveness of our approach through extensive experimental evaluation on various real-world datasets.
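The Tier-2 idea of projecting points nonlinearly onto a low-dimensional space and then assigning them to a cluster can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes a plain batch kernel PCA with an RBF kernel fitted on one stored segment, and a nearest-centroid assignment rule. The names `rbf_kernel`, `SegmentProjector`, `assign`, and the parameters `gamma` and `n_components` are hypothetical and chosen only for illustration.

```python
import numpy as np


def rbf_kernel(a, b, gamma):
    """Pairwise RBF kernel exp(-gamma * ||a_i - b_j||^2)."""
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


class SegmentProjector:
    """Batch kernel PCA fitted on one segment (assumed stand-in for the LDS)."""

    def __init__(self, segment, n_components=2, gamma=1.0):
        self.segment = np.asarray(segment, dtype=float)
        self.gamma = gamma
        K = rbf_kernel(self.segment, self.segment, gamma)
        # Center the kernel matrix in feature space.
        self.col_mean = K.mean(axis=0)
        self.total_mean = K.mean()
        Kc = K - self.col_mean[None, :] - self.col_mean[:, None] + self.total_mean
        # Keep the leading eigenpairs of the centered kernel matrix.
        vals, vecs = np.linalg.eigh(Kc)
        idx = np.argsort(vals)[::-1][:n_components]
        self.vals = vals[idx]
        # Scale so the feature-space principal axes have unit norm.
        self.alphas = vecs[:, idx] / np.sqrt(self.vals)

    def project(self, x):
        """Project new points onto the segment's low-dimensional space."""
        x = np.atleast_2d(np.asarray(x, dtype=float))
        k = rbf_kernel(x, self.segment, self.gamma)
        kc = k - k.mean(axis=1, keepdims=True) - self.col_mean[None, :] + self.total_mean
        return kc @ self.alphas


def assign(z, centroids):
    """Index of the nearest centroid for each projected point."""
    d = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)


# Usage: fit on a segment, then project and assign an arriving point.
rng = np.random.default_rng(0)
segment = rng.normal(size=(50, 10))
proj = SegmentProjector(segment, n_components=2, gamma=0.1)
centroids = proj.project(segment[:3])      # hypothetical cluster centers
label = assign(proj.project(rng.normal(size=(1, 10))), centroids)
```

A streaming method would update the eigen-decomposition incrementally instead of refitting it, and would start a new segment when a novelty detector fires; both steps are omitted here.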
