A fuzzy c-means variant for clustering evolving data streams

Clustering algorithms for streaming data sets are gaining importance due to the availability of large data streams from different sources. Recently, a number of streaming algorithms have been proposed that build on crisp algorithms such as hard c-means or its variants. The crisp cases may not generalize easily to fuzzy cases, because the two groups of algorithms optimize different objective functions. In this paper, we propose a streaming variant of the fuzzy c-means algorithm. At any stage of processing, a good streaming algorithm should be able to summarize the data seen so far and also respond to evolving distributions. We study the tradeoff between summarizing the data seen and responding to an evolving distribution by varying the amount of history used by the streaming algorithm. Empirical evaluation on both artificial and real data sets under a noisy setting shows the effectiveness of our algorithm.
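The tradeoff between summarization and responsiveness can be illustrated with a minimal sketch. It is not the algorithm proposed in the paper, whose details the abstract does not give. It assumes one plausible scheme: each chunk is clustered with a weighted fuzzy c-means, and the centroids of the last `history` chunks are fed into the next chunk as weighted points. The function names, the `history` parameter, the deterministic initialization, and the choice to weight each centroid by the membership mass of its own chunk are all assumptions made for illustration.

```python
def memberships(points, centers, m):
    """Fuzzy memberships u[k][i] of point k in cluster i; each row sums to 1."""
    exp = 2.0 / (m - 1.0)
    rows = []
    for p in points:
        # Euclidean distances, clamped so a point lying on a center cannot divide by zero.
        d = [max(sum((a - b) ** 2 for a, b in zip(p, v)) ** 0.5, 1e-12) for v in centers]
        rows.append([1.0 / sum((di / dk) ** exp for dk in d) for di in d])
    return rows


def weighted_fcm(points, weights, c, m=2.0, iters=100):
    """Weighted fuzzy c-means; requires len(points) >= c."""
    # Assumption: simple deterministic initialization from evenly spaced points.
    centers = [list(points[i * len(points) // c]) for i in range(c)]
    dim = len(points[0])
    for _ in range(iters):
        u = memberships(points, centers, m)
        for i in range(c):
            # Each point contributes its weight times its membership raised to m.
            coef = [w * row[i] ** m for w, row in zip(weights, u)]
            total = sum(coef)
            centers[i] = [sum(q * p[j] for q, p in zip(coef, points)) / total
                          for j in range(dim)]
    return centers


def stream_fcm(chunks, c, history=1, m=2.0):
    """Cluster a stream chunk by chunk (each chunk is a list of points).

    history = 0 forgets everything before the current chunk; larger values
    carry more summarized past data into each new clustering step.
    """
    past = []  # one (centers, masses) summary per processed chunk
    centers, mass = None, None
    for chunk in chunks:
        pts = [tuple(p) for p in chunk]
        wts = [1.0] * len(pts)
        recent = past[-history:] if history > 0 else []
        for old_centers, old_mass in recent:
            pts += [tuple(v) for v in old_centers]
            wts += old_mass
        centers = weighted_fcm(pts, wts, c, m)
        # Summarize only the raw points of this chunk, so summaries do not
        # double count data that older summaries already represent.
        u = memberships(pts[:len(chunk)], centers, m)
        mass = [sum(row[i] for row in u) for i in range(c)]
        past.append((centers, mass))
    return centers, mass
```

A call such as `stream_fcm(chunks, c=2, history=0)` tracks only the most recent chunk and so reacts immediately to a shifted distribution, while a larger `history` lets older chunks keep influencing the centroids. Under the assumptions above, that is one way to vary "the amount of history used".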
