Effective Evaluation Measures for Subspace Clustering of Data Streams

Nowadays, most streaming data sources are becoming high-dimensional. Accordingly, subspace stream clustering, which aims at finding evolving clusters within subgroups of dimensions, has gained significant importance. However, existing subspace clustering evaluation measures are mainly designed for static data and cannot reflect clustering quality under the evolving nature of data streams. On the other hand, available stream clustering evaluation measures consider only the errors of the full-space clustering, not the quality of the subspace clustering. In this paper, we propose what is, to the best of our knowledge, the first subspace clustering measure designed for streaming data, called SubCMM: Subspace Cluster Mapping Measure. SubCMM is an effective evaluation measure for stream subspace clustering that is able to handle errors caused by emerging, moving, or splitting subspace clusters. Additionally, we propose a novel method for using available offline subspace clustering measures for data streams within the Subspace MOA framework.
