External evaluation measures for subspace clustering

Knowledge discovery in databases requires not only development of novel mining techniques but also fair and comparable quality assessment based on objective evaluation measures. Especially in young research areas where no common measures are available, researchers are unable to provide a fair evaluation. Typically, publications glorify the high quality of one approach only justified by an arbitrary evaluation measure. However, such conclusions can only be drawn if the evaluation measures themselves are fully understood. In this paper, we provide the basis for systematic evaluation in the emerging research area of subspace clustering. We formalize general quality criteria for subspace clustering measures not yet addressed in the literature. We compare the existing external evaluation methods based on these criteria and pinpoint limitations. We propose a novel external evaluation measure which meets the requirements in form of quality properties. In thorough experiments we empirically show characteristic properties of evaluation measures. Overall, we provide a set of evaluation measures that fulfill the general quality criteria as recommendation for future evaluations. All measures and datasets are provided on our website and are integrated in our evaluation framework.

[1]  Alok N. Choudhary,et al.  Adaptive Grids for Clustering Massive Data Sets , 2001, SDM.

[2]  Albrecht Zimmermann,et al.  The Chosen Few: On Identifying Valuable Patterns , 2007, Seventh IEEE International Conference on Data Mining (ICDM 2007).

[3]  Hui Xiong,et al.  Understanding of Internal Clustering Validation Measures , 2010, 2010 IEEE International Conference on Data Mining.

[4]  Ira Assent,et al.  DUSC: Dimensionality Unbiased Subspace Clustering , 2007, Seventh IEEE International Conference on Data Mining (ICDM 2007).

[5]  Dimitrios Gunopulos,et al.  Automatic subspace clustering of high dimensional data for data mining applications , 1998, SIGMOD '98.

[6]  Huan Liu,et al.  Subspace clustering for high dimensional data: a review , 2004, SKDD.

[7]  T. M. Murali,et al.  A Monte Carlo algorithm for fast projective clustering , 2002, SIGMOD '02.

[8]  Julio Gonzalo,et al.  A comparison of extrinsic clustering evaluation metrics based on formal constraints , 2008, Information Retrieval.

[9]  Philip S. Yu,et al.  Fast algorithms for projected clustering , 1999, SIGMOD '99.

[10]  Man Lung Yiu,et al.  Frequent-pattern based iterative projected clustering , 2003, Third IEEE International Conference on Data Mining.

[11]  Jörg Sander,et al.  Finding non-redundant, statistically significant regions in high dimensional data: a novel approach to projected and subspace clustering , 2008, KDD.

[12]  Hans-Peter Kriegel,et al.  Density-Connected Subspace Clustering for High-Dimensional Data , 2004, SDM.

[13]  Hui Xiong,et al.  Adapting the right measures for K-means clustering , 2009, KDD.

[14]  Philip S. Yu,et al.  Finding generalized projected clusters in high dimensional spaces , 2000, SIGMOD '00.

[15]  Marina Meila,et al.  Comparing subspace clusterings , 2006, IEEE Transactions on Knowledge and Data Engineering.

[16]  Ira Assent,et al.  Evaluating Clustering in Subspace Projections of High Dimensional Data , 2009, Proc. VLDB Endow..

[17]  Ira Assent,et al.  INSCY: Indexing Subspace Clusters with In-Process-Removal of Redundancy , 2008, 2008 Eighth IEEE International Conference on Data Mining.

[18]  Hans-Peter Kriegel,et al.  Subspace and projected clustering: experimental evaluation and analysis , 2009, Knowledge and Information Systems.

[19]  Ira Assent,et al.  Relevant Subspace Clustering: Mining the Most Interesting Non-redundant Concepts in High Dimensional Data , 2009, 2009 Ninth IEEE International Conference on Data Mining.

[20]  Martin Ester,et al.  P3C: A Robust Projected Clustering Algorithm , 2006, Sixth International Conference on Data Mining (ICDM'06).

[21]  Hans-Peter Kriegel,et al.  Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering , 2009, TKDD.