Halite: Fast and Scalable Multiresolution Local-Correlation Clustering

This paper proposes Halite, a novel, fast, and scalable clustering method that looks for clusters in subspaces of multidimensional data. Existing methods are typically superlinear in space or execution time. Halite's strengths are that it is fast and scalable while still giving highly accurate results. Specifically, the main contributions of Halite are: 1) Scalability: it is linear or quasi-linear in time and space with respect to the data size and dimensionality, and to the dimensionality of the clusters' subspaces; 2) Usability: it is deterministic, robust to noise, does not take the number of clusters as an input parameter, and detects clusters in subspaces generated by the original axes or by their linear combinations, including space rotation; 3) Effectiveness: it is accurate, providing results of equal or better quality compared to top related works; and 4) Generality: it includes a soft clustering approach. Experiments were performed on synthetic data ranging from five to 30 axes and up to 1 million points. Halite was on average at least 12 times faster than seven representative works, and always presented highly accurate results. On real data, Halite was at least 11 times faster than the other methods, with accuracy up to 35 percent higher. Finally, we report experiments in a real scenario where soft clustering is desirable.
