Interactive Exploration of Subspace Clusters for High Dimensional Data

PreDeCon is a fundamental clustering algorithm for finding arbitrarily shaped clusters hidden in high-dimensional feature spaces, an important research topic with many potential applications. However, it suffers from very high runtime and offers no interaction with users. Our algorithm, called AnyPDC, addresses both problems by casting PreDeCon as an anytime algorithm: it quickly produces an approximate result and then iteratively refines it until it reaches the result of PreDeCon. This scheme not only speeds up the algorithm significantly but also lets users interact with it during execution. Experiments on large real-world datasets show that AnyPDC obtains good approximate results very early, yielding an order-of-magnitude speedup over PreDeCon. More interestingly, while anytime techniques usually end up slower than their batch counterparts, AnyPDC is faster than PreDeCon even when it runs to completion.
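
The anytime scheme described above (a quick approximate result, refined step by step toward the batch result, and interruptible at any point) can be illustrated with a minimal sketch. This is not AnyPDC itself: the function name `anytime_components`, the edge-list input, and the `batch_size` parameter are all assumptions made for illustration, and a simple union-find merge stands in for the actual density-based cluster merging.

```python
from typing import Iterator, List, Tuple


def anytime_components(
    n: int, edges: List[Tuple[int, int]], batch_size: int
) -> Iterator[List[int]]:
    """Yield one cluster label per point after each batch of edges is merged.

    Hypothetical stand-in for an anytime clustering loop: every yielded
    labeling is a usable approximation, and the last one equals the result
    of processing all edges in a single batch.
    """
    parent = list(range(n))  # each point starts in its own cluster

    def find(x: int) -> int:
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for start in range(0, len(edges), batch_size):
        for a, b in edges[start:start + batch_size]:
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[rb] = ra  # merge the two clusters
        # Intermediate result: the caller may inspect it and stop here.
        yield [find(i) for i in range(n)]


if __name__ == "__main__":
    # Toy input (assumed): 5 points, 3 "connected" pairs, one pair per step.
    for labels in anytime_components(5, [(0, 1), (2, 3), (1, 2)], batch_size=1):
        print(len(set(labels)), "clusters:", labels)
```

Because the function is a generator, the caller decides when to stop: breaking out of the loop early keeps the latest approximation, while exhausting it gives the same labeling as a single batch pass.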
