Interactive Exploration of Subspace Clusters on Multicore Processors

The PreDeCon clustering algorithm finds arbitrarily shaped clusters in high-dimensional feature spaces, a problem that remains an active research topic with many potential applications. However, PreDeCon suffers from poor runtime performance and offers no means of user interaction. Our new method, AnyPDC, addresses both problems by casting PreDeCon into an anytime algorithm. Under this anytime scheme, AnyPDC quickly produces an approximate result and then iteratively refines it until it reaches the result of PreDeCon. AnyPDC not only significantly speeds up PreDeCon clustering but also allows users to interact with the algorithm during its execution. In addition, by maintaining an underlying cluster structure consisting of so-called primitive clusters and by processing neighborhood queries in blocks, AnyPDC can be executed efficiently in parallel on shared-memory architectures such as multicore processors. Experiments on large real-world datasets show that AnyPDC achieves high-quality approximate results early on, leading to orders-of-magnitude speedups compared to PreDeCon. Moreover, while anytime techniques are usually slower than their batch counterparts, AnyPDC is actually faster than PreDeCon even when run to completion. AnyPDC also scales well with the number of threads on multicore CPUs.
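
The anytime scheme described above can be illustrated with a small sketch. It is not the AnyPDC algorithm: it assumes plain Euclidean distance in place of PreDeCon's subspace preference weighted distance, uses a union-find structure as a stand-in for primitive clusters, and runs sequentially. The function name `anytime_density_clusters` and its parameters `eps`, `min_pts`, and `block_size` are hypothetical names chosen for illustration. The sketch only shows the general idea of issuing neighborhood queries in blocks and exposing an intermediate clustering after each block.

```python
import math
from typing import Iterator, List, Sequence


def anytime_density_clusters(
    points: Sequence[Sequence[float]],
    eps: float,
    min_pts: int,
    block_size: int,
) -> Iterator[List[int]]:
    """Yield one label list per processed block; -1 means noise or not yet known."""
    n = len(points)
    parent = list(range(n))  # union-find forest over point indices

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(a: int, b: int) -> None:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # smaller index becomes the root

    neighbors = {}  # queried point index -> indices within eps (including itself)
    is_core = [False] * n

    for start in range(0, n, block_size):
        block = range(start, min(start + block_size, n))

        # Neighborhood queries of one block are independent of each other,
        # so this loop is the part that could be distributed over threads.
        for i in block:
            neighbors[i] = [
                j for j in range(n) if math.dist(points[i], points[j]) <= eps
            ]
            is_core[i] = len(neighbors[i]) >= min_pts

        # Merge step: connect core points that lie in each other's neighborhood.
        for i in block:
            if is_core[i]:
                for j in neighbors[i]:
                    if is_core[j]:
                        union(i, j)

        # Build the intermediate result from what is known so far.
        labels = [-1] * n
        for i in range(n):
            if is_core[i]:
                labels[i] = find(i)
        for i, nbrs in neighbors.items():
            if is_core[i]:
                for j in nbrs:
                    if labels[j] == -1:  # border or not-yet-queried point
                        labels[j] = find(i)
        yield labels


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
    for step, labels in enumerate(anytime_density_clusters(data, 1.5, 3, 2), 1):
        print(step, labels)
```

A caller can stop consuming the generator at any point and keep the most recent labels as the approximate result; consuming it to the end yields the same clustering a batch density-based pass would produce under these simplified assumptions, up to the assignment of border points.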
