Real-Time Clustering for Large Sparse Online Visitor Data

Online visitor behaviors are often modeled as a large sparse matrix, where rows represent visitors and columns represent behaviors. To discover customer segments with different hierarchies, marketers often need to cluster the data in different splits. Such analyses require the clustering algorithm to respond in real time to user parameter changes, which current techniques cannot support. In this paper, we propose a real-time clustering algorithm, sparse density peaks, for large-scale sparse data. It pre-processes the input points to compute annotations and a hierarchy for cluster assignment. While the assignment requires only a single scan of the points, naive pre-processing requires measuring all pairwise distances, which incurs a quadratic computation overhead and is infeasible for even moderately sized data. Thus, we propose a new approach based on MinHash and LSH that provides fast and accurate estimations. We also describe an efficient implementation on Spark that addresses data skew and memory usage. Our experiments show that our approach (1) provides a more accurate approximation than a straightforward MinHash and LSH implementation on real datasets, (2) achieves a 20× speedup in the end-to-end clustering pipeline, and (3) can maintain computations with a small memory footprint. Finally, we present an interface to explore customer segments from millions of online visitor records in real time.
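
To make the quadratic-cost argument concrete, the following is a minimal Python sketch of the straightforward MinHash and LSH baseline that the abstract compares against, not the proposed sparse density peaks method. It assumes each visitor is a set of non-zero column indices and that similarity is Jaccard similarity; all names (`make_hash_params`, `minhash_signature`, `lsh_buckets`, the toy `visitors` data) are hypothetical and chosen for illustration only.

```python
import random

# Large Mersenne prime used as the modulus for the hash family (an assumption
# of this sketch; any prime larger than the number of columns would do).
PRIME = (1 << 61) - 1


def make_hash_params(num_hashes, seed=0):
    """Draw (a, b) pairs for hash functions h(x) = (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(items, params):
    """Signature of a non-empty set of non-negative integer column indices."""
    return [min((a * x + b) % PRIME for x in items) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return matches / len(sig_a)


def lsh_buckets(signatures, bands, rows):
    """Split each signature into `bands` chunks of `rows` values and bucket them."""
    buckets = {}
    for key, sig in signatures.items():
        for band in range(bands):
            chunk = tuple(sig[band * rows:(band + 1) * rows])
            buckets.setdefault((band, chunk), []).append(key)
    return buckets


def candidate_pairs(buckets):
    """Visitors sharing at least one bucket become candidate neighbours."""
    pairs = set()
    for members in buckets.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                pairs.add(tuple(sorted((members[i], members[j]))))
    return pairs


# Toy data: each visitor is the set of behaviour columns with non-zero entries.
visitors = {"v1": {1, 5, 9}, "v2": {1, 5, 9}, "v3": {100, 200}}
params = make_hash_params(num_hashes=32, seed=42)
signatures = {k: minhash_signature(v, params) for k, v in visitors.items()}
pairs = candidate_pairs(lsh_buckets(signatures, bands=8, rows=4))
```

Only the pairs returned by `candidate_pairs` would need an exact or estimated distance, which is how LSH avoids comparing every visitor with every other one. The band and row counts here are arbitrary; in practice they trade recall against the number of candidates.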
