A Shared Execution Strategy for Multiple Pattern Mining Requests over Streaming Data

In diverse applications ranging from stock trading to traffic monitoring, popular data streams are typically monitored by multiple analysts for patterns of interest. These analysts may submit similar pattern mining requests, such as cluster detection queries, yet customized with different parameter settings. In this work, we present an efficient shared execution strategy for processing a large number of density-based cluster detection queries with arbitrary parameter settings. Given the high algorithmic complexity of the clustering process and the real-time responsiveness required by streaming applications, serving multiple such queries in a single system is extremely resource intensive. The naive method of detecting and maintaining clusters for different queries independently is often infeasible in practice, as its demands on system resources increase dramatically with the cardinality of the query workload. To overcome this, we analyze the interrelations between the cluster sets identified by queries with different parameter settings, including both pattern-specific and window-specific parameters. We introduce the notion of the growth property among the cluster sets identified by different queries, and characterize the conditions under which it holds. By exploiting this growth property, we propose a uniform solution, called Chandi, which represents the identified cluster sets as a single compact structure and performs integrated maintenance on them, resulting in significant sharing of computational and memory resources. Our comprehensive experimental study, using real data streams from the domains of stock trading and moving object monitoring, demonstrates that Chandi is on average four times faster than the best alternative methods, while using 85% less memory space in our test cases. It also shows that Chandi scales to large numbers of queries, on the order of hundreds or even thousands, under high input data rates.
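The abstract does not spell out the growth property, so the following is only a minimal sketch of one plausible reading: for two density-based queries that share the same count threshold but use different range thresholds, every cluster found under the tighter range is contained in some cluster found under the looser range. The function names (`core_clusters`, `is_nested`), the parameters `eps` and `min_pts`, and the toy data are all assumptions made for illustration. This is not the Chandi algorithm, it ignores border points and sliding windows, and it recomputes each query from scratch.

```python
from math import dist


def core_clusters(points, eps, min_pts):
    """Cluster core points only (a simplification of density-based clustering).

    A point is a core point if at least `min_pts` points, itself included,
    lie within distance `eps`. Core points within `eps` of each other are
    placed in the same cluster. Border and noise points are left out.
    """
    n = len(points)
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    core = {i for i in range(n) if len(neighbors[i]) >= min_pts}

    seen, clusters = set(), []
    for start in sorted(core):
        if start in seen:
            continue
        # Depth-first traversal over core points reachable from `start`.
        stack, members = [start], set()
        seen.add(start)
        while stack:
            i = stack.pop()
            members.add(i)
            for j in neighbors[i]:
                if j in core and j not in seen:
                    seen.add(j)
                    stack.append(j)
        clusters.append(frozenset(members))
    return clusters


def is_nested(fine, coarse):
    """True if every cluster in `fine` is a subset of some cluster in `coarse`."""
    return all(any(c <= big for big in coarse) for c in fine)


# Hypothetical toy data: two dense groups on a line plus one far outlier.
points = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0),
          (2.5, 0.0), (3.0, 0.0), (3.5, 0.0),
          (10.0, 10.0)]

tight = core_clusters(points, eps=0.6, min_pts=2)  # query with the smaller range
loose = core_clusters(points, eps=1.6, min_pts=2)  # query with the larger range

print(sorted(sorted(c) for c in tight))
print(sorted(sorted(c) for c in loose))
print(is_nested(tight, loose))
```

Under these assumptions the nesting holds because enlarging the range can only add neighbors: a core point stays a core point, and two connected core points stay connected. A shared strategy could in principle store the tighter clusters once and record only how they merge under the looser query, rather than keeping an independent copy per query.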
