Sharing-Aware Outlier Analytics over High-Volume Data Streams

Real-time analytics of anomalous phenomena on streaming data typically relies on processing a large variety of continuous outlier detection requests, each configured with different parameter settings. The processing of such complex outlier analytics workloads is resource consuming due to the algorithmic complexity of the outlier mining process. In this work we propose a sharing-aware multi-query execution strategy for outlier detection on data streams called SOP. A key insight of SOP is to transform the problem of handling a multi-query outlier analytics workload into a single-query skyline computation problem. We prove that the output of the skyline computation process corresponds to the minimal information needed for determining the outlier status of any point in the stream. Based on this new formulation, we design a customized skyline algorithm called K-SKY that leverages the domination relationships among the streaming data points to minimize the number of data points that must be evaluated for supporting multi-query outlier detection. Based on this K-SKY algorithm, our SOP solution achieves minimal utilization of both computational and memory resources for the processing of these complex outlier analytics workload. Our experimental study demonstrates that SOP consistently outperforms the state-of-art solutions by three orders of magnitude in CPU time, while only consuming 5% of their memory footprint - a clear win-win. Furthermore, SOP is shown to scale to large workloads composed of thousands of parameterized queries.

[1]  B. E. Eckbo,et al.  Appendix , 1826, Epilepsy Research.

[2]  Fabrizio Angiulli,et al.  Distance-based outlier queries in data streams: the novel task and algorithms , 2010, Data Mining and Knowledge Discovery.

[3]  Alex Nazaruk,et al.  Big data in capital markets , 2013, SIGMOD '13.

[4]  Jennifer Widom,et al.  The CQL continuous query language: semantic foundations and query execution , 2006, The VLDB Journal.

[5]  A. Madansky Identification of Outliers , 1988 .

[6]  Kyriakos Mouratidis,et al.  Continuous Nearest Neighbor Queries over Sliding Windows , 2007, IEEE Transactions on Knowledge and Data Engineering.

[7]  Jennifer Widom,et al.  Resource Sharing in Continuous Sliding-Window Aggregates , 2004, VLDB.

[8]  Sridhar Ramaswamy,et al.  Efficient algorithms for mining outliers from large data sets , 2000, SIGMOD '00.

[9]  Matthew O. Ward,et al.  Mapping Nominal Values to Numbers for Effective Visualization , 2003, IEEE Symposium on Information Visualization 2003 (IEEE Cat. No.03TH8714).

[10]  Clara Pizzuti,et al.  Fast Outlier Detection in High Dimensional Spaces , 2002, PKDD.

[11]  Chetan Gupta,et al.  CHAOS: A Data Stream Analysis Architecture for Enterprise Applications , 2009, 2009 IEEE Conference on Commerce and Enterprise Computing.

[12]  Yufei Tao,et al.  Maintaining sliding window skylines on data streams , 2006, IEEE Transactions on Knowledge and Data Engineering.

[13]  Yannis Manolopoulos,et al.  Continuous monitoring of distance-based outliers over data streams , 2011, 2011 IEEE 27th International Conference on Data Engineering.

[14]  Dimitrios Gunopulos,et al.  Online outlier detection in sensor data using non-parametric models , 2006, VLDB.

[15]  Elke A. Rundensteiner,et al.  State-slice: new paradigm of multi-query optimization of window-based stream queries , 2006, VLDB.

[16]  Raymond T. Ng,et al.  Algorithms for Mining Distance-Based Outliers in Large Datasets , 1998, VLDB.

[17]  Lei Cao,et al.  Scalable distance-based outlier detection over high-volume data streams , 2014, 2014 IEEE 30th International Conference on Data Engineering.

[18]  Michael J. Franklin,et al.  On-the-fly sharing for streamed aggregation , 2006, SIGMOD Conference.

[19]  Walid G. Aref,et al.  Scheduling for shared window joins over data streams , 2003, VLDB.

[20]  Beng Chin Ooi,et al.  Efficiently Processing Continuous k-NN Queries on Data Streams , 2007, 2007 IEEE 23rd International Conference on Data Engineering.

[21]  Stephen D. Bay,et al.  Mining distance-based outliers in near linear time with randomization and a simple pruning rule , 2003, KDD '03.

[22]  Bernhard Seeger,et al.  Progressive skyline computation in database systems , 2005, TODS.