Scalable Multi-Parameter Outlier Detection Technology

The real-time detection of anomalous phenomena on streaming data has become increasingly important for applications ranging from fraud detection and financial analysis to traffic management. In these streaming applications, a large number of similar continuous outlier detection queries are often executed concurrently. In light of the high algorithmic complexity of detecting and maintaining outlier patterns independently for each parameter setting, we propose a shared execution methodology called SOP that handles a large batch of requests with diverse pattern configurations. First, our systematic analysis reveals opportunities for maximum resource sharing by leveraging commonalities among outlier detection queries. To exploit them, we introduce a sharing strategy that integrates all computation results into one compact data structure and leverages temporal relationships among stream data points to prioritize the probing process. Second, this work is the first to consider predicate constraints in the outlier detection context. By distinguishing between target and scope constraints, customized fragment sharing and block selection strategies can be applied to maximize the efficiency of system resource utilization. Our experimental studies on real stream data demonstrate that our approach performs three orders of magnitude faster than the state-of-the-art and scales to thousands of queries.
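The idea of sharing work across queries with different parameter settings can be illustrated with a minimal sketch. It is not the SOP algorithm or its data structure. It assumes the common distance-based definition, in which a point is an outlier if it has fewer than k neighbors within distance r, and it evaluates a single static window. The function name `shared_outliers` and the `(r, k)` query format are hypothetical and chosen only for illustration.

```python
from bisect import bisect_right
from math import dist


def shared_outliers(window, queries):
    """Answer several (r, k) outlier queries over one window of points.

    Assumption: a point is an outlier for (r, k) if fewer than k other
    points in the window lie within distance r of it.
    """
    # Shared step: compute each point's sorted neighbor distances once,
    # independent of any particular query's parameters.
    sorted_dists = []
    for i, p in enumerate(window):
        ds = sorted(dist(p, q) for j, q in enumerate(window) if j != i)
        sorted_dists.append(ds)

    # Per-query step: count neighbors within r by binary search on the
    # shared sorted lists instead of rescanning the window.
    results = {}
    for r, k in queries:
        results[(r, k)] = [
            i for i, ds in enumerate(sorted_dists)
            if bisect_right(ds, r) < k
        ]
    return results


window = [(0, 0), (0, 1), (1, 0), (10, 10)]
queries = [(1.5, 2), (20.0, 3), (1.5, 3)]
answers = shared_outliers(window, queries)
# answers[(1.5, 2)] == [3]: only the far point has fewer than 2 close neighbors
```

The pairwise distance computation dominates the cost and is done once for all queries, so each additional query adds only a binary search per point. A streaming method would also have to maintain this state incrementally as the window slides, which this sketch does not attempt.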
