Fast Distance-based Outlier Detection in Data Streams based on Micro-clusters

Continuous outlier detection in data streams is an important topic in data mining, with applications in public health, network intrusion detection, and fraud detection. Over the last two decades, many studies have examined distance-based outlier detection algorithms, which are viable, scalable, and parameter-free approaches. Because streaming data points arrive and expire over time, the challenge is to monitor the outlier status of data points efficiently in both time and space. In this study, we propose three algorithms: O-MCOD, U-MCOD, and M-MCOD. These algorithms improve upon MCOD, the state-of-the-art algorithm for distance-based outlier detection in data streams, by relaxing the constraints of micro-clusters and applying the minimal probing principle. Through extensive experiments on synthetic and real-world datasets, we show that the proposed algorithms are superior in time and space efficiency. Specifically, our proposed algorithms are 1.5 to 95 times faster than MCOD, require as little as 25% of MCOD's peak memory, and outperform the most recent algorithm, NETS.
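To make the monitoring problem concrete, the following is a minimal sketch of a naive baseline, not MCOD or any of the proposed algorithms. It assumes the standard distance-based definition (a point is an outlier if fewer than k other points in the current window lie within distance R of it) and a count-based sliding window; neither is spelled out in this abstract. The function name and the parameters `window_size`, `radius`, and `k` are hypothetical and chosen only for illustration.

```python
from collections import deque
from math import dist


def sliding_window_outliers(stream, window_size, radius, k):
    """Naive baseline (illustrative assumption, not the paper's method).

    After each arrival, recompute the outlier status of every point in
    the current count-based window: a point is an outlier if fewer than
    k other window points lie within `radius` of it.
    """
    window = deque()  # holds (arrival_index, point) pairs
    for i, point in enumerate(stream):
        window.append((i, point))
        if len(window) > window_size:
            window.popleft()  # the oldest point expires

        outliers = []
        for idx, p in window:
            # Count neighbors of p among the other points in the window.
            neighbors = sum(
                1 for jdx, q in window
                if jdx != idx and dist(p, q) <= radius
            )
            if neighbors < k:
                outliers.append(idx)
        yield i, outliers


if __name__ == "__main__":
    points = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (0.1, 0.1)]
    for step, found in sliding_window_outliers(points, 4, 0.5, 2):
        print(step, found)
```

This baseline recomputes all pairwise distances in the window at every arrival, so its cost per slide is quadratic in the window size. That cost is what motivates incremental approaches such as micro-cluster-based methods.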
