Data Editing Techniques to Allow the Application of Distance-Based Outlier Detection to Streams

The problem of finding outliers in data has broad applications in areas as diverse as data cleaning, fraud detection, network monitoring, and invasive species monitoring. Although dozens of techniques have been proposed to solve this problem for static data collections, very simple distance-based outlier detection methods are known to be competitive with, or superior to, more complex methods. However, distance-based methods have time and space complexities that make them impractical for streaming data and/or resource-limited sensors. In this work, we show that simple data-editing techniques can make distance-based outlier detection practical for very fast streams and resource-limited sensors. Our technique generalizes to produce two algorithms, each with a guarantee relative to the original algorithm: one produces no false positives, and the other produces no false negatives. Our methods are independent of both data type and distance measure, and are thus broadly applicable.
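To make the general idea concrete, the following is a minimal illustrative sketch, not the algorithm of this paper. It assumes a nearest-neighbor rule (an item is an outlier if its nearest reference item is farther than a radius r) and a simple greedy editing step that discards reference items already covered by a kept item. The names condense, is_outlier, and r are hypothetical and chosen only for illustration. The distance function is passed in as a parameter, so the sketch does not depend on the data type.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def condense(data: Sequence[T], dist: Callable[[T, T], float], r: float) -> List[T]:
    """Greedy data editing (illustrative assumption, not the paper's method):
    keep an item only if every previously kept item is farther than r from it."""
    kept: List[T] = []
    for x in data:
        if all(dist(x, k) > r for k in kept):
            kept.append(x)
    return kept


def is_outlier(x: T, reference: Sequence[T], dist: Callable[[T, T], float], r: float) -> bool:
    """Flag x as an outlier if its nearest reference item is farther than r.
    An empty reference set flags everything."""
    nearest = min((dist(x, k) for k in reference), default=float("inf"))
    return nearest > r


def abs_diff(a: float, b: float) -> float:
    """Example distance for 1-D numeric data; any distance function could be used."""
    return abs(a - b)


# Toy usage: five training values are edited down to two exemplars.
training = [0.0, 0.1, 0.2, 1.0, 1.05]
reference = condense(training, abs_diff, r=0.5)

normal_flag = is_outlier(0.3, reference, abs_diff, r=0.5)
anomaly_flag = is_outlier(5.0, reference, abs_diff, r=0.5)
```

Because the edited reference set in this sketch is a subset of the original data, any item it accepts as normal would also be accepted by the same rule applied to the full data. The sketch may therefore flag extra items, but it never misses an outlier that the unedited rule would flag. Each arriving item costs one pass over the small reference set rather than over all data seen so far.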
