A vertical distance-based outlier detection method with local pruning

"One person's noise is another person's signal." Outlier detection is used both to clean datasets and to discover useful anomalies, such as criminal activity in electronic commerce, computer intrusion attacks, terrorist threats, and agricultural pest infestations. Outlier detection is therefore critically important in an information-based society. This paper focuses on finding outliers in large datasets using distance-based methods. First, to speed up outlier detection, we revise Knorr and Ng's distance-based outlier definition; second, we adopt a vertical data structure, instead of traditional horizontal structures, to make outlier detection more efficient still. We tested our methods on a National Hockey League dataset and show an order-of-magnitude speed improvement over contemporary distance-based outlier detection approaches.
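For orientation, the original Knorr and Ng definition that the abstract starts from can be sketched as follows: an object is a DB(p, D)-outlier if at least a fraction p of the objects in the dataset lie farther than distance D from it. The sketch below is a naive quadratic scan over horizontally stored points, assuming Euclidean distance; it is not the revised definition or the vertical structure proposed in the paper, neither of which the abstract spells out, and the function name `db_outliers` is hypothetical.

```python
import math

def db_outliers(points, p, d):
    """Return indices of DB(p, d)-outliers using a naive O(n^2) scan.

    Assumption: Euclidean distance, and the fraction is taken over all
    objects in the dataset (the point itself is at distance 0, so it
    never counts as "far").
    """
    n = len(points)
    outliers = []
    for i, a in enumerate(points):
        # Count objects lying strictly farther than d from point a.
        far = sum(1 for b in points if math.dist(a, b) > d)
        if far / n >= p:
            outliers.append(i)
    return outliers

# Four clustered points and one distant point.
sample = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(db_outliers(sample, p=0.8, d=3.0))  # → [4]
```

Every point is compared against every other, which is the cost that pruning and faster data structures aim to reduce.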