Outlier detection from large distributed databases

In this paper, we present DISTROD (DISTRibuted Outlier Detector), a system for detecting outliers, namely abnormal instances or observations, in multiple large distributed databases. DISTROD detects global outliers across the distributed databases, and its results are consistent with those produced by the centralized detection paradigm. It is equipped with a number of optimization and boosting strategies that improve its speed and reduce its communication overhead. Experimental evaluation demonstrates the good performance of DISTROD in terms of both speed and communication overhead.
