A fast outlier detection strategy for distributed high-dimensional data sets with mixed attributes

Outlier detection has attracted substantial attention in many applications and research areas; prominent applications include network intrusion detection and credit card fraud detection. Many existing approaches are based on calculating distances among the points in the dataset. These approaches cannot easily adapt to current datasets, which usually contain a mix of categorical and continuous attributes and may be distributed among different geographical locations. In addition, current datasets usually have a large number of dimensions; such datasets tend to be sparse, and traditional concepts such as Euclidean distance or nearest neighbor become unsuitable. We propose a fast distributed outlier detection strategy intended for datasets containing mixed attributes. The proposed method takes the sparseness of the dataset into consideration and is experimentally shown to be highly scalable with the number of points and the number of attributes in the dataset. Experimental results show that the proposed method compares very favorably with other state-of-the-art outlier detection strategies proposed in the literature, and that the speedup achieved by its distributed version is very close to linear.
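The remark that Euclidean distance and nearest neighbor become unsuitable in high dimensions can be illustrated with a small experiment. The sketch below is not the proposed method; it only measures how the gap between the nearest and farthest point from a query shrinks as the number of dimensions grows. The function name `relative_contrast`, the uniform synthetic data, and the sample sizes are all assumptions made for illustration.

```python
import math
import random


def relative_contrast(n_points, n_dims, seed=0):
    """Return (d_max - d_min) / d_min for one random query point.

    Points are drawn uniformly from the unit hypercube (an assumption;
    real datasets are not uniform). A small value means the nearest and
    farthest neighbors are almost equally far from the query.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))  # Euclidean distance
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min


if __name__ == "__main__":
    # Print the contrast for an increasing number of dimensions.
    for n_dims in (2, 10, 100, 500):
        print(n_dims, round(relative_contrast(200, n_dims), 3))
```

Under these assumptions the contrast drops sharply as the dimensionality increases, which is one reason purely distance-based outlier scores lose discriminative power on high-dimensional data.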
