A validity index for outlier detection

Defining a boundary between inliers and outliers is a major challenge in unsupervised outlier detection. In the absence of labeled data, the true set of outliers is unknown, so detected outliers cannot be checked against it. This places the burden on both the choice of an efficient outlier detection criterion and the selection of its parameters. Although numerous unsupervised outlier detection criteria, each with different parameters, have been proposed, an unsupervised way to evaluate the detected outliers is still missing. This work introduces a theoretical basis and proposes a validity index for evaluating the quality of outliers. This is not a trivial problem when nothing is known about the structure and density of the data. The proposed index considers the outlierness quality, the deviation between the characteristics of outliers and inliers, and the data distortion. The index is evaluated on low- and high-dimensional data sets.
