Exploiting Local Data Uncertainty to Boost Global Outlier Detection

This paper presents a hybrid approach to outlier detection that incorporates local data uncertainty into the construction of a global classifier. To handle local data uncertainty, we assign a confidence value to each example in the training data; this value measures the strength of the example's class label. The proposed method works in two steps. First, we generate a pseudo training dataset by computing, for each input example, a confidence value on its class label. We present two mechanisms for computing these confidence values from the local data behavior: a kernel k-means clustering algorithm and a kernel LOF-based (local outlier factor) algorithm. Second, we construct a global classifier for outlier detection by generalizing the SVDD-based (support vector data description) learning framework to incorporate both positive and negative examples together with their associated confidence values. By integrating local and global outlier detection, the method explicitly handles uncertainty in the input data and makes SVDD less sensitive to noise. Extensive experiments on real-life datasets demonstrate that the proposed method achieves a better tradeoff between detection rate and false alarm rate than four state-of-the-art outlier detection algorithms.
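The first step (attaching a confidence value to each training example based on its local neighborhood) can be illustrated with a minimal sketch. This is not the paper's kernel k-means or kernel LOF mechanism: it swaps in a plain k-nearest-neighbor distance score in input space, and the function name `knn_confidence`, the parameter `k`, and the scoring formula are all hypothetical choices made for illustration only.

```python
import math


def knn_confidence(points, k=3):
    """Return one confidence value in (0, 1] per point.

    Hypothetical stand-in for the paper's confidence mechanisms:
    a point whose mean distance to its k nearest neighbours is large
    relative to the dataset median receives a low confidence value.
    """
    mean_knn = []
    for i, p in enumerate(points):
        # Distances from p to every other point, nearest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        mean_knn.append(sum(dists[:k]) / k)

    # Median local distance sets the scale; fall back to 1.0 if it is zero.
    scale = sorted(mean_knn)[len(mean_knn) // 2] or 1.0

    # Dense neighbourhood -> value near 1; isolated point -> value near 0.
    return [scale / (scale + d) for d in mean_knn]


# Five points in a tight cluster plus one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
conf = knn_confidence(points, k=3)
for p, c in zip(points, conf):
    print(p, round(c, 3))
```

In this toy data the isolated point receives the lowest confidence. In a second step of the kind the abstract describes, such values could plausibly act as per-example weights on the classifier's error terms, so that low-confidence examples influence the learned boundary less; the exact weighting used by the authors is not given here.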
