SVDD-based outlier detection on uncertain data

Outlier detection is an important problem that has been studied in diverse research areas and application domains. Most existing methods assume that each example can be labeled exactly as either normal or an outlier. In many real-life applications, however, data are inherently uncertain because of various errors or incompleteness. This uncertainty makes outliers far more difficult to detect than in clearly separable data. The key challenge in handling uncertain data for outlier detection is to reduce its impact on the learned distinctive classifier. This paper proposes a new approach, based on support vector data description (SVDD), for detecting outliers in uncertain data. The approach operates in two steps. In the first step, a pseudo-training set is generated by assigning each input example a confidence score that indicates the likelihood that the example belongs to the normal class. In the second step, the confidence scores are incorporated into the SVDD training phase to construct a global distinctive classifier for outlier detection. In this phase, the examples with the lowest confidence scores contribute less to the construction of the decision boundary. Experiments show that the proposed approach outperforms state-of-the-art outlier detection techniques.
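
The two steps can be illustrated with a minimal sketch, which is not the paper's implementation. The abstract does not say how confidence scores are computed, so `confidence_scores` uses a hypothetical heuristic (distance to the coordinate-wise median). `fit_weighted_svdd` assumes a linear, input-space SVDD whose slack penalty for each example is scaled by its confidence, and it minimises that objective with plain subgradient descent rather than a quadratic-programming solver or a kernel. All function names, the parameter `C`, and the toy data are assumptions made for illustration.

```python
import math
import statistics


def confidence_scores(points):
    """Step 1 (hypothetical heuristic): scores in [0.1, 1], higher near the median."""
    dim = len(points[0])
    median = [statistics.median(p[k] for p in points) for k in range(dim)]
    dists = [math.dist(p, median) for p in points]
    scale = max(dists) or 1.0  # avoid dividing by zero if all points coincide
    return [1.0 - 0.9 * d / scale for d in dists]


def sq_dist(x, center):
    return sum((xk - ck) ** 2 for xk, ck in zip(x, center))


def fit_weighted_svdd(points, weights, C=2.0, steps=5000, lr=0.05):
    """Step 2 (sketch): minimise R^2 + C * sum_i w_i * max(0, ||x_i - a||^2 - R^2)."""
    dim = len(points[0])
    center = [sum(p[k] for p in points) / len(points) for k in range(dim)]
    r2 = 1.0  # squared radius R^2
    for t in range(steps):
        grad_center = [0.0] * dim
        grad_r2 = 1.0
        for p, w in zip(points, weights):
            if sq_dist(p, center) > r2:  # only examples outside the sphere are penalised
                grad_r2 -= C * w
                for k in range(dim):
                    grad_center[k] += 2.0 * C * w * (center[k] - p[k])
        step = lr / math.sqrt(t + 1)  # diminishing step size
        center = [c - step * g for c, g in zip(center, grad_center)]
        r2 = max(0.0, r2 - step * grad_r2)
    return center, r2


def is_outlier(x, center, r2):
    return sq_dist(x, center) > r2


# Toy data: five points in the unit square plus one distant point.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5), (5.0, 5.0)]
weights = confidence_scores(points)
center, r2 = fit_weighted_svdd(points, weights)
print("distant point flagged:", is_outlier((5.0, 5.0), center, r2))
print("central point flagged:", is_outlier((0.5, 0.5), center, r2))
```

In this sketch a low-confidence example pays only a small penalty for lying outside the sphere, so it has little power to stretch the boundary toward itself. Setting every weight to 1 with the same `C` would let the distant point pull the sphere much more strongly.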
