A Mixture Model-Based Combination Approach for Outlier Detection

In this paper, we propose an approach that combines different outlier detection algorithms in order to gain an improved effectiveness. To this end, we first estimate an outlier score vector for each data object. Each element of the estimated vectors corresponds to an outlier score produced by a specific outlier detection algorithm. We then use the multivariate beta mixture model to cluster the outlier score vectors into several components so that the component that corresponds to the outliers can be identified. A notable feature of the proposed approach is the automatic identification of outliers, while most existing methods return only a ranked list of points, expecting the outliers to come first; or require empirical threshold estimation to identify outliers. Experimental results, on both synthetic and real data sets, show that our approach substantially enhances the accuracy of outlier base detectors considered in the combination and overcome their drawbacks.

[1]  Vipin Kumar,et al.  Feature bagging for outlier detection , 2005, KDD '05.

[2]  Victoria J. Hodge,et al.  A Survey of Outlier Detection Methodologies , 2004, Artificial Intelligence Review.

[3]  Anil K. Jain,et al.  Unsupervised Learning of Finite Mixture Models , 2002, IEEE Trans. Pattern Anal. Mach. Intell..

[4]  Jing Gao,et al.  Converting Output Scores from Outlier Detection Algorithms into Probability Estimates , 2006, Sixth International Conference on Data Mining (ICDM'06).

[5]  Clara Pizzuti,et al.  Fast Outlier Detection in High Dimensional Spaces , 2002, PKDD.

[6]  B. S. Manjunath,et al.  The multiRANSAC algorithm and its application to detect planar homographies , 2005, IEEE International Conference on Image Processing 2005.

[7]  James C. Bezdek,et al.  Pattern Recognition with Fuzzy Objective Function Algorithms , 1981, Advanced Applications in Pattern Recognition.

[8]  Michael Georgiopoulos,et al.  Non-derivable itemsets for fast outlier detection in large high-dimensional categorical data , 2011, Knowledge and Information Systems.

[9]  L. J. Bain,et al.  Introduction to Probability and Mathematical Statistics , 1987 .

[10]  G. Schwarz Estimating the Dimension of a Model , 1978 .

[11]  Elke Achtert,et al.  Evaluation of Clusterings -- Metrics and Visual Support , 2012, 2012 IEEE 28th International Conference on Data Engineering.

[12]  Graham J. Williams,et al.  On-Line Unsupervised Outlier Detection Using Finite Mixtures with Discounting Learning Algorithms , 2000, KDD '00.

[13]  Hans-Peter Kriegel,et al.  Interpreting and Unifying Outlier Scores , 2011, SDM.

[14]  Hans-Peter Kriegel,et al.  LOF: identifying density-based local outliers , 2000, SIGMOD '00.

[15]  Mohamed Bouguessa,et al.  An Unsupervised Approach for Identifying Spammers in Social Networks , 2011, 2011 IEEE 23rd International Conference on Tools with Artificial Intelligence.

[16]  Jonathan Goldstein,et al.  When Is ''Nearest Neighbor'' Meaningful? , 1999, ICDT.

[17]  Srinivasan Parthasarathy,et al.  Fast Distributed Outlier Detection in Mixed-Attribute Data Sets , 2006, Data Mining and Knowledge Discovery.

[18]  Raymond T. Ng,et al.  Algorithms for Mining Distance-Based Outliers in Large Datasets , 1998, VLDB.

[19]  Vivekanand Gopalkrishnan,et al.  Mining Outliers with Ensemble of Heterogeneous Detectors on Random Subspaces , 2010, DASFAA.

[20]  Hans-Peter Kriegel,et al.  On Evaluation of Outlier Rankings and Outlier Scores , 2012, SDM.

[21]  Shengrui Wang,et al.  An objective approach to cluster validation , 2006, Pattern Recognit. Lett..

[22]  Nizar Bouguila,et al.  Practical Bayesian estimation of a finite beta mixture through gibbs sampling and its applications , 2006, Stat. Comput..

[23]  Shengrui Wang,et al.  Mining Projected Clusters in High-Dimensional Spaces , 2009, IEEE Transactions on Knowledge and Data Engineering.

[24]  Arne Leijon,et al.  Beta mixture models and the application to image classification , 2009, 2009 16th IEEE International Conference on Image Processing (ICIP).

[25]  Ingram Olkin,et al.  Multivariate Beta Distributions and Independence Properties of the Wishart Distribution , 1964 .

[26]  Zhanyu Ma Non-Gaussian Statistical Modelsand Their Applications , 2011 .

[27]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[28]  Sridhar Ramaswamy,et al.  Efficient algorithms for mining outliers from large data sets , 2000, SIGMOD '00.

[29]  Ke Zhang,et al.  A New Local Distance-Based Outlier Detection Approach for Scattered Real-World Data , 2009, PAKDD.