Modeling Outlier Score Distributions

A common approach to outlier detection is to provide a ranked list of objects based on an estimated outlier score for each object. A major problem with such an approach is deciding how many objects from the ranked list should be declared outliers. Other outlier detection methods transform the outlier scores into probability values and then apply a user-defined threshold to identify outliers. These thresholds are often chosen ad hoc and are hard to justify, and an incorrect threshold can seriously reduce detection accuracy. To address these problems, we propose a formal approach to analyse the outlier scores in order to automatically discriminate between outliers and inliers. Specifically, we devise a probabilistic approach to model the score distributions of outlier scoring algorithms. The probability density function of the outlier scores is estimated from the scores themselves, and the outlier objects are then identified automatically from the fitted model rather than from a hand-picked cut-off.
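The idea of estimating the score density and reading the outliers off the fitted model can be illustrated with a minimal sketch. The abstract does not state which distribution family or estimation procedure is used, so everything below is an assumption made for illustration: a two-component one-dimensional Gaussian mixture fitted by EM, the convention that higher scores are more outlying, a posterior cut at 0.5, and the hypothetical helper names `fit_two_component_mixture` and `outlier_posteriors`. The synthetic scores are made up for the example.

```python
import math


def normal_pdf(x, mean, std):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def fit_two_component_mixture(scores, iterations=100):
    """Fit a 1-D two-component Gaussian mixture with EM.

    Assumption for this sketch: one component models inlier scores and
    the other models outlier scores. Returns (weights, means, stds).
    """
    ordered = sorted(scores)
    n = len(ordered)
    # Start the first component in the bulk of the scores and the
    # second one at the largest score.
    means = [ordered[n // 4], ordered[-1]]
    spread = (ordered[-1] - ordered[0]) / 4.0 or 1.0
    stds = [spread, spread]
    weights = [0.9, 0.1]
    for _ in range(iterations):
        # E-step: responsibility of each component for each score.
        resp = []
        for x in scores:
            p = [weights[k] * normal_pdf(x, means[k], stds[k]) for k in range(2)]
            total = (p[0] + p[1]) or 1e-300
            resp.append((p[0] / total, p[1] / total))
        # M-step: re-estimate weight, mean and spread of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp) or 1e-300
            means[k] = sum(r[k] * x for r, x in zip(resp, scores)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, scores)) / nk
            stds[k] = max(math.sqrt(var), 1e-6)
            weights[k] = nk / len(scores)
    return weights, means, stds


def outlier_posteriors(scores, weights, means, stds):
    """Posterior probability that each score belongs to the component
    with the larger mean (assumed here to be the outlier component)."""
    hi = 0 if means[0] > means[1] else 1
    lo = 1 - hi
    out = []
    for x in scores:
        p_hi = weights[hi] * normal_pdf(x, means[hi], stds[hi])
        p_lo = weights[lo] * normal_pdf(x, means[lo], stds[lo])
        out.append(p_hi / ((p_hi + p_lo) or 1e-300))
    return out


# Synthetic scores: 200 inliers near 1.0 followed by 5 clearly larger scores.
scores = [1.0 + 0.01 * (i % 21 - 10) for i in range(200)]
scores += [2.6, 2.8, 3.0, 3.2, 3.4]

weights, means, stds = fit_two_component_mixture(scores)
posteriors = outlier_posteriors(scores, weights, means, stds)
# Flag an object when the outlier component is the more probable one.
flagged = [i for i, p in enumerate(posteriors) if p > 0.5]
```

In this sketch the number of flagged objects follows from the fitted density, so no cut-off position in the ranked list has to be chosen by hand. A Gaussian mixture is only one possible choice; other families may suit bounded or skewed scores better.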
