Interpreting and Unifying Outlier Scores

Outlier scores provided by different outlier models differ widely in their meaning, range, and contrast between different outlier models and, hence, are not easily comparable or interpretable. We propose a unification of outlier scores provided by various outlier models and a translation of the arbitrary “outlier factors” to values in the range [0, 1] interpretable as values describing the probability of a data object of being an outlier. As an application, we show that this unification facilitates enhanced ensembles for outlier detection.

[1]  Vipin Kumar,et al.  Evaluating boosting algorithms to classify rare classes: comparison and improvements , 2001, Proceedings 2001 IEEE International Conference on Data Mining.

[2]  Gary M. Weiss,et al.  Cost-Sensitive Learning vs. Sampling: Which is Best for Handling Unbalanced Classes with Unequal Error Costs? , 2007, DMIN.

[3]  JapkowiczNathalie,et al.  The class imbalance problem: A systematic study , 2002 .

[4]  David A. Cieslak,et al.  Automatically countering imbalance and its empirical relationship to cost , 2008, Data Mining and Knowledge Discovery.

[5]  Sanjay Chawla,et al.  On local spatial outliers , 2004, Fourth IEEE International Conference on Data Mining (ICDM'04).

[6]  Vivekanand Gopalkrishnan,et al.  Mining Outliers with Ensemble of Heterogeneous Detectors on Random Subspaces , 2010, DASFAA.

[7]  Tadeusz Pietraszek,et al.  Optimizing abstaining classifiers using ROC analysis , 2005, ICML.

[8]  A. H. Murphy,et al.  A Note on the Utility of Probabilistic Predictions and the Probability Score in the Cost-Loss Ratio Decision Situation , 1966 .

[9]  Adam Kowalczyk,et al.  Extreme re-balancing for SVMs: a case study , 2004, SKDD.

[10]  Michel Verleysen,et al.  Improving the Robustness to Outliers of Mixtures of Probabilistic PCAs , 2008, PAKDD.

[11]  Gustavo E. A. P. A. Batista,et al.  A study of the behavior of several methods for balancing machine learning training data , 2004, SKDD.

[12]  Hans-Peter Kriegel,et al.  LoOP: local outlier probabilities , 2009, CIKM.

[13]  Hans-Peter Kriegel,et al.  A General Framework for Increasing the Robustness of PCA-Based Correlation Clustering Algorithms , 2008, SSDBM.

[14]  Frederick Sanders The Verification of Probability Forecasts , 1967 .

[15]  E. Rowland Theory of Games and Economic Behavior , 1946, Nature.

[16]  Sridhar Ramaswamy,et al.  Efficient algorithms for mining outliers from large data sets , 2000, SIGMOD '00.

[17]  Anthony K. H. Tung,et al.  Ranking Outliers Using Symmetric Neighborhood Relationship , 2006, PAKDD.

[18]  Hans-Peter Kriegel,et al.  LOF: identifying density-based local outliers , 2000, SIGMOD '00.

[19]  Raymond T. Ng,et al.  A Unified Notion of Outliers: Properties and Computation , 1997, KDD.

[20]  Hans-Peter Kriegel,et al.  Angle-based outlier detection in high-dimensional data , 2008, KDD.

[21]  Vipin Kumar,et al.  Feature bagging for outlier detection , 2005, KDD '05.

[22]  Christos Faloutsos,et al.  LOCI: fast outlier detection using the local correlation integral , 2003, Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405).

[23]  Bianca Zadrozny,et al.  Learning and making decisions when costs and probabilities are both unknown , 2001, KDD '01.

[24]  Bianca Zadrozny,et al.  Transforming classifier scores into accurate multiclass probability estimates , 2002, KDD.

[25]  Clara Pizzuti,et al.  Fast Outlier Detection in High Dimensional Spaces , 2002, PKDD.

[26]  Anthony K. H. Tung,et al.  Mining top-n local outliers in large databases , 2001, KDD '01.

[27]  Stefan Berchtold,et al.  Efficient Biased Sampling for Approximate Clustering and Outlier Detection in Large Data Sets , 2003, IEEE Trans. Knowl. Data Eng..

[28]  Bianca Zadrozny,et al.  Outlier detection by active learning , 2006, KDD '06.

[29]  Osmar R. Zaïane,et al.  A Nonparametric Outlier Detection for Effectively Discovering Top-N Outliers from Engineering Data , 2006, PAKDD.

[30]  Theodore Johnson,et al.  Fast Computation of 2-Dimensional Depth Contours , 1998, KDD.

[31]  P. Rousseeuw,et al.  Computing depth contours of bivariate point clouds , 1996 .

[32]  David M. Rocke,et al.  Outlier detection in the multiple cluster setting using the minimum covariance determinant estimator , 2004, Comput. Stat. Data Anal..

[33]  A. H. Murphy,et al.  Verification of Probabilistic Predictions: A Brief Review , 1967 .

[34]  Arnold W. M. Smeulders,et al.  The Amsterdam Library of Object Images , 2004, International Journal of Computer Vision.

[35]  Nimrod Megiddo,et al.  Discovery-Driven Exploration of OLAP Data Cubes , 1998, EDBT.

[36]  M. Braga,et al.  Exploratory Data Analysis , 2018, Encyclopedia of Social Network Analysis and Mining. 2nd Ed..

[37]  A. Madansky Identification of Outliers , 1988 .

[38]  W. R. Buckland,et al.  Outliers in Statistical Data , 1979 .

[39]  P. Bartlett,et al.  Probabilities for SV Machines , 2000 .

[40]  Hans-Peter Kriegel,et al.  Outlier Detection in Axis-Parallel Subspaces of High Dimensional Data , 2009, PAKDD.

[41]  Katrien van Driessen,et al.  A Fast Algorithm for the Minimum Covariance Determinant Estimator , 1999, Technometrics.

[42]  Nathalie Japkowicz,et al.  The class imbalance problem: A systematic study , 2002, Intell. Data Anal..

[43]  Jian Tang,et al.  Enhancing Effectiveness of Outlier Detections for Low Density Patterns , 2002, PAKDD.

[44]  Elke Achtert,et al.  Visual Evaluation of Outlier Detection Models , 2010, DASFAA.

[45]  A. H. Murphy,et al.  Reliability of Subjective Probability Forecasts of Precipitation and Temperature , 1977 .

[46]  Pedro M. Domingos MetaCost: a general method for making classifiers cost-sensitive , 1999, KDD '99.

[47]  Gary M. Weiss Mining with rarity: a unifying framework , 2004, SKDD.

[48]  G. Brier VERIFICATION OF FORECASTS EXPRESSED IN TERMS OF PROBABILITY , 1950 .

[49]  Jing Gao,et al.  Converting Output Scores from Outlier Detection Algorithms into Probability Estimates , 2006, Sixth International Conference on Data Mining (ICDM'06).

[50]  Stephen D. Bay,et al.  Mining distance-based outliers in near linear time with randomization and a simple pruning rule , 2003, KDD '03.

[51]  Sanjay Chawla,et al.  Finding Local Anomalies in Very High Dimensional Space , 2010, 2010 IEEE International Conference on Data Mining.

[52]  Stephen E. Fienberg,et al.  The Comparison and Evaluation of Forecasters. , 1983 .

[53]  Raymond T. Ng,et al.  Algorithms for Mining Distance-Based Outliers in Large Datasets , 1998, VLDB.

[54]  Osmar R. Zaïane,et al.  An Efficient Reference-Based Approach to Outlier Detection in Large Datasets , 2006, Sixth International Conference on Data Mining (ICDM'06).

[55]  Prabhakar Raghavan,et al.  A Linear Method for Deviation Detection in Large Databases , 1996, KDD.

[56]  A. H. Murphy,et al.  Scoring rules and the evaluation of probabilities , 1996 .

[57]  Taeho Jo,et al.  Class imbalances versus small disjuncts , 2004, SKDD.