Generating Artificial Outliers in the Absence of Genuine Ones — A Survey

By definition, outliers are rarely observed in reality, making them difficult to detect or analyse. Artificial outliers approximate such genuine outliers and can, for instance, help with the detection of genuine outliers or with benchmarking outlier-detection algorithms. The literature features different approaches to generate artificial outliers. However, a systematic comparison of these approaches is still missing. This survey describes and compares these approaches. We start by clarifying the terminology in the field, which varies from publication to publication, and we propose a general problem formulation. We frame the field of artificial outliers by describing how generating outliers connects to other research fields such as experimental design and generative models. Along with offering a concise description of each approach, we group the approaches by their general concepts and by how they make use of genuine instances. An extensive experimental study reveals the differences between the generation approaches when they are ultimately used for outlier detection. This survey shows that the existing approaches already cover a wide range of concepts underlying the generation, but also that the field still has potential for further development. Our experimental study confirms the expectation that the quality of the generation approaches varies widely, for example, depending on the data set they are used on. Finally, to guide the choice of the generation approach in a specific context, we propose a general decision process. In summary, this survey compiles, describes, and connects all relevant work regarding the generation of artificial outliers and may serve as a basis to guide further research in the field.
