Anomaly Detection in High-Dimensional Data

HDoutliers is a powerful unsupervised algorithm for detecting anomalies in high-dimensional data, with a strong theoretical foundation. However, it has limitations that significantly degrade its performance under certain circumstances. In this article, we propose an algorithm that addresses these limitations. We define an anomaly as an observation that deviates markedly from the majority of the data and is separated from it by a large distance gap. The anomaly threshold is calculated using an approach based on extreme value theory. Using a range of synthetic and real datasets, we demonstrate the wide applicability and usefulness of our algorithm, which we call the stray algorithm. We also show how feature engineering allows the algorithm to detect anomalies in other data structures. We identify the situations in which the stray algorithm outperforms the HDoutliers algorithm in both accuracy and computation time. The framework is implemented in the open source R package stray.
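As a rough illustration of what "a large distance gap" can mean in practice, the sketch below scores each observation by looking at its k nearest-neighbour distances and taking the distance at which the largest jump occurs. This is only one possible reading, written in Python for convenience; it is not the implementation in the stray R package, the function name `knn_gap_scores`, the choice `k=3`, and the toy data are all assumptions made for the example, and the extreme value theory threshold is not reproduced here.

```python
import math

def knn_gap_scores(points, k=3):
    """Score each point by the neighbour distance at the largest gap
    among its k nearest-neighbour distances (hypothetical helper)."""
    scores = []
    for i, p in enumerate(points):
        # Sorted distances from p to its k nearest neighbours.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)[:k]
        # Successive differences between those distances; the first "gap"
        # is the distance to the nearest neighbour itself.
        gaps = [dists[0]] + [b - a for a, b in zip(dists, dists[1:])]
        largest = max(range(len(gaps)), key=gaps.__getitem__)
        scores.append(dists[largest])
    return scores

# Toy data: a tight 5x5 grid, plus two points close to each other
# but far from the grid.
grid = [(x / 10, y / 10) for x in range(5) for y in range(5)]
pair = [(3.0, 3.0), (3.05, 3.0)]
points = grid + pair

scores = knn_gap_scores(points, k=3)
top_two = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)[:2]
print(top_two)
```

In this toy example the two distant points sit only 0.05 apart, so a score based on the single nearest neighbour would rank them as less unusual than the grid points. Looking for the largest gap among several neighbour distances ranks them highest instead. A threshold would still be needed to turn such scores into anomaly labels.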
