Incorporating Expert Feedback into Active Anomaly Discovery

Unsupervised anomaly detection algorithms search for outliers and then predict that these outliers are the anomalies. When deployed, however, these algorithms are often criticized for high false positive and false negative rates. One cause of this poor performance is that not all outliers are anomalies and not all anomalies are outliers. In this paper, we describe an Active Anomaly Discovery (AAD) method that incorporates expert feedback to adjust the anomaly detector so that the outliers it discovers align more closely with the expert user's semantic understanding of the anomalies. The AAD approach is designed to operate in an interactive data exploration loop. In each iteration of this loop, our algorithm first selects a data instance to present to the expert as a potential anomaly, and the expert then labels the instance as an anomaly or as a nominal data point. Our algorithm updates its internal model with the labeled instance, and the loop continues until a budget of B queries is spent. The goal of our approach is to maximize the total number of true anomalies among the B instances presented to the expert. We show that, compared to other state-of-the-art algorithms, AAD is consistently one of the best performers.
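The interactive loop described in the abstract might be sketched as follows. This is a minimal illustration only: the linear scoring model, the learning rate, and the weight-update rule are assumptions chosen for readability, not the paper's actual AAD optimization; `expert_label` stands in for the human analyst.

```python
import numpy as np

def aad_loop(X, expert_label, budget):
    """Illustrative sketch of an active anomaly discovery loop.

    X            : (n, d) array of data instances
    expert_label : callable idx -> True if the instance is a true anomaly
                   (stands in for the human expert)
    budget       : number of queries B
    """
    n, d = X.shape
    w = np.ones(d) / np.sqrt(d)      # simple linear scoring weights (illustrative)
    queried = set()
    anomalies_found = 0

    for _ in range(budget):
        scores = X @ w
        # Select the highest-scoring instance not yet shown to the expert.
        order = np.argsort(-scores)
        idx = next(i for i in order if i not in queried)
        queried.add(idx)

        # Incorporate the expert's label: nudge the weights toward confirmed
        # anomalies and away from nominal points (a crude update rule).
        y = 1.0 if expert_label(idx) else -1.0
        if y > 0:
            anomalies_found += 1
        w += 0.1 * y * X[idx]
        w /= np.linalg.norm(w)       # keep the weight vector normalized

    return anomalies_found
```

On a toy dataset where the true anomalies also score highly under the initial model, the loop spends most of its budget on true anomalies; the paper's contribution is making this hold when outlier scores and semantic anomalies initially disagree.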
