An Interactive Approach to Outlier Detection

In this paper we describe an interactive approach for finding outliers in big sets of records, such as collected by banks, insurance companies, web shops. The key idea behind our approach is the usage of an easy-to-compute and easy-to-interpret outlier score function. This function is used to identify a set of potential outliers. The outliers, organized in clusters, are then presented to a domain expert, together with some context information, such as characteristics of clusters and distribution of scores. Consequently, they are analyzed, labelled as non-explainable or explainable, and removed from the data. The whole process is iterated several times, until no more interesting outliers can be found.

[1]  Raymond T. Ng,et al.  Algorithms for Mining Distance-Based Outliers in Large Datasets , 1998, VLDB.

[2]  Christos Faloutsos,et al.  LOCI: fast outlier detection using the local correlation integral , 2003, Proceedings 19th International Conference on Data Engineering (Cat. No.03CH37405).

[3]  Deepak Agarwal,et al.  Detecting anomalies in cross-classified streams: a Bayesian approach , 2006, Knowledge and Information Systems.

[4]  Hans-Peter Kriegel,et al.  LOF: identifying density-based local outliers , 2000, SIGMOD '00.

[5]  Graham J. Williams,et al.  On-Line Unsupervised Outlier Detection Using Finite Mixtures with Discounting Learning Algorithms , 2000, KDD '00.

[6]  Kenji Yamanishi,et al.  Discovering outlier filtering rules from unlabeled data: combining a supervised learner with an unsupervised learner , 2001, KDD '01.

[7]  David A. Belsley,et al.  Regression Analysis and its Application: A Data-Oriented Approach.@@@Applied Linear Regression.@@@Regression Diagnostics: Identifying Influential Data and Sources of Collinearity , 1981 .

[8]  Sheldon M. Ross,et al.  Introduction to Probability and Statistics for Engineers and Scientists , 1987 .

[9]  Peter J. Rousseeuw,et al.  Robust Regression and Outlier Detection , 2005, Wiley Series in Probability and Statistics.

[10]  Eleazar Eskin,et al.  Anomaly Detection over Noisy Data using Learned Probability Distributions , 2000, ICML.

[11]  Hiroyuki Kitagawa,et al.  Example-based Outlier Detection with Relevance Feedback , 2004 .

[12]  Donald E. Ramirez,et al.  Bringing Order to Outlier Diagnostics in Regression Models , 2001 .