Active learning and subspace clustering for anomaly detection

Anomaly detection is a highly valuable application in the analysis of today's very large datasets. Insurance companies, banks, and many manufacturing industries need systems that help people detect anomalies in their daily data. Anomalies generally make up a very small fraction of the data, so detecting them is not an easy task. The actual source of an anomaly usually lies in specific values on a selective subset of the dataset's dimensions. Furthermore, many anomalies are of no real interest to the user, because the interestingness of an anomaly is judged subjectively. In this paper we propose a new semi-supervised algorithm that actively learns to detect relevant anomalies by interacting with an expert user to obtain semantic information about the user's preferences. Our approach is based on three main steps. First, a Bayesian network identifies an initial set of candidate anomalies. Next, a subspace clustering technique identifies relevant subsets of dimensions. Finally, a probabilistic active learning scheme, based on properties of the Dirichlet distribution, uses feedback from the expert user to search efficiently for relevant anomalies. Our results on synthetic and real datasets indicate that, with noisy data and anomalies that present regular patterns, our approach correctly identifies relevant anomalies.
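
The abstract does not spell out the active learning scheme, so the following is only a minimal sketch of how Dirichlet-based feedback could steer the search, not the paper's actual method. It assumes that the candidate anomalies have already been grouped into clusters (standing in for the output of the first two steps), that each cluster keeps Dirichlet pseudo-counts over the two outcomes "relevant" and "irrelevant", and that the next query is chosen by posterior sampling (Thompson sampling). The names `sample_dirichlet`, `active_search`, `clusters`, `oracle`, and `budget` are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def active_search(clusters, oracle, budget, seed=0):
    """Query an expert about candidate anomalies, one at a time.

    clusters: dict mapping a cluster id to a list of candidate ids
              (assumed to come from an earlier clustering step).
    oracle:   callable(candidate) -> bool, the expert's relevance answer.
    budget:   maximum number of expert queries.

    Returns (relevant, alpha): the candidates the expert marked relevant,
    and the final Dirichlet pseudo-counts per cluster.
    """
    rng = random.Random(seed)
    # Pseudo-counts [relevant, irrelevant] per cluster, uniform prior (assumption).
    alpha = {c: [1.0, 1.0] for c in clusters}
    # Copy the candidate lists so the caller's data is not modified.
    pool = {c: list(items) for c, items in clusters.items()}
    relevant = []

    for _ in range(budget):
        open_clusters = [c for c in pool if pool[c]]
        if not open_clusters:
            break  # every candidate has been labeled
        # Posterior sampling: draw a relevance probability for each cluster
        # and query the cluster whose draw is highest.
        chosen = max(
            open_clusters,
            key=lambda c: sample_dirichlet(alpha[c], rng)[0],
        )
        candidate = pool[chosen].pop()
        if oracle(candidate):
            alpha[chosen][0] += 1.0  # expert feedback: relevant
            relevant.append(candidate)
        else:
            alpha[chosen][1] += 1.0  # expert feedback: irrelevant
    return relevant, alpha
```

Calling `active_search({"A": [1, 2, 3], "B": [10, 11, 12]}, oracle=lambda x: x >= 10, budget=4)` would spend four expert queries; as feedback accumulates, clusters whose candidates were judged relevant tend to be sampled more often. The uniform prior and the two-outcome Dirichlet are illustrative choices only.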
