Feature bagging for outlier detection

Outlier detection has recently become an important problem in many industrial and financial applications. This paper proposes a novel feature bagging approach for detecting outliers in very large, high-dimensional, and noisy databases. The approach combines results from multiple outlier detection algorithms, each applied to a different set of features. Every outlier detection algorithm uses a small subset of features randomly selected from the original feature set. As a result, each outlier detector identifies different outliers and assigns every data record an outlier score corresponding to its probability of being an outlier. The outlier scores computed by the individual detectors are then combined to find higher-quality outliers. Experiments on several synthetic and real-life data sets show that the proposed methods for combining outputs from multiple outlier detection algorithms provide non-trivial improvements over the base algorithm.
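The scheme described above can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: it assumes a k-nearest-neighbor distance as the base outlier detector, random subspaces of between half and all-but-one of the features, and a simple average of min-max-normalized scores as the combination step (the paper itself studies more elaborate combination functions).

```python
import numpy as np

def knn_outlier_scores(X, k=5):
    # Base detector (assumed for illustration): distance to the k-th
    # nearest neighbor; larger distances indicate more outlying records.
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)          # ignore self-distances
    return np.sort(dists, axis=1)[:, k - 1]  # k-th NN distance per record

def feature_bagging(X, n_rounds=10, k=5, rng=None):
    # Feature bagging: run the base detector on randomly chosen feature
    # subsets and combine the per-round outlier scores. Assumes d >= 2.
    rng = np.random.default_rng(rng)
    n, d = X.shape
    total = np.zeros(n)
    for _ in range(n_rounds):
        # Random subspace of size between floor(d/2) and d-1 features.
        size = rng.integers(d // 2, d)
        feats = rng.choice(d, size=size, replace=False)
        s = knn_outlier_scores(X[:, feats], k=k)
        # Min-max normalization before combining (an assumption here,
        # so that rounds with different subspace sizes are comparable).
        s = (s - s.min()) / (s.max() - s.min() + 1e-12)
        total += s
    return total / n_rounds  # averaged outlier score per record
```

A record planted far from a Gaussian cloud receives the highest combined score under this sketch, since it stands out in every random subspace.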
