On-Line Unsupervised Outlier Detection Using Finite Mixtures with Discounting Learning Algorithms

Outlier detection is a fundamental problem in data mining, with applications such as fraud detection, network intrusion detection, and network monitoring. SmartSifter is an outlier detection engine that addresses this problem from the viewpoint of statistical learning theory. This paper provides a theoretical basis for SmartSifter and empirically demonstrates its effectiveness. SmartSifter detects outliers on-line by learning, without supervision, a probabilistic model of the information source in the form of a finite mixture model. Each time a datum is input, SmartSifter uses an on-line discounting learning algorithm to update the model. The datum is then given a score based on the learned model, with a high score indicating that the datum is likely to be a statistical outlier. The novel features of SmartSifter are: (1) it adapts to non-stationary sources of data; (2) its score has a clear statistical/information-theoretic meaning; (3) it is computationally inexpensive; and (4) it can handle both categorical and continuous variables. In an experimental application to network intrusion detection, the data to which SmartSifter assigned high scores corresponded to attacks, and the computational cost was low. A further experimental application identified a number of meaningful rare cases in actual health insurance pathology data from Australia's Health Insurance Commission.
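The score-then-learn loop described above can be illustrated with a minimal Python sketch. It is not the SmartSifter implementation. It assumes a one-dimensional Gaussian mixture, uses the logarithmic loss under the model held before the update as the score, and applies an exponentially discounted EM-style update so that older data gradually lose influence. The class name `DiscountingGaussianMixture`, the discounting rate `r`, the smoothing constant `alpha`, and all default values are hypothetical choices made for illustration.

```python
import math
import random


class DiscountingGaussianMixture:
    """Illustrative 1-D Gaussian mixture learned on-line with forgetting.

    All names and defaults are assumptions for this sketch.
    """

    def __init__(self, means, r=0.05, alpha=1.0, init_var=1.0, min_var=1e-6):
        k = len(means)
        self.r = r              # discounting rate: larger r forgets faster
        self.alpha = alpha      # smoothing of the component responsibilities
        self.min_var = min_var  # floor that keeps variances positive
        self.weights = [1.0 / k] * k
        self.means = [float(m) for m in means]
        self.variances = [init_var] * k
        # Discounted first and second moments, one pair per component.
        self.s1 = [w * m for w, m in zip(self.weights, self.means)]
        self.s2 = [w * (init_var + m * m)
                   for w, m in zip(self.weights, self.means)]

    def _weighted_densities(self, x):
        # Mixing weight times Gaussian density, for each component.
        return [
            w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2.0 * math.pi * v)
            for w, m, v in zip(self.weights, self.means, self.variances)
        ]

    def score(self, x):
        # Logarithmic loss: large when the current model finds x improbable.
        p = sum(self._weighted_densities(x))
        return -math.log(max(p, 1e-300))

    def update(self, x):
        dens = self._weighted_densities(x)
        total = max(sum(dens), 1e-300)
        k = len(dens)
        for i in range(k):
            # Smoothed responsibility of component i for x.
            gamma = ((1.0 - self.alpha * self.r) * dens[i] / total
                     + self.alpha * self.r / k)
            # Exponentially discount old statistics, then add the new datum.
            self.weights[i] = (1.0 - self.r) * self.weights[i] + self.r * gamma
            self.s1[i] = (1.0 - self.r) * self.s1[i] + self.r * gamma * x
            self.s2[i] = (1.0 - self.r) * self.s2[i] + self.r * gamma * x * x
            self.means[i] = self.s1[i] / self.weights[i]
            var = self.s2[i] / self.weights[i] - self.means[i] ** 2
            self.variances[i] = max(var, self.min_var)

    def score_then_update(self, x):
        # Score against the model learned so far, then learn from x.
        s = self.score(x)
        self.update(x)
        return s


# Synthetic stream alternating between two clusters (made-up data).
rng = random.Random(0)
model = DiscountingGaussianMixture(means=[0.0, 10.0])
for t in range(200):
    center = 0.0 if t % 2 == 0 else 10.0
    model.score_then_update(rng.gauss(center, 1.0))

normal_score = model.score(0.1)    # close to one of the clusters
outlier_score = model.score(50.0)  # far from both clusters
print(normal_score, outlier_score)
```

In this sketch a smaller `r` makes the model forget more slowly, which suits a stable source, while a larger `r` tracks a drifting source at the price of noisier estimates. Each update costs time proportional to the number of mixture components, independent of how much data has been seen.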
