Fast outlier detection for very large log data

Density-based outlier detection identifies an outlying observation with reference to the density of the surrounding space. In spite of the several advantages of density-based outlier detections, its computational complexity remains one of the major barriers to its application. The purpose of the present study is to reduce the computation time of LOF (Local Outlier Factor), a density-based outlier detection algorithm. The proposed method incorporates kd-tree indexing and an approximated k-nearest neighbors search algorithm (ANN). Theoretical analysis on the approximation of nearest neighbor search was conducted. A set of experiments was conducted to examine the performance of the proposed algorithm. The results show that the method can effectively detect local outliers in a reduced computation time.

[1]  Aleksandar Lazarevic,et al.  Incremental Local Outlier Detection for Data Streams , 2007, 2007 IEEE Symposium on Computational Intelligence and Data Mining.

[2]  Philip S. Yu,et al.  Outlier detection for high dimensional data , 2001, SIGMOD '01.

[3]  Jon Louis Bentley,et al.  K-d trees for semidynamic point sets , 1990, SCG '90.

[4]  Christos Faloutsos,et al.  FastMap: a fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets , 1995, SIGMOD '95.

[5]  William H. Press,et al.  Numerical recipes in C , 2002 .

[6]  Sudipto Guha,et al.  CURE: an efficient clustering algorithm for large databases , 1998, SIGMOD '98.

[7]  Kazuo J. Ezawa,et al.  Constructing Bayesian Networks to Predict Uncollectible Telecommunications Accounts , 1996, IEEE Expert.

[8]  Sunil Arya,et al.  An optimal algorithm for approximate nearest neighbor searching fixed dimensions , 1998, JACM.

[9]  Hans-Peter Kriegel,et al.  A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise , 1996, KDD.

[10]  Vic Barnett,et al.  Outliers in Statistical Data , 1980 .

[11]  Douglas M. Hawkins Identification of Outliers , 1980, Monographs on Applied Probability and Statistics.

[12]  Jaideep Srivastava,et al.  A Comparative Study of Anomaly Detection Schemes in Network Intrusion Detection , 2003, SDM.

[13]  Chao-Hsien Chu,et al.  A Review of Data Mining-Based Financial Fraud Detection Research , 2007, 2007 International Conference on Wireless Communications, Networking and Mobile Computing.

[14]  Rajeev Rastogi,et al.  Efficient algorithms for mining outliers from large data sets , 2000, SIGMOD 2000.

[15]  J. MacQueen Some methods for classification and analysis of multivariate observations , 1967 .

[16]  Chun Zhang,et al.  Storing and querying ordered XML using a relational database system , 2002, SIGMOD '02.

[17]  Dimitrios Gunopulos,et al.  Automatic subspace clustering of high dimensional data for data mining applications , 1998, SIGMOD '98.

[18]  Marcus A. Maloof,et al.  Machine Learning and Data Mining for Computer Security: Methods and Applications (Advanced Information and Knowledge Processing) , 2005 .

[19]  Tian Zhang,et al.  BIRCH: an efficient data clustering method for very large databases , 1996, SIGMOD '96.

[20]  D. Hinkley On the ratio of two correlated normal random variables , 1969 .

[21]  Marcus A. Maloof MACHINE LEARNING AND DATA MINING FOR COMPUTER SECURITY: METHODS AND APPLICATIONS , 2011 .

[22]  Mohammad Zulkernine,et al.  Anomaly Based Network Intrusion Detection with Unsupervised Outlier Detection , 2006, 2006 IEEE International Conference on Communications.

[23]  Keinosuke Fukunaga,et al.  Introduction to statistical pattern recognition (2nd ed.) , 1990 .

[24]  Sungzoon Cho,et al.  Keystroke dynamics-based authentication for mobile devices , 2009, Comput. Secur..

[25]  Ramakant Nevatia,et al.  Event Detection and Analysis from Video Streams , 2001, IEEE Trans. Pattern Anal. Mach. Intell..

[26]  Keinosuke Fukunaga,et al.  Introduction to Statistical Pattern Recognition , 1972 .

[27]  Hans-Peter Kriegel,et al.  LOF: identifying density-based local outliers , 2000, SIGMOD 2000.

[28]  Jon Louis Bentley,et al.  An Algorithm for Finding Best Matches in Logarithmic Expected Time , 1977, TOMS.

[29]  Anastasios Kementsietsidis,et al.  Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data , 2001, SIGMOD 2011.