Paper information - Capabilities of outlier detection schemes in large datasets, framework and methodologies

Outlier detection is concerned with discovering exceptional behaviors of objects. Its theoretical principles and practical implementation lay the foundation for important applications such as credit card fraud detection, discovery of criminal behavior in e-commerce, and computer intrusion detection. In this paper, we first present a unified model for several existing outlier detection schemes and propose a compatibility theory, which establishes a framework for describing the capabilities of various outlier formulation schemes in terms of matching users' intuitions. Under this framework, we show that the density-based scheme is more powerful than the distance-based scheme when a dataset contains patterns with diverse characteristics. The density-based scheme, however, is less effective when the patterns are of comparable density to the outliers. We then introduce a connectivity-based scheme that improves the effectiveness of the density-based scheme when a pattern itself is of similar density to an outlier. We compare the density-based and connectivity-based schemes in terms of their strengths and weaknesses, and demonstrate applications with different features in which each is more effective than the other. Finally, the connectivity-based and density-based schemes are comparatively evaluated on both real-life and synthetic datasets in terms of recall, precision, rank power, and implementation-free metrics.
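
The abstract names recall, precision, and rank power as evaluation metrics but does not define them. The following is a minimal sketch of how such metrics might be computed over the top-m objects returned by a detector. The function name `evaluate_top_m` and its arguments are hypothetical, and the rank power formula used here, n(n+1) / (2 * sum of ranks of the true outliers found), is an assumption based on a commonly cited definition rather than something stated in the abstract.

```python
def evaluate_top_m(ranked_ids, true_outliers, m):
    """Score the top-m objects of a ranking against a known set of outliers.

    ranked_ids    -- object ids ordered from most to least outlying (hypothetical input)
    true_outliers -- set of ids labelled as genuine outliers
    m             -- how many top-ranked objects the detector returns
    """
    top = ranked_ids[:m]
    # 1-based ranks at which genuine outliers appear among the returned objects.
    hit_ranks = [rank for rank, obj in enumerate(top, start=1) if obj in true_outliers]
    n = len(hit_ranks)

    # Precision: fraction of returned objects that are genuine outliers.
    precision = n / len(top) if top else 0.0
    # Recall: fraction of genuine outliers that were returned.
    recall = n / len(true_outliers) if true_outliers else 0.0
    # Rank power (assumed formula): equals 1.0 when every genuine outlier found
    # sits at the very top of the ranking, and shrinks as they appear lower down.
    rank_power = n * (n + 1) / (2 * sum(hit_ranks)) if n else 0.0
    return precision, recall, rank_power


if __name__ == "__main__":
    ranking = ["a", "b", "c", "d", "e"]
    labelled = {"a", "c", "z"}
    print(evaluate_top_m(ranking, labelled, 4))
```

Under this assumed definition, two detectors with identical precision and recall can still differ in rank power, since it rewards placing the genuine outliers earlier in the returned list.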