Paper Information - Parallel Mining of Outliers in Large Database

Parallel Mining of Outliers in Large Database

Data mining is a new, important and fast-growing database application. Outlier (exception) detection is one kind of data mining, which can be applied in a variety of areas such as monitoring credit card fraud and criminal activities in electronic commerce. With the ever-increasing size and number of attributes (dimensions) of databases, previously proposed detection methods for two dimensions are no longer applicable. The time complexity of the Nested-Loop (NL) algorithm (Knorr and Ng, in Proc. 24th VLDB, 1998) is linear in the dimensionality but quadratic in the dataset size, which induces an unacceptable cost for large datasets. A more efficient version (ENL) and its parallel version (PENL) are introduced. In theory, the performance improvement of PENL is linear in the number of processors, as shown in a performance comparison between ENL and PENL using the Bulk Synchronous Parallel (BSP) model. The improvement is further verified by experiments on an IBM 9076 SP2 parallel computer system. The results show that mining outliers on a low-cost cluster of workstations interconnected by a commodity communication network is a very good choice.
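To make the quadratic cost mentioned in the abstract concrete, the following is a minimal in-memory sketch of a naive nested-loop search for distance-based outliers. It is not the block-oriented NL algorithm of Knorr and Ng, nor ENL or PENL, whose details are not given here. It assumes the DB(p, D) definition (an object is an outlier if at least a fraction p of the objects lie farther than D from it) and assumes that an object is not counted as its own neighbor. The function name `nested_loop_outliers` and its parameters are hypothetical.

```python
import math


def nested_loop_outliers(points, p, d):
    """Return indices of DB(p, d)-outliers among `points` (illustrative sketch).

    Assumption: a point is an outlier when at most n * (1 - p) *other*
    points lie within distance d of it.
    """
    n = len(points)
    max_neighbors = n * (1.0 - p)  # neighbor count an outlier may not exceed
    outliers = []
    for i, a in enumerate(points):
        neighbors = 0
        is_outlier = True
        # Inner loop over all other points: this is the source of the
        # quadratic cost in the dataset size. Each distance computation
        # is linear in the number of dimensions.
        for j, b in enumerate(points):
            if i == j:
                continue
            if math.dist(a, b) <= d:
                neighbors += 1
                if neighbors > max_neighbors:
                    # Too many close neighbors: stop scanning early.
                    is_outlier = False
                    break
        if is_outlier:
            outliers.append(i)
    return outliers


if __name__ == "__main__":
    sample = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
    print(nested_loop_outliers(sample, p=0.9, d=2.0))
```

In this sketch the outer loop iterations are independent of one another, which is the kind of structure that lends itself to partitioning across processors; how PENL actually distributes the work is not specified in the abstract.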

David Wai-Lok Cheung | Edward Hung

[1] Hans-Peter Kriegel, et al. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, 1996, KDD.

[2] Vic Barnett, et al. Outliers in Statistical Data, 1980.

[3] Jiawei Han, et al. Knowledge Discovery in Databases: An Attribute-Oriented Approach, 1992, VLDB.

[4] Raymond T. Ng, et al. A unified approach for mining outliers, 1997, CASCON.

[5] E. Knorr. On Digital Money and Card Technologies, 1997.

[6] Jiawei Han, et al. Efficient and Effective Clustering Methods for Spatial Data Mining, 1994, VLDB.

[7] Douglas M. Hawkins. Identification of Outliers, 1980, Monographs on Applied Probability and Statistics.

[8] Hans-Peter Kriegel, et al. LOF: identifying density-based local outliers, 2000, SIGMOD '00.

[9] R. Ng, et al. Efficient and Effective Clustering Methods for Spatial Data Mining, 1994.

[10] D. W. L. Cheung, et al. Parallel Algorithm for Mining Outliers in Large Database, 1999.

[11] Hans-Jörg Schek, et al. A Quantitative Analysis and Performance Study for Similarity-Search Methods in High-Dimensional Spaces, 1998, VLDB.