Paper information - Non-derivable itemsets for fast outlier detection in large high-dimensional categorical data

Detecting outliers in a dataset is an important data mining task with many applications, such as detection of credit card fraud or network intrusions. Traditional methods assume numerical data and compute pairwise distances among points. Recently, outlier detection methods were proposed for categorical and mixed-attribute data using the concept of Frequent Itemsets (FIs). These methods face challenges when dealing with large high-dimensional data, where the number of generated FIs can be extremely large. To address this issue, we propose several outlier detection schemes inspired by the well-known condensed representation of FIs, Non-Derivable Itemsets (NDIs). Specifically, we contrast a method based on frequent NDIs, FNDI-OD, and a method based on the negative border of NDIs, NBNDI-OD, with their previously proposed FI-based counterparts. We also explore outlier detection based on Non-Almost Derivable Itemsets (NADIs), which approximate the NDIs in the data given a δ parameter. Our proposed methods use a far smaller collection of sets than the FI collection in order to compute an anomaly score for each data point. Experiments on real-life data show that, as expected, methods based on NDIs and NADIs offer substantial advantages in speed and scalability over FI-based outlier detection methods. Notably, NDI-based methods exhibit similar or better detection accuracy compared to the FI-based methods, which supports our claim that the NDI representation is especially well-suited for the task of detecting outliers. At the same time, the NDI approximation scheme, NADIs, is shown to achieve accuracy similar to that of the NDI-based method across various δ values, with further runtime gains. Finally, we offer an in-depth discussion and experiments regarding the trade-offs of the proposed algorithms and the choice of parameter values.
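
The general idea of scoring categorical records with frequent itemsets can be sketched as follows. This is a minimal illustration, not the scheme from the paper: the abstract does not give the scoring formula, so the score used here (the average support of the frequent itemsets a record contains, with low values treated as anomalous) is an assumption. The function names, the `max_len` cap, and the toy records are hypothetical, and the brute-force enumeration stands in for a real itemset miner.

```python
from itertools import combinations


def frequent_itemsets(records, min_support, max_len=3):
    """Brute-force enumeration of itemsets with at most max_len items.

    Returns a dict mapping each frequent itemset (a sorted tuple of items)
    to its relative support. Assumption: toy-scale data only.
    """
    n = len(records)
    counts = {}
    for record in records:
        items = sorted(record)
        for k in range(1, max_len + 1):
            for subset in combinations(items, k):
                counts[subset] = counts.get(subset, 0) + 1
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


def anomaly_scores(records, min_support, max_len=3):
    """Score each record by the average support of the frequent itemsets
    it contains. Assumed convention: a LOW score marks a likely outlier."""
    fis = frequent_itemsets(records, min_support, max_len)
    scores = []
    for record in records:
        covered = sum(
            sup for itemset, sup in fis.items() if set(itemset) <= record
        )
        scores.append(covered / len(fis) if fis else 0.0)
    return scores


# Hypothetical categorical records encoded as "attribute=value" items.
normal = frozenset({"proto=tcp", "port=80", "flag=ok"})
odd = frozenset({"proto=udp", "port=9999", "flag=err"})
records = [normal, normal, normal, normal, odd]

scores = anomaly_scores(records, min_support=0.5)
most_anomalous = min(range(len(records)), key=lambda i: scores[i])
```

The cost of this kind of scoring grows with the number of itemsets kept, which is the motivation the abstract gives for replacing the full FI collection with a condensed one such as NDIs.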
