Non-derivable itemsets for fast outlier detection in large high-dimensional categorical data

Detecting outliers in a dataset is an important data mining task with many applications, such as detection of credit card fraud or network intrusions. Traditional methods assume numerical data and compute pair-wise distances among points. Recently, outlier detection methods were proposed for categorical and mixed-attribute data using the concept of Frequent Itemsets (FIs). These methods face challenges when dealing with large high-dimensional data, where the number of generated FIs can be extremely large. To address this issue, we propose several outlier detection schemes inspired by a well-known condensed representation of FIs, Non-Derivable Itemsets (NDIs). Specifically, we contrast a method based on frequent NDIs, FNDI-OD, and a method based on the negative border of NDIs, NBNDI-OD, with their previously proposed FI-based counterparts. We also explore outlier detection based on Non-Almost Derivable Itemsets (NADIs), which approximate the NDIs in the data given a δ parameter. Our proposed methods compute an anomaly score for each data point from a collection of sets that is far smaller than the FI collection. Experiments on real-life data show that, as expected, methods based on NDIs and NADIs offer substantial advantages in speed and scalability over FI-based outlier detection methods. More significantly, NDI-based methods exhibit similar or better detection accuracy compared to the FI-based methods, which supports our claim that the NDI representation is especially well-suited for the task of detecting outliers. At the same time, the NDI approximation scheme, NADIs, is shown to exhibit accuracy similar to that of the NDI-based methods for various δ values, along with further runtime gains. Finally, we offer an in-depth discussion and experimentation regarding the trade-offs of the proposed algorithms and the choice of parameter values.
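The idea of scoring points from a condensed itemset collection can be illustrated with a minimal Python sketch. It is not the paper's FNDI-OD implementation: the abstract does not give the scoring formula, so the sketch assumes a frequent-pattern style score (the average relative support of the itemsets a record contains, where a lower score means more anomalous). Derivability is tested with the standard inclusion-exclusion bounds on an itemset's support computed from its proper subsets. All function names, the toy data, and the thresholds are hypothetical.

```python
from itertools import combinations

def support(itemset, rows):
    """Number of rows that contain every item of `itemset`."""
    return sum(1 for row in rows if itemset <= row)

def frequent_itemsets(rows, min_count, max_len):
    """Brute-force enumeration of frequent itemsets (toy data only)."""
    items = sorted(set().union(*rows))
    found = {}
    for k in range(1, max_len + 1):
        for combo in combinations(items, k):
            candidate = frozenset(combo)
            count = support(candidate, rows)
            if count >= min_count:
                found[candidate] = count
    return found

def support_bounds(itemset, rows):
    """Inclusion-exclusion bounds on support(itemset) from its proper subsets.

    For each X that is a proper subset of I, the signed sum over all J with
    X <= J < I is an upper bound when |I - X| is odd, a lower bound when even.
    """
    lower, upper = 0, len(rows)
    members = sorted(itemset)
    for k in range(len(members)):
        for x in combinations(members, k):
            x = frozenset(x)
            rest = sorted(itemset - x)
            total = 0
            for j in range(len(rest)):
                for y in combinations(rest, j):
                    subset = x | frozenset(y)
                    sign = (-1) ** (len(itemset) - len(subset) + 1)
                    total += sign * support(subset, rows)
            if len(rest) % 2 == 1:
                upper = min(upper, total)
            else:
                lower = max(lower, total)
    return lower, upper

def non_derivable(itemsets, rows):
    """Keep itemsets whose support is NOT fixed by the bounds (lower != upper)."""
    kept = {}
    for itemset, count in itemsets.items():
        lower, upper = support_bounds(itemset, rows)
        if lower != upper:
            kept[itemset] = count
    return kept

def anomaly_scores(rows, itemsets):
    """Assumed score: mean relative support of the itemsets found in each row."""
    n, m = len(rows), max(len(itemsets), 1)
    return [
        sum(count / n for itemset, count in itemsets.items() if itemset <= row) / m
        for row in rows
    ]

# Hypothetical categorical data: four identical records and one unusual record.
rows = [frozenset({"A=x", "B=y", "C=k"})] * 4 + [frozenset({"A=z", "B=w", "C=k"})]
fis = frequent_itemsets(rows, min_count=2, max_len=3)
ndis = non_derivable(fis, rows)
fi_scores = anomaly_scores(rows, fis)
ndi_scores = anomaly_scores(rows, ndis)
print(len(fis), len(ndis), ndi_scores)
```

On this toy data the non-derivable collection is smaller than the full frequent collection, yet both scorings rank the same record lowest, which mirrors the trade-off the abstract describes. A real implementation would mine NDIs level-wise rather than filtering a fully enumerated FI collection.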
