A Novel Isolation-Based Outlier Detection Method

Outlier detection is one of the most important tasks in data analysis. It refers to the process of recognizing unusual characteristics in data, which can provide useful insights into how the data behave. In this paper, we propose an isolation-based outlier detection method called Entropy-based Greedy Isolation Tree (EGiTree). Unlike other tree-like detection methods, our method exploits a half-baked isolation tree, constructed via three entropy-based heuristics, to identify outliers. Specifically, the heuristics guide the selection of the attribute and its split value when constructing the tree. The outlierness score of each data point is then estimated from the total partition cost of the isolation node in the tree, together with the path length and the complexity of the partition. Experimental results on public real-world datasets show that our approach outperforms distance-based, density-based, and subspace-based approaches as well as state-of-the-art isolation-based approaches.
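
To make the idea of an entropy-guided split concrete, the following is a minimal sketch of one possible heuristic. It is not one of the three EGiTree heuristics, which the abstract does not specify. It is an illustrative assumption: the attribute whose sorted values have the lowest "gap entropy" (one gap dominating all others) is chosen, and the split value is placed in the middle of that widest gap. The names `gap_entropy` and `choose_split` are hypothetical.

```python
import math


def gap_entropy(values):
    """Shannon entropy of the normalized gaps between sorted values.

    A low value means one gap dominates, i.e. some points sit far from
    the rest along this attribute. (Illustrative heuristic, an assumption.)
    """
    ordered = sorted(values)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    total = sum(gaps)
    if total == 0:
        return float("inf")  # constant attribute: nothing to split on
    return -sum((g / total) * math.log(g / total) for g in gaps if g > 0)


def choose_split(points):
    """Greedily pick (attribute index, split value) for one tree node.

    Assumes at least two points, all with the same number of attributes.
    """
    n_attrs = len(points[0])
    columns = [[p[j] for p in points] for j in range(n_attrs)]
    # Attribute selection: lowest gap entropy wins.
    attr = min(range(n_attrs), key=lambda j: gap_entropy(columns[j]))
    # Split value: midpoint of the widest gap on that attribute.
    ordered = sorted(columns[attr])
    lo, hi = max(zip(ordered, ordered[1:]), key=lambda pair: pair[1] - pair[0])
    return attr, (lo + hi) / 2


# Four clustered points and one point far away along attribute 0.
points = [[1.0, 2.0], [1.1, 2.1], [0.9, 1.9], [1.0, 2.0], [8.0, 2.0]]
attr, value = choose_split(points)
```

On this toy data the sketch selects attribute 0 and places the split between 1.1 and 8.0, which isolates the distant point in a single partition. A full method would apply such a choice recursively and convert quantities like path length and partition cost into an outlierness score.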
