Unsupervised Ensemble Learning for Mining Top-n Outliers

Outlier detection is an important problem in knowledge discovery over large datasets. Rather than deciding whether a single object is an outlier, we study top-n outlier detection, i.e. finding the n most outstanding outliers. We further consider the problem of combining the top-n outlier lists produced by different individual detection methods. We propose a general framework for ensemble learning in top-n outlier detection based on rank aggregation techniques. To accommodate the different scales and characteristics of outlier scores from different detection methods, we propose two approaches: a score-based aggregation approach that normalizes the outlier scores, and an order-based aggregation approach built on the distance-based Mallows model. Extensive experiments on several real datasets demonstrate that, compared with state-of-the-art methods, the proposed approaches consistently deliver stable and effective performance across datasets, with good scalability.
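
As a rough illustration of the two aggregation styles named above, the following minimal sketch combines the outputs of several detectors into one top-n list. It rests on assumptions not stated in the abstract: z-score standardization stands in for the paper's normalization method, and a simple Borda count stands in for the order-based approach (it is not the distance-based Mallows model). The function names `top_n_by_score_aggregation` and `top_n_by_borda` and the toy scores are hypothetical.

```python
from statistics import mean, pstdev


def zscore(scores):
    """Standardize one detector's scores (assumed normalization, not the paper's)."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def top_n_by_score_aggregation(score_lists, n):
    """Score-based: normalize each detector's scores, average them, keep the top n."""
    normalized = [zscore(scores) for scores in score_lists]
    combined = [mean(column) for column in zip(*normalized)]
    ranked = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return ranked[:n]


def top_n_by_borda(score_lists, n):
    """Order-based stand-in: Borda count over each detector's ranking.

    Only the order of the scores is used, so differing score scales do not matter.
    """
    m = len(score_lists[0])
    points = [0] * m
    for scores in score_lists:
        order = sorted(range(m), key=lambda i: scores[i], reverse=True)
        for position, obj in enumerate(order):
            points[obj] += m - 1 - position
    ranked = sorted(range(m), key=lambda i: points[i], reverse=True)
    return ranked[:n]


# Toy example: two detectors scoring the same five objects on different scales.
detector_a = [0.1, 0.2, 5.0, 0.3, 2.0]
detector_b = [10, 12, 90, 11, 60]
top_score = top_n_by_score_aggregation([detector_a, detector_b], n=2)
top_order = top_n_by_borda([detector_a, detector_b], n=2)
```

In this toy input both detectors rank objects 2 and 4 highest, so both aggregations return the same top-2 list; they can differ when detectors disagree or when one detector's score distribution is heavily skewed.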
