Semi-supervised Based Training Set Construction for Outlier Detection

Outliers are sparse and few. Because obtaining a training set with enough outliers is costly, existing approaches to outlier detection are seldom supervised. However, given a training set with sufficient outliers, supervised outlier detection performs better than other methods. A traditional training set requires labeling every sample, but here only the outliers need to be labeled; the remaining unlabeled samples can be marked directly as inliers to construct the training set. In most cases, the number of samples that can be labeled is limited, while a large number of unlabeled samples can be obtained easily. Semi-supervised learning methods have a natural advantage in using a small number of labeled samples together with a large number of unlabeled samples to predict the labels of unlabeled instances. Based on this idea, we propose an algorithm, CRLC, that constructs a training set by incorporating semi-supervised outlier detection. Our experiments show that our algorithm achieves better performance than other methods at the same cost.
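
The labeling scheme described above (label only the outliers, mark everything else as an inlier) can be illustrated with a minimal sketch. This is not the CRLC algorithm, whose details are not given here. The function name `build_training_set`, the `oracle` callback standing in for a human labeler, the `budget` parameter, and the median-distance ranking used to choose which samples to query are all assumptions made for illustration.

```python
from statistics import median


def build_training_set(samples, candidate_order, oracle, budget):
    """Query the oracle on at most `budget` candidates, in the given order.

    Samples the oracle confirms as outliers get label 1. Every other sample,
    whether queried or not, is marked as an inlier (label 0).
    """
    labels = [0] * len(samples)
    for idx in candidate_order[:budget]:
        if oracle(samples[idx]):
            labels[idx] = 1
    return labels


# Toy 1-D data; the two extreme values play the role of outliers.
samples = [0.1, 0.2, 9.5, 0.15, -8.0, 0.3]

# Hypothetical candidate ranking: farthest from the median first.
center = median(samples)
candidate_order = sorted(
    range(len(samples)),
    key=lambda i: abs(samples[i] - center),
    reverse=True,
)

# Hypothetical oracle standing in for a human annotator.
labels = build_training_set(
    samples, candidate_order, oracle=lambda x: abs(x) > 5, budget=3
)
print(labels)
```

With a budget of three queries, only the three highest-ranked samples are shown to the oracle; the rest are labeled as inliers without inspection, which is where the cost saving comes from.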
