Positive and Unlabeled Learning for Anomaly Detection with Multi-features

Anomaly detection is of great interest in big data applications, and both supervised and unsupervised learning have been applied to it. However, it remains a challenging problem because: (1) for supervised learning, it is difficult to acquire training data for anomalous samples; and (2) for unsupervised learning, performance may be unsatisfactory due to the lack of labeled training data. To address these limitations, we propose a hybrid solution that uses both normal (positive) data and unlabeled data (which may be positive or negative) for semi-supervised anomaly detection. In particular, we introduce a new framework for anomaly detection based on Positive and Unlabeled (PU) learning with multiple features. We extend previous PU learning methods to (1) better handle the class-imbalance problem that is typical of anomaly detection, and (2) exploit multiple features for anomaly detection. An iterative algorithm is proposed to learn the anomaly classifier incrementally from the labeled normal data and the unlabeled data. The proposed method is evaluated on three benchmark datasets and one synthetic dataset. Experimental results show that it outperforms existing methods under different class priors and different proportions of labeled positive data.
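
To make the iterative idea concrete, the following is a minimal sketch of a generic two-step PU learning loop for anomaly detection, assuming scikit-learn and NumPy; it is illustrative only and not the authors' exact algorithm, and names such as `pu_anomaly_classifier` and `reliable_frac` are hypothetical.

```python
# Illustrative two-step PU learning loop (not the paper's exact method).
# Positives = labeled normal data; unlabeled pool may contain anomalies.
import numpy as np
from sklearn.linear_model import LogisticRegression

def pu_anomaly_classifier(X_pos, X_unlabeled, n_iters=5, reliable_frac=0.1):
    """Iteratively select 'reliable negatives' (likely anomalies) from the
    unlabeled pool and retrain a positive-vs-negative classifier."""
    # Step 1: treat all unlabeled points as provisional negatives.
    X = np.vstack([X_pos, X_unlabeled])
    y = np.concatenate([np.ones(len(X_pos)), np.zeros(len(X_unlabeled))])
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X, y)

    for _ in range(n_iters):
        # Step 2: score the unlabeled pool and keep the points the current
        # model is most confident are negative (lowest P(positive)).
        p_pos = clf.predict_proba(X_unlabeled)[:, 1]
        k = max(1, int(reliable_frac * len(X_unlabeled)))
        reliable_neg = X_unlabeled[np.argsort(p_pos)[:k]]

        # Retrain on labeled positives plus reliable negatives only,
        # reweighting classes to mitigate the imbalance.
        X_iter = np.vstack([X_pos, reliable_neg])
        y_iter = np.concatenate([np.ones(len(X_pos)),
                                 np.zeros(len(reliable_neg))])
        clf = LogisticRegression(max_iter=1000, class_weight="balanced")
        clf.fit(X_iter, y_iter)

    return clf
```

In this sketch, the unlabeled points that the current model scores as least likely to be normal are treated as pseudo-anomalies and used to retrain the classifier, loosely mirroring the incremental use of labeled normal and unlabeled data described above.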
