Paper Information - ATD: Anomalous Topic Discovery in High Dimensional Discrete Data

ATD: Anomalous Topic Discovery in High Dimensional Discrete Data

We propose an algorithm for detecting patterns exhibited by anomalous clusters in high dimensional discrete data. Unlike most anomaly detection (AD) methods, which detect individual anomalies, our proposed method detects groups (clusters) of anomalies, i.e., sets of points that collectively exhibit abnormal patterns. In many applications, this can lead to a better understanding of the nature of the atypical behavior and help identify the sources of the anomalies. Moreover, we consider the case where the atypical patterns are exhibited on only a small (salient) subset of the very high dimensional feature space. Individual AD techniques, and techniques that detect anomalies using all the features, typically fail to detect such anomalies, but our method can detect such instances collectively, discover the anomalous patterns they share, and identify the subsets of salient features. In this paper, we focus on detecting anomalous topics in a batch of text documents, developing our algorithm based on topic models. Our experiments show that the method accurately detects anomalous topics, and the salient features (words) under each such topic, in a synthetic data set and two real-world text corpora, and that it achieves better performance than both standard group AD and individual AD techniques. All code required to reproduce our experiments is available from https://github.com/hsoleimani/ATD.

David J. Miller | Hossein Soleimani
