Paper information - Network Neighborhood Analysis For Detecting Anomalies in Time Series of Graphs

NETWORK NEIGHBORHOOD ANALYSIS FOR DETECTING ANOMALIES IN TIME SERIES OF GRAPHS
Suchismita Goswami, PhD
George Mason University, 2019
Dissertation Director: Dr. Igor Griva

Terabytes of unstructured electronic data are generated every day from Twitter networks, scientific collaborations, organizational e-mails, telephone calls and websites. Excessive communication in communication networks, particularly in organizational e-mail networks, continues to be a major problem. In some cases, for example the Enron e-mails, frequent contact or excessive activities on interconnected networks lead to fraudulent activities. Analyzing excessive activity in a social network is thus important for understanding the behavior of individuals in subregions of the network. In a social network, anomalies can occur as a result of abrupt changes in the interactions among a group of individuals. Therefore, methodologies are needed to analyze and detect excessive communications in dynamic social networks. The motivation of this research is to investigate excessive activities and make inferences in dynamic subnetworks. In this dissertation, I implement new methodologies and techniques to detect excessive communications, topic activities and the associated influential individuals in dynamic networks obtained from organizational e-mails, using scan statistics, multivariate time series models and probabilistic topic modeling. Three major contributions are presented for detecting anomalies in dynamic networks obtained from organizational e-mails. First, I develop a different approach that uses the log-likelihood ratio as a scan statistic with overlapping and variable window sizes to rank the clusters, and devise a two-step scan process to detect excessive activities in an organization's e-mail network as a case study.
The initial step is to determine the structural stability of the e-mail count time series, perform differencing and de-seasonalizing operations to make the time series stationary, and obtain a primary cluster using a Poisson process model. I then extract neighborhood ego subnetworks around the observed primary cluster to obtain a more refined cluster by using the graph invariant betweenness as the locality statistic under a binomial model. I demonstrate that the two-step scan statistics algorithm is more scalable in detecting excessive activity in large dynamic social networks. Second, I implement, for the first time, multivariate time series models to detect a group of influential people and their dynamic relationships that are associated with excessive communications, which cannot be assessed using scan statistics models. For the multivariate modeling, a vector autoregressive (VAR) model is applied to time series of subgraphs in e-mail networks constructed using the graph edit distance, since the nodes or vertices of the subgraphs are interrelated. Anomalies, or excessive communications, are identified where the residuals of the fitted time series models exceed a threshold of three standard deviations. Finally, I devise a new method of detecting excessive topic activities in the unstructured text of e-mail contents by combining probabilistic topic modeling and scan statistics algorithms. I first identify the major topics discussed using probabilistic modeling, specifically latent Dirichlet allocation (LDA), and then employ scan statistics to find the excessive topic activity, which is the one with the largest log-likelihood ratio in the neighborhood of the primary cluster. These analyses provide new ways of detecting excessive communications and topic flow through the influential vertices in a dynamic network, and can be extended to other dynamic social networks to critically investigate excessive activities.
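The residual-threshold rule described in the abstract can be illustrated with a small sketch. It assumes a first-order VAR fitted by ordinary least squares with NumPy; the lag order, the function name `var1_residual_anomalies`, and the simulated two-vertex series are illustrative assumptions, not the dissertation's implementation.

```python
import numpy as np

def var1_residual_anomalies(series, k=3.0):
    """Fit a VAR(1) by least squares and flag times whose residual in any
    component exceeds k residual standard deviations (k=3 as in the abstract).
    The lag order 1 is an assumption made for this sketch."""
    y = np.asarray(series, dtype=float)          # shape (T, d)
    responses = y[1:]                            # y_t for t = 1..T-1
    # Design matrix: intercept column plus the lagged observation y_{t-1}
    design = np.hstack([np.ones((len(y) - 1, 1)), y[:-1]])
    coef, *_ = np.linalg.lstsq(design, responses, rcond=None)
    resid = responses - design @ coef
    sd = resid.std(axis=0, ddof=1)               # per-component residual SD
    flagged = (np.abs(resid) > k * sd).any(axis=1)
    # Row i of resid corresponds to time index i + 1 in the input series
    return np.where(flagged)[0] + 1, resid

# Hypothetical stationary series for two vertices with one injected burst
rng = np.random.default_rng(0)
counts = rng.normal(size=(200, 2))
counts[120, 0] += 15.0
anomalous_times, _ = var1_residual_anomalies(counts)
```

Here `anomalous_times` holds the time indices whose residuals cross the threshold. On real e-mail count series, the differencing and de-seasonalizing steps mentioned above would need to be applied before fitting.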
Chapter 1: Introduction

Anomalies, which are clusters of events or excessive or unusual activities, are common in science and technology. Some of the most commonly used methods for anomaly detection in data mining are density-based techniques such as k-nearest neighbor [KNT00] and local outlier factor [BKNS00], one-class support vector machines [SPST+01], neural networks [HHWB00], cluster analysis-based outlier detection [HXD03] and ensemble techniques [LK05]. These methods for detecting excessive activity are mostly descriptive in nature and are not effective for making statistical inferences. In other words, they do not indicate whether the observed clusters of events are statistically significant [Kul79]. A very powerful statistical inference methodology, developed to detect the region of unusual activity in a random process and to infer the statistical significance of the observed excessive activity, is scan statistics [Kul79]. It is also termed moving window analysis in the engineering literature and has mostly been used in spatial statistics and image analysis. A scan statistic is defined as the maximum or minimum of local statistics estimated from local regions of the data. Let $\{X_t, t \ge 0\}$ be a Poisson process with rate $\lambda$, where $X_t$ is the number of points (events) occurring in the interval $[0, t)$. In any subinterval of $[0, T)$ of length $w$, let $Y_t$ be the number of points (events) in the window $[t, t+w)$, such that $Y_t = X_{t+w} - X_t$. The one-dimensional continuous scan statistic, $S_w$, is written as [GB99]: $S_w$ = max 0
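The window count Y_t = X_{t+w} - X_t from the introduction can be sketched in a few lines. The example below takes the scan statistic to be the largest window count over admissible start times, which is an assumption here, and it approximates the continuous scan by evaluating window starts on a grid. The function name `scan_statistic`, the `step` parameter, and the event times are hypothetical.

```python
import numpy as np

def scan_statistic(event_times, T, w, step):
    """Grid approximation of a one-dimensional scan statistic:
    the maximum of Y_t = X_{t+w} - X_t over window starts t in [0, T - w]."""
    times = np.sort(np.asarray(event_times, dtype=float))
    starts = np.arange(0.0, T - w + step / 2, step)
    # Events in [t, t + w): (number of events < t + w) - (number of events < t)
    upper = np.searchsorted(times, starts + w, side="left")
    lower = np.searchsorted(times, starts, side="left")
    counts = upper - lower
    best = int(np.argmax(counts))                # first window attaining the maximum
    return int(counts[best]), float(starts[best])

# Hypothetical event times on [0, 10) with a burst shortly after t = 2
events = [0.50, 2.13, 2.31, 2.42, 2.88, 7.00]
max_count, window_start = scan_statistic(events, T=10.0, w=1.0, step=0.1)
```

A finer `step` tracks the continuous statistic more closely at the cost of more windows. Assessing whether the resulting maximum is statistically significant requires the distributional results discussed in the scan statistics literature and is not covered by this sketch.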

Suchismita Goswami

[1] Weizhong Zhao, et al. A heuristic approach to determine an appropriate number of topics in topic modeling, 2015, BMC Bioinformatics.

[2] Bernhard Pfaff, et al. Analysis of Integrated and Cointegrated Time Series with R, 2005.

[3] Robert H. Shumway, et al. Time Series Analysis and Its Applications (Springer Texts in Statistics), 2005.

[4] C. Granger. Some properties of time series data and their use in econometric model specification, 1981.

[5] David F. Hendry, et al. Co-Integration, Error Correction, and the Econometric Analysis of Non-Stationary Data, 1993, Advanced texts in econometrics.

[6] R. Hanneman. Introduction to Social Network Methods, 2001.

[7] Michael I. Jordan, et al. Latent Dirichlet Allocation, 2001, J. Mach. Learn. Res.

[8] L. Freeman. Centrality in social networks conceptual clarification, 1978.

[9] Compositional Time Series: A First Approach, 2007.

[10] P. Perron, et al. Lag Length Selection and the Construction of Unit Root Tests with Good Size and Power, 2001.

[11] Gerard Salton, et al. Automatic text decomposition using text segments and text themes, 1996, HYPERTEXT '96.

[12] D. Rubin, et al. Maximum likelihood from incomplete data via the EM algorithm plus discussions on the paper, 1977.

[13] Thomas L. Griffiths, et al. Probabilistic Topic Models, 2007.

[14] Thomas Hofmann, et al. Probabilistic latent semantic indexing, 1999, SIGIR '99.

[15] Michael D. Porter, et al. Network neighborhood analysis, 2010, 2010 IEEE International Conference on Intelligence and Security Informatics.

[16] D. Stroock, et al. Probability Theory: An Analytic View, 1995.

[17] Hans-Peter Kriegel, et al. LOF: identifying density-based local outliers, 2000, SIGMOD '00.

[18] Earl Rennison, et al. Galaxy of news: an approach to visualizing and understanding expansive news landscapes, 1994, UIST '94.

[19] Murat Caner Testik, et al. CUSUM Monitoring of First-Order Integer-Valued Autoregressive Processes of Poisson Counts, 2009.

[20] P. Cowpertwait, et al. Introductory Time Series with R, 2009.

[21] X. Shao, et al. Testing for Change Points in Time Series, 2010.

[22] David J. Marchette, et al. Statistical inference on attributed random graphs: Fusion of graph features and content, 2010, Comput. Stat. Data Anal.

[23] Joseph Glaz. Clustering of Events in a Stochastic Process, 1981.

[24] Joseph Naus, et al. Approximations for Distributions of Scan Statistics, 1982.

[25] A. I. McLeod, et al. Distribution of the Residual Autocorrelations in Multivariate ARMA Time Series Models, 1981.

[26] Thorsten Joachims, et al. Text Categorization with Support Vector Machines: Learning with Many Relevant Features, 1998, ECML.

[27] S. Karlin, et al. Chance and statistical significance in protein and DNA sequence analysis, 1992, Science.

[28] Michaël Genin, et al. Discrete scan statistics and the generalized likelihood ratio test, 2015.

[29] Edward J. Wegman, et al. A dynamic graph model for representing streaming text documents, 2008.

[30] T. Caliński, et al. A dendrite method for cluster analysis, 1974.

[31] Hongxing He, et al. Outlier Detection Using Replicator Neural Networks, 2002, DaWaK.

[32] G. W. Milligan, et al. An examination of procedures for determining the number of clusters in a data set, 1985.

[33] D. Marchette. Scan statistics on graphs, 2012.

[34] Carey E. Priebe, et al. A Spatial Scan Statistic for Stochastic Scan Partitions, 1997.

[35] M. Kulldorff, et al. A Scan Statistic for Continuous Data Based on the Normal Probability Model, 2009, International Journal of Health Geographics.

[36] S. Wallenstein, et al. Probabilities for the Size of Largest Clusters and Smallest Intervals, 1974.

[37] Yi-Ting Chen, et al. On the Robustness of Ljung-Box and McLeod-Li Q Tests: A Simulation Study, 2002.

[38] M. Kraetzl, et al. Detection of abnormal change in dynamic networks, 1999, 1999 Information, Decision and Control. Data and Information Fusion Symposium, Signal Processing and Communications Symposium and Decision and Control Symposium. Proceedings (Cat. No.99EX251).

[39] Sholom M. Weiss, et al. Towards language independent automated learning of text categorization models, 1994, SIGIR '94.

[40] Tetiana Stadnytska, et al. Deterministic or Stochastic Trend Decision on the Basis of the Augmented Dickey-Fuller Test, 2010.

[41] Ross Sparks, et al. Early warning CUSUM plans for surveillance of negative binomial daily disease counts, 2010.

[42] David J. Marchette, et al. Scan Statistics on Enron Graphs, 2005, Comput. Math. Organ. Theory.

[43] G. Casella, et al. Explaining the Gibbs Sampler, 1992.

[44] Michael McGill, et al. Introduction to Modern Information Retrieval, 1983.

[45] Roger A. Sugden, et al. Multiple Imputation for Nonresponse in Surveys, 1988.

[46] Brandon Pincombe, et al. Anomaly Detection in Time Series of Graphs using ARMA Processes, 2007.

[47] Mark Newman, et al. Networks: An Introduction, 2010.

[48] Helmut Lütkepohl, et al. New Introduction to Multiple Time Series Analysis, 2007.

[49] Trevor Hastie, et al. An Introduction to Statistical Learning, 2013, Springer Texts in Statistics.

[50] H. Bunke, et al. Median graphs and anomalous change detection in communication networks, 2002, Final Program and Abstracts on Information, Decision and Control.