Network Neighborhood Analysis For Detecting Anomalies in Time Series of Graphs

Suchismita Goswami, PhD
George Mason University, 2019
Dissertation Director: Dr. Igor Griva

Terabytes of unstructured electronic data are generated every day from Twitter networks, scientific collaborations, organizational e-mails, telephone calls and websites. Excessive communication in communication networks, particularly in organizational e-mail networks, continues to be a major problem. In some cases, for example the Enron e-mails, frequent contact or excessive activity on interconnected networks leads to fraudulent activities. Analyzing excessive activity in a social network is thus important for understanding the behavior of individuals in subregions of a network. In a social network, anomalies can occur as a result of abrupt changes in the interactions among a group of individuals. Methodologies are therefore needed to analyze and detect excessive communications in dynamic social networks. The motivation of this research is to investigate excessive activities and make inferences in dynamic subnetworks. In this dissertation, I implement new methodologies and techniques to detect excessive communications, topic activities and the associated influential individuals in the dynamic networks obtained from organizational e-mails, using scan statistics, multivariate time series models and probabilistic topic modeling. Three major contributions are presented here for detecting anomalies in dynamic networks obtained from organizational e-mails. First, I develop a different approach that invokes the log-likelihood ratio as a scan statistic with overlapping and variable window sizes to rank the clusters, and devise a two-step scan process to detect excessive activity in an organization's e-mail network as a case study.
The initial step is to determine the structural stability of the e-mail count time series, perform differencing and de-seasonalizing operations to make the time series stationary, and obtain a primary cluster using a Poisson process model. I then extract neighborhood ego subnetworks around the observed primary cluster to obtain a more refined cluster by invoking the graph invariant betweenness as the locality statistic under the binomial model. I demonstrate that the two-step scan statistics algorithm is more scalable in detecting excessive activity in large dynamic social networks. Second, I implement for the first time multivariate time series models to detect a group of influential people and their dynamic relationships that are associated with excessive communications, which cannot be assessed using scan statistics models. For the multivariate modeling, a vector autoregressive (VAR) model is applied to time series of subgraphs in e-mail networks constructed using the graph edit distance, since the nodes or vertices of the subgraphs are interrelated. Anomalies or excessive communications are identified as residuals of the fitted time series models that exceed three standard deviations. Finally, I devise a new method of detecting excessive topic activities in the unstructured text of e-mail contents by combining probabilistic topic modeling and scan statistics algorithms. I first investigate the major topics discussed using probabilistic modeling, such as latent Dirichlet allocation (LDA), and then employ scan statistics to assess the excessive topic activity that has the largest log-likelihood ratio in the neighborhood of the primary cluster. These analyses provide new ways of detecting excessive communications and topic flow through the influential vertices in a dynamic network, and can be extended to other dynamic social networks to critically investigate excessive activities.
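The residual-threshold rule described above can be illustrated with a minimal sketch. It is not the dissertation's implementation: a univariate AR(1) model fitted by least squares stands in for the VAR model, and the function names `fit_ar1` and `flag_anomalies` and the parameter `k` are hypothetical. Only the decision rule (flag residuals larger than three standard deviations) is taken from the text.

```python
import statistics


def fit_ar1(series):
    """Least-squares fit of x[t] = c + phi * x[t-1] + e[t]; returns (c, phi).

    Assumption: a univariate AR(1) is used here as a simplified stand-in
    for the VAR model named in the text.
    """
    x_prev = series[:-1]
    x_next = series[1:]
    mean_prev = statistics.fmean(x_prev)
    mean_next = statistics.fmean(x_next)
    sxx = sum((a - mean_prev) ** 2 for a in x_prev)
    if sxx == 0:
        raise ValueError("series is constant; AR(1) slope is undefined")
    sxy = sum((a - mean_prev) * (b - mean_next) for a, b in zip(x_prev, x_next))
    phi = sxy / sxx
    c = mean_next - phi * mean_prev
    return c, phi


def flag_anomalies(series, k=3.0):
    """Return time indices whose one-step residual exceeds k standard deviations."""
    c, phi = fit_ar1(series)
    residuals = [b - (c + phi * a) for a, b in zip(series[:-1], series[1:])]
    sd = statistics.pstdev(residuals)
    # residuals[i] is the prediction error for time index i + 1
    return [i + 1 for i, r in enumerate(residuals) if abs(r) > k * sd]


if __name__ == "__main__":
    # Hypothetical weekly-patterned counts with one injected burst at index 50.
    counts = [10.0 + (i % 7) for i in range(100)]
    counts[50] += 40.0
    print(flag_anomalies(counts))
```

Because the standard deviation is computed from the same residuals that contain the burst, a very large anomaly inflates the threshold; a robust scale estimate would be one possible alternative.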
Chapter 1: Introduction

Anomalies, which are clusters of events or excessive or unusual activities, are common in science and technology. Some of the most commonly used methods for anomaly detection in data mining are density-based techniques such as k-nearest neighbor [KNT00] and local outlier factor [BKNS00], one-class support vector machines [SPST+01], neural networks [HHWB00], cluster analysis-based outlier detection [HXD03] and ensemble techniques [LK05]. These methods for detecting excessive activity are mostly descriptive in nature and are not effective for making statistical inferences. In other words, they do not determine whether the observed clusters of events are statistically significant [Kul79]. A very powerful statistical inference methodology, developed to detect the region of unusual activity in a random process and to infer the statistical significance of the observed excessive activity, is scan statistics [Kul79]. It is also termed moving window analysis in the engineering literature and has mostly been used in spatial statistics and image analysis. A scan statistic is defined as the maximum or minimum of local statistics estimated from local regions of the data. Let $\{X_t, t \ge 0\}$ be a Poisson process with rate $\lambda$, where $X_t$ is the number of points (events) occurring in the interval $[0, t)$. For any subinterval of $[0, T)$ of length $w$, let $Y_t$ be the number of points (events) in the window $[t, t+w)$, so that $Y_t = X_{t+w} - X_t$. The one-dimensional continuous scan statistic, $S_w$, is written as [GB99]: $S_w$ = max 0
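The window count for $[t, t+w)$ and its maximum over window positions can be illustrated with a short sketch. Since the formula above is cut off in the source, the code assumes the usual convention that the scan statistic is the largest count over all windows of length $w$ inside $[0, T)$. The function names, the Monte Carlo significance step and its `(r + 1) / (n + 1)` p-value are illustrative additions, not taken from the text.

```python
import bisect
import random


def simulate_poisson_process(rate, horizon, rng):
    """Event times of a homogeneous Poisson process with the given rate on [0, horizon)."""
    times = []
    t = rng.expovariate(rate)
    while t < horizon:
        times.append(t)
        t += rng.expovariate(rate)
    return times


def scan_statistic(event_times, w):
    """Largest number of events in any half-open window [t, t + w).

    Any window can be slid right until it starts at its first event without
    losing events, so only windows starting at an event time are checked.
    """
    times = sorted(event_times)
    best = 0
    for i, start in enumerate(times):
        # index of the first event at or after start + w (excluded from the window)
        end = bisect.bisect_left(times, start + w)
        best = max(best, end - i)
    return best


def monte_carlo_p_value(observed, rate, horizon, w, n_sim=999, seed=0):
    """Share of simulated null processes whose scan statistic reaches `observed`."""
    rng = random.Random(seed)
    exceed = sum(
        scan_statistic(simulate_poisson_process(rate, horizon, rng), w) >= observed
        for _ in range(n_sim)
    )
    return (exceed + 1) / (n_sim + 1)


if __name__ == "__main__":
    events = [0.1, 0.2, 0.25, 0.9, 2.0]  # hypothetical event times
    s_w = scan_statistic(events, w=0.5)
    print(s_w, monte_carlo_p_value(s_w, rate=1.0, horizon=10.0, w=0.5))
```

A small p-value would suggest that a cluster of this size is unlikely under a constant-rate Poisson process, which is the kind of inference the descriptive methods listed above do not provide.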

[1]  Weizhong Zhao,et al.  A heuristic approach to determine an appropriate number of topics in topic modeling , 2015, BMC Bioinformatics.

[2]  Bernhard Pfaff,et al.  Analysis of Integrated and Cointegrated Time Series with R , 2005 .

[3]  Robert H. Shumway,et al.  Time Series Analysis and Its Applications (Springer Texts in Statistics) , 2005 .

[4]  C. Granger Some properties of time series data and their use in econometric model specification , 1981 .

[5]  David F. Hendry,et al.  Co-Integration, Error Correction, and the Econometric Analysis of Non-Stationary Data , 1993, Advanced texts in econometrics.

[6]  R. Hanneman Introduction to Social Network Methods , 2001 .

[7]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[8]  L. Freeman Centrality in social networks conceptual clarification , 1978 .

[9]  Compositional Time Series : A First Approach , 2007 .

[10]  P. Perron,et al.  Lag Length Selection and the Construction of Unit Root Tests with Good Size and Power , 2001 .

[11]  Gerard Salton,et al.  Automatic text decomposition using text segments and text themes , 1996, HYPERTEXT '96.

[12]  D. Rubin,et al.  Maximum likelihood from incomplete data via the EM - algorithm plus discussions on the paper , 1977 .

[13]  Thomas L. Griffiths,et al.  Probabilistic Topic Models , 2007 .

[14]  Thomas Hofmann,et al.  Probabilistic latent semantic indexing , 1999, SIGIR '99.

[15]  Michael D. Porter,et al.  Network neighborhood analysis , 2010, 2010 IEEE International Conference on Intelligence and Security Informatics.

[16]  D. Stroock,et al.  Probability Theory: An Analytic View. , 1995 .

[17]  Hans-Peter Kriegel,et al.  LOF: identifying density-based local outliers , 2000, SIGMOD '00.

[18]  Earl Rennison,et al.  Galaxy of news: an approach to visualizing and understanding expansive news landscapes , 1994, UIST '94.

[19]  Murat Caner Testik,et al.  CUSUM Monitoring of First-Order Integer-Valued Autoregressive Processes of Poisson Counts , 2009 .

[20]  P. Cowpertwait,et al.  Introductory Time Series with R , 2009 .

[21]  X. Shao,et al.  Testing for Change Points in Time Series , 2010 .

[22]  David J. Marchette,et al.  Statistical inference on attributed random graphs: Fusion of graph features and content , 2010, Comput. Stat. Data Anal..

[23]  Joseph Glaz CLUSTERING OF EVENTS IN A STOCHASTIC PROCESS , 1981 .

[24]  Joseph Naus,et al.  Approximations for Distributions of Scan Statistics , 1982 .

[25]  A. I. McLeod,et al.  Distribution of the Residual Autocorrelations in Multivariate Arma Time Series Models , 1981 .

[26]  Thorsten Joachims,et al.  Text Categorization with Support Vector Machines: Learning with Many Relevant Features , 1998, ECML.

[27]  S. Karlin,et al.  Chance and statistical significance in protein and DNA sequence analysis. , 1992, Science.

[28]  Michaël Genin,et al.  Discrete scan statistics and the generalized likelihood ratio test , 2015 .

[29]  Edward J. Wegman,et al.  A dynamic graph model for representing streaming text documents , 2008 .

[30]  T. Caliński,et al.  A dendrite method for cluster analysis , 1974 .

[31]  Hongxing He,et al.  Outlier Detection Using Replicator Neural Networks , 2002, DaWaK.

[32]  G. W. Milligan,et al.  An examination of procedures for determining the number of clusters in a data set , 1985 .

[33]  D. Marchette Scan statistics on graphs , 2012 .

[34]  Carey E. Priebe,et al.  A Spatial Scan Statistic for Stochastic Scan Partitions , 1997 .

[35]  M. Kulldorff,et al.  A Scan Statistic for Continuous Data Based on the Normal Probability Model , 2022, International Journal of Health Geographics.

[36]  S. Wallenstein,et al.  Probabilities for the Size of Largest Clusters and Smallest Intervals , 1974 .

[37]  Yi-Ting Chen,et al.  On the Robustness of Ljung-Box and McLeod-Li Q Tests: A Simulation Study , 2002 .

[38]  M. Kraetzl,et al.  Detection of abnormal change in dynamic networks , 1999, 1999 Information, Decision and Control. Data and Information Fusion Symposium, Signal Processing and Communications Symposium and Decision and Control Symposium. Proceedings (Cat. No.99EX251).

[39]  Sholom M. Weiss,et al.  Towards language independent automated learning of text categorization models , 1994, SIGIR '94.

[40]  Tetiana Stadnytska,et al.  Deterministic or Stochastic Trend Decision on the Basis of the Augmented Dickey-Fuller Test , 2010 .

[41]  Ross Sparks,et al.  Early warning CUSUM plans for surveillance of negative binomial daily disease counts , 2010 .

[42]  David J. Marchette,et al.  Scan Statistics on Enron Graphs , 2005, Comput. Math. Organ. Theory.

[43]  G. Casella,et al.  Explaining the Gibbs Sampler , 1992 .

[44]  Michael McGill,et al.  Introduction to Modern Information Retrieval , 1983 .

[45]  Roger A. Sugden,et al.  Multiple Imputation for Nonresponse in Surveys , 1988 .

[46]  Brandon Pincombe,et al.  Anomaly Detection in Time Series of Graphs using ARMA Processes , 2007 .

[47]  Mark Newman,et al.  Networks: An Introduction , 2010 .

[48]  Helmut Lütkepohl,et al.  New Introduction to Multiple Time Series Analysis , 2007 .

[49]  Trevor Hastie,et al.  An Introduction to Statistical Learning , 2013, Springer Texts in Statistics.

[50]  H. Bunke,et al.  Median graphs and anomalous change detection in communication networks , 2002, Final Program and Abstracts on Information, Decision and Control.

[51]  Jussi Tolvi,et al.  Modeling Financial Time Series with S‐Plus , 2003 .

[52]  Raymond T. Ng,et al.  Distance-based outliers: algorithms and applications , 2000, The VLDB Journal.

[53]  Robert J. Adler,et al.  The Supremum of a Particular Gaussian Field , 1984 .

[54]  Valdis E. Krebs,et al.  Mapping Networks of Terrorist Cells , 2001 .

[55]  Jeffrey L. Solka,et al.  Text Data Mining: Theory and Methods , 2008, ArXiv.

[56]  Anthony C. Davison,et al.  Bootstrap Methods and Their Application , 1998 .

[57]  E. Wegman Hyperdimensional Data Analysis Using Parallel Coordinates , 1990 .

[58]  Chris Brooks,et al.  Introductory Econometrics for Finance , 2002 .

[59]  Alexander G. Tartakovsky,et al.  Statistical methods for network surveillance , 2018 .

[60]  David L. Waltz,et al.  Classifying news stories using memory based reasoning , 1992, SIGIR '92.

[61]  B. G. Quinn,et al.  The determination of the order of an autoregression , 1979 .

[62]  C. Granger,et al.  Co-integration and error correction: representation, estimation and testing , 1987 .

[63]  Julie Beth Lovins,et al.  Development of a stemming algorithm , 1968, Mech. Transl. Comput. Linguistics.

[64]  Christian H. Weiß,et al.  Controlling correlated processes of Poisson counts , 2007, Qual. Reliab. Eng. Int..

[65]  Joseph Glaz,et al.  Expected waiting time for the visual response , 1979, Biological Cybernetics.

[66]  Tim Robertson,et al.  On Estimating a Density which is Measurable with Respect to a $\sigma$-Lattice , 1967 .

[67]  Narayanaswamy Balakrishnan,et al.  Scan Statistics and Applications , 2012 .

[68]  G. Schwert,et al.  Tests for Unit Roots: a Monte Carlo Investigation , 1988 .

[69]  Marianne Frisén,et al.  Statistical Surveillance. Optimality and Methods , 2003 .

[70]  Kenneth F. Wallis,et al.  TIME SERIES ANALYSIS OF BOUNDED ECONOMIC VARIABLES , 1987 .

[71]  A. Braverman,et al.  Application of clustering techniques to study environmental characteristics of microbialite-bearing aquatic systems , 2015 .

[72]  Xiuzhen Zhang,et al.  Anomaly detection in online social networks , 2014, Soc. Networks.

[73]  M. Kendall Theoretical Statistics , 1956, Nature.

[74]  Vipin Kumar,et al.  Feature bagging for outlier detection , 2005, KDD '05.

[75]  Bernhard Schölkopf,et al.  Estimating the Support of a High-Dimensional Distribution , 2001, Neural Computation.

[76]  Willa W. Chen,et al.  A GENERALIZED PORTMANTEAU GOODNESS-OF-FIT TEST FOR TIME SERIES MODELS , 2004, Econometric Theory.

[77]  Angel R. Martinez,et al.  Computational Statistics Handbook with MATLAB , 2001 .

[78]  Ruey S. Tsay,et al.  Multivariate Time Series Analysis: With R and Financial Applications , 2013 .

[79]  M. Kulldorff,et al.  Spatial Scan Statistics Adjusted for Multiple Clusters , 2010 .

[80]  Bernhard Pfaff,et al.  VAR, SVAR and SVEC Models: Implementation Within R Package vars , 2008 .

[81]  Christian H. Weiß,et al.  Modelling time series of counts with overdispersion , 2009, Stat. Methods Appl..

[82]  Serena Ng,et al.  Unit Root Tests in ARMA Models with Data-Dependent Methods for the Selection of the Truncation Lag , 1995 .

[83]  Helmut Lütkepohl,et al.  Applied Time Series Econometrics , 2004 .

[84]  P. Sham,et al.  A note on the calculation of empirical P values from Monte Carlo procedures. , 2002, American journal of human genetics.

[85]  Zengyou He,et al.  Discovering cluster-based local outliers , 2003, Pattern Recognit. Lett..

[86]  J. Naus The Distribution of the Size of the Maximum Cluster of Points on a Line , 1965 .

[87]  M. Kulldorff A spatial scan statistic , 1997 .

[88]  S Wallenstein,et al.  An approximation for the distribution of the scan statistic. , 1987, Statistics in medicine.

[89]  Esam Mahdi Portmanteau test statistics for seasonal serial correlation in time series models , 2016, SpringerPlus.

[90]  Peter L. Brooks,et al.  Visualizing data , 1997 .

[91]  Edward J. Wegman,et al.  Maximum Likelihood Estimation of a Unimodal Density Function , 1970 .

[92]  H. Akaike,et al.  Information Theory and an Extension of the Maximum Likelihood Principle , 1973 .

[93]  G. Schwarz Estimating the Dimension of a Model , 1978 .

[94]  Robert Tibshirani,et al.  The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition , 2001, Springer Series in Statistics.

[95]  B. Brodsky,et al.  Nonparametric Methods in Change Point Problems , 1993 .