Spectral clustering in the weighted stochastic block model

This paper is concerned with the statistical analysis of a real-valued symmetric data matrix. We assume a weighted stochastic block model: the matrix indices, taken to represent nodes, can be partitioned into communities so that all entries corresponding to a given community pair are replicates of the same random variable. Extending results previously known only for unweighted graphs, we provide a limit theorem showing that the point cloud obtained from spectrally embedding the data matrix follows a Gaussian mixture model in which each community is represented by an elliptical component. We can therefore formally evaluate how well the communities separate under different data transformations, for example, whether it is productive to "take logs". We find that performance is invariant to affine transformations of the entries, but this expected and desirable feature hinges on two choices: selecting the eigenvectors adaptively, according to eigenvalue magnitude, and using Gaussian clustering. We present a network anomaly detection problem with cyber-security data in which the matrix of log p-values, as opposed to p-values, has both theoretical and empirical advantages.
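The embedding step described above can be illustrated with a minimal sketch. It is not the paper's implementation. The Gaussian entry distribution, the block means, the function names `simulate_weighted_sbm` and `spectral_embedding`, and the scaling of eigenvectors by the square root of the absolute eigenvalue (a common convention) are all assumptions made for illustration. The example uses a block structure with a large negative eigenvalue, so that selecting eigenvectors by magnitude rather than by signed value matters.

```python
import numpy as np

def simulate_weighted_sbm(labels, means, sd, rng):
    """Symmetric matrix whose (i, j) entry is Gaussian with mean means[z_i, z_j].

    Gaussian weights are an illustrative assumption; the model in the text
    allows any distribution per community pair.
    """
    n = len(labels)
    noise = rng.normal(scale=sd, size=(n, n))
    upper = np.triu(means[np.ix_(labels, labels)] + noise)
    # Mirror the strict upper triangle so that the matrix is symmetric.
    return upper + np.triu(upper, 1).T

def spectral_embedding(A, d):
    """Embed nodes using the d eigenvectors with largest |eigenvalue|."""
    vals, vecs = np.linalg.eigh(A)
    top = np.argsort(-np.abs(vals))[:d]  # select by magnitude, not by sign
    # Scale each eigenvector by sqrt(|eigenvalue|) (assumed convention).
    return vecs[:, top] * np.sqrt(np.abs(vals[top])), vals[top]

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 100)
# Hypothetical block means: between-community entries are larger than
# within-community ones, which yields one large negative eigenvalue.
means = np.array([[1.0, 3.0], [3.0, 1.0]])
A = simulate_weighted_sbm(labels, means, sd=1.0, rng=rng)
X, selected = spectral_embedding(A, d=2)
```

Each row of `X` is one node's embedded point. A Gaussian mixture model could then be fitted to these rows (for instance with scikit-learn's `GaussianMixture`) to recover the communities. Selecting the two largest signed eigenvalues instead would discard the negative one, which in this example carries the community signal.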
