Very Large Graphs for Information Extraction (VLG). Summary of First-Year Proof-of-Concept Study

Abstract: In numerous application domains relevant to the Department of Defense and the Intelligence Community, data of interest take the form of entities and the relationships between them, and these data are commonly represented as graphs. Under the Very Large Graphs for Information Extraction effort, a one-year proof-of-concept study, MIT LL developed novel techniques for anomalous subgraph detection, building on tools from the signal processing research literature. This report documents the technical results of this effort. Two datasets, a snapshot of the Thomson Reuters Web of Science database and a stream of web proxy logs, were parsed, and graphs were constructed from the raw data. Motivated by phenomena observed in these datasets, several algorithms were developed to model dynamic graph behavior, including a preferential attachment mechanism with memory, a streaming filter that models a graph as a weighted average of its past connections, and a generalized linear model for graphs in which connection probabilities are determined by additional side information or metadata. A set of metrics was also constructed to facilitate comparison of techniques. The study culminated in a demonstration of the algorithms on the datasets of interest, as well as on simulated data. Performance in terms of detection, estimation, and computational burden was measured according to these metrics. Among the highlights of this demonstration were the detection of emerging coauthor clusters in the Web of Science data, the detection of botnet activity in the web proxy data after 15 minutes (activity that took 10 days to detect using state-of-the-practice techniques), and a demonstration of the core algorithm on a simulated 1-billion-vertex graph using a commodity computing cluster.
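
The streaming filter mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the filter's weights or the detection statistic, so everything below is an assumption made for illustration: an exponentially weighted average with a hypothetical smoothing parameter `alpha`, hypothetical helper names `update_expected` and `residual`, and a toy 4-vertex stream. The idea shown is only that the current graph is compared against a weighted average of past graphs, so connections with no history stand out.

```python
def update_expected(expected, observed, alpha=0.2):
    """Blend the newest adjacency matrix into a running weighted average.

    Assumption: exponential weights; the report's actual filter may differ.
    """
    n = len(observed)
    return [[(1 - alpha) * expected[i][j] + alpha * observed[i][j]
             for j in range(n)] for i in range(n)]


def residual(observed, expected):
    """Difference between the observed graph and its expected value."""
    n = len(observed)
    return [[observed[i][j] - expected[i][j] for j in range(n)]
            for i in range(n)]


# Toy stream: vertices 0 and 1 are connected at every time step.
quiet = [[0, 1, 0, 0],
         [1, 0, 0, 0],
         [0, 0, 0, 0],
         [0, 0, 0, 0]]

# A new edge between vertices 2 and 3 appears in the latest snapshot.
burst = [[0, 1, 0, 0],
         [1, 0, 0, 0],
         [0, 0, 0, 1],
         [0, 0, 1, 0]]

# Build the expected graph from five quiet snapshots.
expected = [[0.0] * 4 for _ in range(4)]
for snapshot in [quiet] * 5:
    expected = update_expected(expected, snapshot)

# The edge with no history leaves a larger residual than the habitual edge.
res = residual(burst, expected)
```

In this toy stream the habitual edge (0, 1) is partly explained by the weighted average, while the new edge (2, 3) is not explained at all, so its residual entry is the largest. A real detector would apply a statistic to the residual matrix; that step is omitted here because the abstract does not specify it.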
