Very Large Graphs for Information Extraction (VLG): Detection and Inference in the Presence of Uncertainty

Abstract: In numerous application domains relevant to the Department of Defense and the Intelligence Community, data of interest take the form of entities and the relationships between them, and these data are commonly represented as graphs. Under the Very Large Graphs for Information Extraction effort, a one-year proof-of-concept study, MIT LL developed novel techniques for anomalous subgraph detection, building on tools in the signal processing research literature. This report documents the technical results of this effort. Two datasets, a snapshot of the Thomson Reuters Web of Science database and a stream of web proxy logs, were parsed, and graphs were constructed from the raw data. Based on the phenomena observed in these datasets, several algorithms were developed to model the dynamic graph behavior, including a preferential attachment mechanism with memory, a streaming filter that models a graph as a weighted average of its past connections, and a generalized linear model for graphs in which connection probabilities are determined by additional side information or metadata. A set of metrics was also constructed to facilitate comparison of techniques. The study culminated in a demonstration of the algorithms on the datasets of interest, in addition to simulated data. Performance in terms of detection, estimation, and computational burden was measured according to the metrics. Among the highlights of this demonstration were the detection of emerging coauthor clusters in the Web of Science data, detection of botnet activity in the web proxy data after 15 minutes (compared with 10 days using state-of-the-practice techniques), and demonstration of the core algorithm on a simulated 1-billion-vertex graph using a commodity computing cluster.
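As a rough illustration of the streaming filter idea mentioned in the abstract (a graph modeled as a weighted average of its past connections), the following minimal sketch predicts the current adjacency matrix from earlier snapshots and forms the residual in which an anomalous subgraph might stand out. The function names, the weights, the toy graph sizes, and the embedded clique are all assumptions made for illustration; the abstract does not give the report's actual filter coefficients or detection statistic.

```python
import numpy as np


def weighted_average_prediction(history, weights):
    """Predict an adjacency matrix as a normalized weighted average of past snapshots.

    history: list of (n, n) adjacency matrices, oldest first (hypothetical layout).
    weights: one non-negative weight per snapshot (assumed, not from the report).
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so the prediction stays on the scale of the inputs
    return np.tensordot(w, np.stack(history), axes=1)


def residual(observed, history, weights):
    """Observed adjacency minus its prediction from past snapshots."""
    return observed - weighted_average_prediction(history, weights)


def random_graph(rng, n, p):
    """Symmetric 0/1 adjacency matrix with no self-loops (toy background model)."""
    upper = np.triu(rng.random((n, n)) < p, k=1).astype(float)
    return upper + upper.T


rng = np.random.default_rng(0)
n = 6
history = [random_graph(rng, n, 0.2) for _ in range(4)]
weights = [0.1, 0.2, 0.3, 0.4]  # assumed: more recent snapshots count more

observed = random_graph(rng, n, 0.2)
observed[:3, :3] = 1.0 - np.eye(3)  # embed a small clique as a stand-in anomaly

residual_matrix = residual(observed, history, weights)
print(residual_matrix.round(2))
```

In this toy setting, large positive residual entries concentrated on a few vertices indicate connections that the history does not explain. A detection statistic would then be computed from the residual; that step is left out here because the abstract does not specify it.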
