Exploratory Analysis of Graph Data by Leveraging Domain Knowledge

Given the soaring amount of data being generated daily, graph mining tasks are becoming increasingly challenging, leading to tremendous demand for summarization techniques. Feature selection is a representative approach that simplifies a dataset by choosing the features that are relevant to a specific task, such as classification, prediction, or anomaly detection. Although it can be viewed as a way to summarize a graph in terms of a few features, it is not well-defined for exploratory analysis, and it operates on a set of observations jointly rather than conditionally (i.e., it selects features from many graphs, rather than for one input graph conditioned on other graphs). In this work, we introduce EAGLE (Exploratory Analysis of Graphs with domain knowLEdge), a novel method that creates interpretable, feature-based, and domain-specific graph summaries in a fully automatic way. That is, the same graph in different domains (e.g., social science and neuroscience) will be described via different EAGLE summaries, which automatically leverage the domain knowledge and expectations. We propose an optimization formulation that seeks an interpretable summary consisting of the most representative features for the input graph, such that the summary is diverse, concise, domain-specific, and efficient. Extensive experiments on synthetic and real-world datasets with up to ~1M edges and ~400 features demonstrate the effectiveness and efficiency of EAGLE and its benefits over existing methods. We also show how our method can be applied to various graph mining tasks, such as classification and exploratory analysis.
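To make the idea of conditional, domain-specific feature selection concrete, the following is a minimal sketch and not EAGLE's actual optimization formulation, which the abstract does not spell out. It assumes each graph is already reduced to one scalar value per feature, and it uses two stand-ins chosen purely for illustration: an absolute z-score against the domain graphs as the "domain-specific" term, and Pearson correlation with already selected features as the "diversity" term. All function and feature names (`domain_surprise`, `summarize`, `avg_degree`, and so on) are hypothetical.

```python
from statistics import mean, pstdev


def domain_surprise(input_value, domain_values):
    """Absolute z-score of the input graph's value relative to the domain graphs.

    Assumption: a z-score stands in for 'deviation from domain expectations'.
    """
    mu = mean(domain_values)
    sigma = pstdev(domain_values) or 1e-9  # guard against zero spread
    return abs(input_value - mu) / sigma


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def summarize(input_features, domain_features, k=2):
    """Greedily pick up to k features for the input graph.

    input_features:  {feature name: value for the input graph}
    domain_features: {feature name: list of values, one per domain graph}
    Each step picks the feature with the highest surprise, discounted by its
    strongest correlation with any feature already chosen (illustrative heuristic).
    """
    chosen = []
    candidates = set(input_features)
    while candidates and len(chosen) < k:

        def score(f):
            surprise = domain_surprise(input_features[f], domain_features[f])
            redundancy = max(
                (abs(pearson(domain_features[f], domain_features[g])) for g in chosen),
                default=0.0,
            )
            return surprise * (1.0 - redundancy)

        # sorted() makes tie-breaking deterministic
        best = max(sorted(candidates), key=score)
        chosen.append(best)
        candidates.remove(best)
    return chosen
```

In this sketch, a feature that is highly unusual for the input graph but nearly duplicates an already selected feature is skipped in favor of a less extreme but complementary one. The cap `k` plays the role of conciseness.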
