Top-k Graph Summarization on Hierarchical DAGs

The directed acyclic graph (DAG) is an important model for representing terminologies and their hierarchical relationships, as in the Disease Ontology. Because a large DAG contains a massive number of terms and complex structures, summarizing the whole hierarchy is challenging. In this paper, we study a new problem: finding k representative vertices that summarize a hierarchical DAG. To favor summaries that are both diverse and made up of important vertices, we design a summary score function that captures the vertices' diversity coverage and structure correlation. We prove that the problem is NP-hard. To tackle it efficiently, we propose a greedy algorithm with an approximation guarantee, which iteratively adds the vertex with the largest summary contribution to the answer. To further improve answer quality, we propose a method based on subtree extraction, which is proven to achieve higher-quality answers. In addition, we develop a scalable algorithm, k-PCGS, based on candidate pruning and DAG compression for large-scale hierarchical DAGs. Extensive experiments on large real-world datasets demonstrate both the effectiveness and the efficiency of the proposed algorithms.
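The greedy strategy described above can be illustrated with a minimal sketch. The abstract does not define the summary score, so the code below substitutes a simple stand-in: the number of distinct vertices covered by the selected vertices and their descendants. The function names `descendants` and `greedy_summary`, the adjacency-dictionary representation, and the toy DAG are all assumptions made for illustration. They are not the paper's score function or its k-PCGS algorithm.

```python
from typing import Dict, List, Set


def descendants(dag: Dict[str, List[str]], v: str) -> Set[str]:
    """Return v together with every vertex reachable from v."""
    seen = {v}
    stack = [v]
    while stack:
        u = stack.pop()
        for child in dag.get(u, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def greedy_summary(dag: Dict[str, List[str]], k: int) -> List[str]:
    """Pick up to k vertices, each time taking the largest marginal gain.

    Assumed stand-in score: size of the union of the chosen vertices'
    descendant sets (a coverage objective), not the paper's summary score.
    """
    vertices = set(dag) | {c for children in dag.values() for c in children}
    reach = {v: descendants(dag, v) for v in vertices}
    covered: Set[str] = set()
    chosen: List[str] = []
    for _ in range(min(k, len(vertices))):
        # Sort the candidates so that ties are broken deterministically.
        candidates = sorted(vertices - set(chosen))
        best = max(candidates, key=lambda v: len(reach[v] - covered))
        chosen.append(best)
        covered |= reach[best]
    return chosen


# Hypothetical toy hierarchy: edges point from a parent term to its children.
toy_dag = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "x": ["y"]}
summary = greedy_summary(toy_dag, 2)
```

On this toy input the first pick is `a`, which covers four vertices, and the second is `x`, which adds the two vertices that `a` does not reach. A coverage objective of this kind is monotone and submodular, which is the standard setting in which greedy selection carries an approximation guarantee. Whether the paper's score has the same properties is not stated in the abstract.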
