Querying Web-Scale Information Networks Through Bounding Matching Scores

Web-scale information networks containing billions of entities are now common. Querying these networks can be modeled as a subgraph matching problem. Because information networks are inherently incomplete and noisy, it is important to discover answers that match a query exactly as well as answers that are only similar to it. Existing graph matching algorithms usually rely on graph indices to speed up query processing. For web-scale information networks, building such indices may be infeasible because of the computation and the memory and storage they require. In this paper, we propose an efficient algorithm that finds the best k answers to a given query without precomputing graph indices. The quality of an answer is measured by a matching score that is computed online. To speed up query processing, we propose a novel technique for bounding the matching scores during the computation. Using these bounds, we can efficiently prune low-quality answers without evaluating all possible answers. The bounding technique can be implemented in a distributed environment, which allows our approach to answer queries on web-scale information networks efficiently. We demonstrate the effectiveness and efficiency of our approach through a series of experiments on real-world information networks. The results show that our bounding technique can reduce the running time by up to two orders of magnitude compared with an approach that does not use bounds.
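
The general idea of pruning with score bounds can be illustrated with a small sketch. It is not the algorithm from the paper. It assumes, purely for illustration, that a matching score is a sum of per-component similarities, each lying in [0, max_part], so that a partially evaluated candidate has an easy upper bound. The names top_k_bounded, candidates, and max_part are hypothetical.

```python
import heapq
from typing import List, Sequence, Tuple


def top_k_bounded(
    candidates: Sequence[Tuple[str, Sequence[float]]],
    k: int,
    max_part: float = 1.0,
) -> Tuple[List[Tuple[float, str]], int]:
    """Return the k best (score, name) pairs and the number of parts evaluated.

    Assumption (illustrative only): a candidate's score is the sum of its
    parts, and every part lies in [0, max_part].
    """
    heap: List[Tuple[float, str]] = []  # min-heap holding the current top k
    evaluated = 0
    for name, parts in candidates:
        partial = 0.0
        pruned = False
        for i, part in enumerate(parts):
            partial += part
            evaluated += 1
            # Best score this candidate could still reach.
            upper = partial + max_part * (len(parts) - i - 1)
            # Once k answers are known, drop any candidate whose upper
            # bound cannot beat the current k-th best score.
            if len(heap) == k and upper <= heap[0][0]:
                pruned = True
                break
        if pruned:
            continue
        if len(heap) < k:
            heapq.heappush(heap, (partial, name))
        elif partial > heap[0][0]:
            heapq.heapreplace(heap, (partial, name))
    return sorted(heap, reverse=True), evaluated


if __name__ == "__main__":
    data = [
        ("a", [0.9, 0.8, 0.9]),
        ("b", [0.1, 0.2, 0.9]),
        ("c", [0.95, 0.9, 0.9]),
        ("d", [0.0, 0.0, 1.0]),
    ]
    best, cost = top_k_bounded(data, k=2)
    print([name for _, name in best], cost)
```

In this toy input, candidate "d" is discarded after its first part is read, because even a perfect score on the remaining parts could not reach the current second-best answer. Tighter bounds, or visiting promising candidates first, would prune more.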
