Subgraph pattern matching over uncertain graphs with identity linkage uncertainty

There is a growing need for methods that can represent and query uncertain graphs. These uncertain graphs are often the result of an information extraction and integration system that attempts to extract an entity graph or a knowledge graph from multiple unstructured sources [25], [7]. Such an integration typically leads to identity uncertainty, as different data sources may use different references to the same underlying real-world entities. Integration usually also introduces additional uncertainty on node attributes and edge existence. In this paper, we propose the notion of a probabilistic entity graph (PEG), a formal model that uniformly and systematically addresses these three types of uncertainty. A PEG is a probabilistic graph model that defines a distribution over possible graphs at the entity level. We introduce a general framework for constructing a PEG given uncertain data at the reference level and develop efficient algorithms to answer subgraph pattern matching queries in this setting. Our algorithms are based on two novel ideas: context-aware path indexing and reduction by join-candidates, which drastically reduce the query search space. A comprehensive experimental evaluation shows that our approach outperforms baseline implementations by orders of magnitude.
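To make the phrase "a distribution over possible graphs" concrete, the following is a minimal sketch of possible-world semantics for a toy uncertain graph. It is not the PEG model or the indexing algorithms of the paper: it assumes only edge-existence uncertainty with independent edges, ignores identity and attribute uncertainty, and answers a single-edge pattern by brute-force enumeration. All names (`possible_worlds`, `has_labeled_edge`, `match_probability`) and the example data are hypothetical.

```python
from itertools import product


def possible_worlds(edge_probs):
    """Yield (present_edges, probability) for every subset of the uncertain edges.

    Assumption: edges exist independently with the given probabilities.
    """
    edges = list(edge_probs)
    for mask in product([False, True], repeat=len(edges)):
        prob = 1.0
        present = set()
        for edge, keep in zip(edges, mask):
            p = edge_probs[edge]
            prob *= p if keep else 1.0 - p
            if keep:
                present.add(edge)
        yield present, prob


def has_labeled_edge(present, labels, label_u, label_v):
    """Return True if some present undirected edge joins a label_u node to a label_v node."""
    for u, v in present:
        pair = (labels[u], labels[v])
        if pair == (label_u, label_v) or pair == (label_v, label_u):
            return True
    return False


def match_probability(edge_probs, labels, label_u, label_v):
    """Sum the probabilities of all worlds that contain at least one match."""
    return sum(
        prob
        for present, prob in possible_worlds(edge_probs)
        if has_labeled_edge(present, labels, label_u, label_v)
    )


# Hypothetical toy data: one Person node and two Paper nodes.
labels = {"a": "Person", "b": "Paper", "c": "Paper"}
edge_probs = {("a", "b"): 0.5, ("a", "c"): 0.5, ("b", "c"): 0.9}

result = match_probability(edge_probs, labels, "Person", "Paper")
```

Enumeration visits 2^n worlds for n uncertain edges, which is only feasible for tiny inputs; that exponential blow-up is the reason pruning the query search space matters in this setting.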

[1] Jian Pei, et al. Aggregate queries on probabilistic record linkages, 2012, EDBT '12.

[2] Jianzhong Li, et al. Adding regular expressions to graph reachability and pattern queries, 2011, Frontiers of Computer Science.

[3] Philip S. Yu, et al. Graph Indexing: Tree + Delta >= Graph, 2007, VLDB.

[4] Nir Friedman, et al. Probabilistic Graphical Models - Principles and Techniques, 2009.

[5] Charu C. Aggarwal, et al. Discovering highly reliable subgraphs in uncertain graphs, 2011, KDD.

[6] Hannu Toivonen, et al. Finding reliable subgraphs from large probabilistic graphs, 2008, Data Mining and Knowledge Discovery.

[7] Philip S. Yu, et al. Graph indexing: a frequent structure-based approach, 2004, SIGMOD '04.

[8] Haixun Wang, et al. Efficient Subgraph Similarity Search on Large Probabilistic Graph Databases, 2012, Proc. VLDB Endow.

[9] Christopher Ré, et al. Large-Scale Deduplication with Constraints Using Dedupalog, 2009, 2009 IEEE 25th International Conference on Data Engineering.

[10] Jianzhong Li, et al. Graph pattern matching, 2010, Proc. VLDB Endow.

[11] Wilfred Ng, et al. Fg-index: towards verification-free query processing on graph databases, 2007, SIGMOD '07.

[12] Jianzhong Li, et al. Finding top-k maximal cliques in an uncertain graph, 2010, 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010).

[13] Prasoon Goyal, et al. Probabilistic Databases, 2009, Encyclopedia of Database Systems.

[14] Dimitrios Skoutas, et al. Efficient discovery of frequent subgraph patterns in uncertain graph databases, 2011, EDBT/ICDT '11.

[15] Amol Deshpande, et al. Indexing correlated probabilistic databases, 2009, SIGMOD Conference.

[16] Ambuj K. Singh, et al. Graphs-at-a-time: query language and access methods for graph databases, 2008, SIGMOD Conference.

[17] Jennifer Widom, et al. Swoosh: a generic approach to entity resolution, 2008, The VLDB Journal.

[18] Claudia Niederée, et al. On-the-fly entity-aware query processing in the presence of linkage, 2010, Proc. VLDB Endow.

[19] Zoran Majkic, et al. Probabilistic RDF, 2006, 2006 IEEE International Conference on Information Reuse & Integration.

[20] Chengfei Liu, et al. Query Evaluation on Probabilistic RDF Databases, 2009, WISE.

[21] Haixun Wang, et al. Distance-Constraint Reachability Computation in Uncertain Graphs, 2011, Proc. VLDB Endow.

[23] Haixun Wang, et al. Efficient subgraph search over large uncertain graphs, 2011, Proc. VLDB Endow.

[24] Dennis Shasha, et al. Algorithmics and applications of tree and graph searching, 2002, PODS.

[25] Lei Chen, et al. Continuous Subgraph Pattern Search over Certain and Uncertain Graph Streams, 2010, IEEE Transactions on Knowledge and Data Engineering.

[26] Lei Zou, et al. DistanceJoin: Pattern Match Query In a Large Graph Database, 2009, Proc. VLDB Endow.

[27] Xiang Lian, et al. Efficient query answering in probabilistic RDF graphs, 2011, SIGMOD '11.

[28] Jianzhong Li, et al. Mining Frequent Subgraph Patterns from Uncertain Graph Data, 2010, IEEE Transactions on Knowledge and Data Engineering.

[29] Albert, et al. Emergence of scaling in random networks, 1999, Science.

[30] Estevam R. Hruschka, et al. Toward an Architecture for Never-Ending Language Learning, 2010, AAAI.

[31] Lise Getoor, et al. Declarative analysis of noisy information networks, 2011, 2011 IEEE 27th International Conference on Data Engineering Workshops.

[32] Ashwin Machanavajjhala, et al. Entity Resolution: Theory, Practice & Open Challenges, 2012, Proc. VLDB Endow.

[33] George Kollios, et al. k-nearest neighbors in uncertain graphs, 2010, Proc. VLDB Endow.

[34] Ambuj K. Singh, et al. Closure-Tree: An Index Structure for Graph Queries, 2006, 22nd International Conference on Data Engineering (ICDE'06).

[35] Oren Etzioni, et al. Open Information Extraction from the Web, 2007, CACM.

[36] Shijie Zhang, et al. GADDI: distance index based subgraph matching in biological networks, 2009, EDBT '09.

[37] Gerhard Weikum, et al. YAGO: A Core of Semantic Knowledge, 2007, WWW 2007.

[38] Lise Getoor, et al. PrDB: managing and exploiting rich correlations in probabilistic databases, 2009, The VLDB Journal.

[39] Jiawei Han, et al. On graph query optimization in large networks, 2010, Proc. VLDB Endow.

[40] Ben Taskar, et al. Learning Probabilistic Models of Link Structure, 2003, J. Mach. Learn. Res.